treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

ECS Command Executor fails if task continues for 10 minutes or more. #1756

Closed rariyama closed 1 year ago

rariyama commented 2 years ago

I encountered this issue when executing a task that runs for 10 minutes or more on the ECS Command Executor. The file named archive-output.tar.gz should be uploaded to S3 after execution finishes, but it actually wasn't. I observed this with both the python and shell operators: the python task failed while the shell task succeeded, because only the python operator checks whether the output file exists.

The log and sample resources such as the workflow are below (parts of the values are masked).

The log output on CloudWatch:

<Error><Code>AccessDenied</Code><Message>Request has expired</Message><X-Amz-Expires>599</X-Amz-Expires><Expires>2022-07-09T12:57:08Z</Expires><ServerTime>2022-07-09T12:57:53Z</ServerTime><RequestId>123456789</RequestId><HostId>123456789=</HostId></Error>--123456789--
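For reference, the gap between Expires and ServerTime in the error above can be computed directly; a minimal sketch using the timestamps copied from the log:

```python
from datetime import datetime, timezone

# Timestamps taken from the S3 "Request has expired" error above.
expires = datetime(2022, 7, 9, 12, 57, 8, tzinfo=timezone.utc)
server_time = datetime(2022, 7, 9, 12, 57, 53, tzinfo=timezone.utc)

# The pre-signed URL was valid for X-Amz-Expires=599 seconds (just under
# 10 minutes), so any upload attempted after that window is rejected
# with AccessDenied.
overshoot = (server_time - expires).total_seconds()
print(f"URL expired {overshoot:.0f} seconds before the upload attempt")
```

This shows the upload was attempted 45 seconds after the pre-signed URL had already expired, which matches a task that ran slightly longer than the 10-minute signing window.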

The python operator log, obtained with the digdag log command:

2022-07-09 21:58:38 +0900 [ERROR] (0020@[0:default]+Example): Task failed, retrying
io.digdag.spi.TaskExecutionException: java.lang.RuntimeException: output.json does not exist. Something unexpected error happened. Please check logs.
        at io.digdag.spi.TaskExecutionException.ofNextPollingWithCause(TaskExecutionException.java:85)
        at io.digdag.util.BaseOperator.run(BaseOperator.java:56)
        at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:399)
        at io.digdag.cli.Run$OperatorManagerWithSkip.callExecutor(Run.java:709)
        at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:308)
        at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:152)
        at io.digdag.core.agent.LocalWorkspaceManager.withExtractedArchive(LocalWorkspaceManager.java:25)
        at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:150)
        at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:133)
        at io.digdag.cli.Run$OperatorManagerWithSkip.run(Run.java:691)
        at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:132)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: output.json does not exist. Something unexpected error happened. Please check logs.
        at io.digdag.standards.operator.PyOperatorFactory$PyOperator.runCode(PyOperatorFactory.java:186)
        at io.digdag.standards.operator.PyOperatorFactory$PyOperator.runTask(PyOperatorFactory.java:115)
        at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
        ... 14 common frames omitted

The example workflow:

+Example:
  +Py:
    _export:
      ecs:
        task_definition_arn: "arn:aws:ecs:ap-northeast-1:123456789:task-definition/example"
    py>: example.Sample.execute # a process which can run for a long time

example.py

import time

class Sample:
    def execute(self):
        time.sleep(600)

The example digdag run command: digdag run ./Example.dig +Example+Py --session-time $(date +'%Y-%m-%d')

The settings in the property file:

agent.command_executor.type=ecs
agent.command_executor.ecs.name=Example
agent.command_executor.ecs.Example.access_key_id=SampleKey
agent.command_executor.ecs.Example.secret_access_key=SampleSecret
agent.command_executor.ecs.Example.launch_type=FARGATE
agent.command_executor.ecs.Example.region=ap-northeast-1
agent.command_executor.ecs.Example.subnets=SampleSubnet
agent.command_executor.ecs.Example.max_retries=10

agent.command_executor.ecs.temporal_storage.type=s3
agent.command_executor.ecs.temporal_storage.s3.bucket=SampleBucket
agent.command_executor.ecs.temporal_storage.s3.endpoint=s3.amazonaws.com
agent.command_executor.ecs.temporal_storage.s3.credentials.access-key-id=SampleKey
agent.command_executor.ecs.temporal_storage.s3.credentials.secret-access-key=SampleSecret

eval.js-engine-type=graal

I tried to find the cause but couldn't. Please let me know the reason, or an effective solution if you know one.

rariyama commented 2 years ago

I was able to solve this problem by setting this variable to a value larger than 10 minutes. Specifically, I added the direct_upload_expiration parameter to the property file and set it to a value larger than 600 seconds.

agent.command_executor.type=ecs
agent.command_executor.ecs.name=Example
agent.command_executor.ecs.Example.access_key_id=SampleKey
agent.command_executor.ecs.Example.secret_access_key=SampleSecret
agent.command_executor.ecs.Example.launch_type=FARGATE
agent.command_executor.ecs.Example.region=ap-northeast-1
agent.command_executor.ecs.Example.subnets=SampleSubnet
agent.command_executor.ecs.Example.max_retries=10

agent.command_executor.ecs.temporal_storage.type=s3
agent.command_executor.ecs.temporal_storage.s3.bucket=SampleBucket
agent.command_executor.ecs.temporal_storage.s3.endpoint=s3.amazonaws.com
agent.command_executor.ecs.temporal_storage.s3.credentials.access-key-id=SampleKey
agent.command_executor.ecs.temporal_storage.s3.credentials.secret-access-key=SampleSecret
agent.command_executor.ecs.temporal_storage.s3.direct_upload_expiration=660 # longer than 600

eval.js-engine-type=graal
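The relationship between the task duration and the expiration setting can be sketched as a simple check. The helper below is hypothetical (not part of Digdag); it just illustrates picking a direct_upload_expiration value that outlives the longest-running task, with a safety margin:

```python
# Hypothetical helper (not a Digdag API): choose a pre-signed-URL
# expiration that safely exceeds the longest expected task duration.
def required_expiration(max_task_seconds: int, margin_seconds: int = 60) -> int:
    """Return an expiration (seconds) larger than the task duration."""
    return max_task_seconds + margin_seconds

# The sample task sleeps for 600 seconds, so 660 (as set above) suffices.
print(required_expiration(600))
```

The 60-second margin is an assumption chosen here for illustration; any value that keeps the expiration above the worst-case task runtime works.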

I think it would be better to describe this in the documentation for the ECS Command Executor, because this value affects any process that can run for 10 minutes or more. If this proposal sounds reasonable, I'll create a PR to add the description.