treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

"Text file busy" errors occur when running shell script via "sh>" #1830

Closed (MxShun closed this issue 2 months ago)

MxShun commented 3 months ago

Description

When running the shell script cmd/batch_app.sh, "Text file busy" errors occur, as shown below.

2024-04-04 23:50:00.958 +0000 [INFO] (0417@[0:batch:58677:58677]+batch.import+import) io.digdag.core.agent.OperatorManager: sh>: cmd/batch_app.sh import
.digdag/tmp/digdag-sh-122635-8641042193224341649/runner.sh: 1: .digdag/tmp/digdag-sh-122635-8641042193224341649/runner.sh: cmd/batch_app.sh: Text file busy
2024-04-04 23:50:01.067 +0000 [ERROR] (0417@[0:batch:58677:58677]+batch.import+import) io.digdag.core.agent.OperatorManager: Task failed with unexpected error: Command failed with code 2
java.lang.RuntimeException: Command failed with code 2
    at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runCode(ShOperatorFactory.java:121)
    at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:88)
    at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
    at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:399)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invokeMain(DigdagTimedMethodInterceptor.java:58)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invoke(DigdagTimedMethodInterceptor.java:31)
    at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:308)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invokeMain(DigdagTimedMethodInterceptor.java:58)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invoke(DigdagTimedMethodInterceptor.java:31)
    at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:152)
    at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:75)
    at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:150)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invokeMain(DigdagTimedMethodInterceptor.java:58)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invoke(DigdagTimedMethodInterceptor.java:31)
    at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:133)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invokeMain(DigdagTimedMethodInterceptor.java:58)
    at io.digdag.server.metrics.DigdagTimedMethodInterceptor.invoke(DigdagTimedMethodInterceptor.java:31)
    at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:132)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Workflow definition

timezone: Asia/Tokyo

schedule:
  minutes_interval>: 10
  skip_on_overtime: true

+import:
  sh>: cmd/batch_app.sh import

We typically have multiple workflows executing batch_app.sh concurrently. The error occurs only intermittently, so we cannot reproduce it reliably. Do you know what could cause it?

Environment

Digdag: v0.10.5
JDK: openjdk:8u282-jre-slim
OS: Linux x86_64

toru-takahashi commented 2 months ago

The "Text file busy" message comes from Linux. It is raised when something attempts to overwrite an executable file that a running process is still using. Running Digdag sh> tasks in parallel is not safe if they access the same files concurrently, and overwriting the executable of a running process can result in unexpected behavior. The cause lies outside the Digdag system.
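
For reference, the general cause can be reproduced outside Digdag. The following is a minimal sketch, assuming a Linux kernel that enforces ETXTBSY on exec and a POSIX sh. The file name demo.sh and the use of file descriptor 3 are made up for illustration and are not taken from the reporter's setup; the exact exit status of the refused run depends on the shell.

```shell
#!/bin/sh
# Hypothetical reproduction: execute a script while a write handle to it is still open.
workdir=$(mktemp -d)
script="$workdir/demo.sh"

printf '#!/bin/sh\necho "demo ran"\n' > "$script"
chmod +x "$script"

# Keep the script open for appending on fd 3, as a still-running writer would.
exec 3>>"$script"

# On Linux this exec is normally refused with "Text file busy" (ETXTBSY).
busy_status=0
"$script" || busy_status=$?
echo "while open for writing: exit status $busy_status"

# Release the write handle; the same script should now run normally.
exec 3>&-
free_status=0
"$script" || free_status=$?
echo "after closing the writer: exit status $free_status"

rm -rf "$workdir"
```

If the first run reports a non-zero status and the second reports 0, some process had the script open for writing at the moment it was executed, which matches the kind of conflict described above.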

MxShun commented 2 months ago

@toru-takahashi Thank you for replying, and sorry for my misunderstanding. batch_app.sh is not executed concurrently, because these executable files are copied to a temporary directory. Therefore, the same executable file (at the same path) never runs in different sessions, right?

toru-takahashi commented 2 months ago

For the shell operator itself, yes. I don't know what your shell script does, so I described the general cause of the error. At a minimum, I recommend adding some logging to the shell script so that you can identify where the error occurs next time.
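
As one possible way to add such logging, here is a minimal sketch for a script like cmd/batch_app.sh. The helper names log and run_step are hypothetical, and the two commands at the end are placeholders standing in for the real batch steps, which are not shown in this issue.

```shell
#!/bin/sh
# Hypothetical logging helpers; not part of Digdag or of the reporter's script.

log() {
    # Write a UTC-timestamped message to stderr, tagged with this shell's PID.
    printf '%s [pid %s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$$" "$*" >&2
}

run_step() {
    # Run a command, logging its start, its end, and its exit status.
    log "start: $*"
    step_status=0
    "$@" || step_status=$?
    log "end: $* (exit status $step_status)"
    return "$step_status"
}

# Example usage with placeholder commands.
run_step true
last_status=0
run_step sh -c 'exit 3' || last_status=$?
log "last step exited with $last_status"
```

Because every step is logged with a timestamp and PID, the next "Text file busy" failure can be matched to the step that triggered it and compared with the logs of other sessions running at the same time.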

MxShun commented 2 months ago

@toru-takahashi I'll investigate by adding some logging. Thank you!