treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

Directory extraction task in workflow extracts unexpected items #1816

Open ishizuka-mihoko opened 9 months ago

ishizuka-mihoko commented 9 months ago

I have a question about Digdag workflows. In a .dig file, I have the following configuration:

_export:
  plugin:
    repositories:
      - https://jitpack.io/
    dependencies:
      - com.github.takemikami:digdag-plugin-shresult:0.0.3

+find_dirs:
  sh_result>: |
    find ./ -maxdepth 1 -type d -exec basename {} \; | grep -v "^.$" | grep -E '^\[' | sort -u | tr '\n' ',' | sed 's/,$//'
  destination_variable: dirs
  stdout_format: text

+call_dig:
  for_each>:
    dir: "${dirs.split(',')}"
  _parallel: true
  _do:
    +loop_dig:
      call>: ${dir}/unload.dig

In the ‘+find_dirs’ task, I am trying to collect the names of directories that start with ‘[’ and store them in the ‘dirs’ variable. However, in some Digdag runs, something other than those directory names ends up in ‘dirs’. Could this be a bug in Digdag? I would appreciate your confirmation.

Below are the ‘dirs’ value I expect and two values that were stored by mistake:

Desired ‘dirs’: dirs: [foo1]bar1,[foo2]bar2,[foo3]bar3,[foo4]bar4

Actual ‘dirs’ (Example 1: the log line showing the command from the executed workflow’s own .dig file is stored):

dirs: >+ 
2023-09-14 21:45:48 +0000 [INFO] (1913@[0:test_wf]+test_coordinator+call_dig^sub^sub+find_dirs) io.digdag.core.agent.OperatorManager: sh_result>: find ./ -maxdepth 1 -type d -exec basename {} \; | grep -v "^.$" | grep -E '^\[' | sort -u | tr '\n' ',' | sed 's/,$//' 

Actual ‘dirs’ (Example 2: a log line from another project running at the same time is stored):

dirs: >- 
2023-09-07 09:15:02 +0000 [INFO] (5117@[0:cost_alert]+cost_change+scripts+notice) io.digdag.core.agent.OperatorManager: sh>: cd cost_alert python3 main.py -t increase [foo1]bar1,[foo2]bar2,[foo3]bar3,[foo4]bar4
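
For comparison, the list that +find_dirs is meant to produce can also be built with a plain shell glob instead of the find and grep pipeline. This is only a sketch under assumptions: list_bracket_dirs is a hypothetical helper name, it has not been tried inside sh_result>, and unlike find -type d it also picks up symlinks that point to directories. As a side note, grep -v "^.$" in the pipeline above drops any one-character name, because the unescaped dot matches any character; the later grep -E '^\[' already excludes the "." entry on its own.

```shell
#!/bin/sh
# list_bracket_dirs: hypothetical helper, not part of Digdag or the sh_result plugin.
# Prints the names of directories directly under $1 that start with '[',
# joined by commas, with no trailing comma or newline.
list_bracket_dirs() (
  cd "$1" || exit 1
  out=""
  for d in \[*/; do
    # If nothing matches, the pattern is left unexpanded; skip it.
    [ -d "$d" ] || continue
    name=${d%/}                # drop the trailing slash added by the glob
    out="${out:+$out,}$name"   # append, inserting a comma between entries
  done
  printf '%s' "$out"
)
```

Calling list_bracket_dirs . from the project directory prints a single comma-separated line in the same shape as the desired ‘dirs’ value.
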

hiroyuki-sato commented 9 months ago

Hello, @ishizuka-mihoko

To investigate the issue, we need details about your environment and the steps to reproduce it. I tried to reproduce the problem, but it seems to work well in my environment.

The sh_result> operator is not part of the Digdag project. It is a third-party plugin.

find * -type f -print
[foo1]/unload.dig
[foo2]/unload.dig
[foo3]/unload.dig
hoge/unload.dig
test.dig

_export:
  plugin:
    repositories:
      - https://jitpack.io/
    dependencies:
      - com.github.takemikami:digdag-plugin-shresult:0.0.3

+find_dirs:
  sh_result>: |
    find . -maxdepth 1 -type d -exec basename {} \; | grep -v "^.$" | grep -E '^\[' | sort -u | tr '\n' ',' | sed 's/,$//'
  destination_variable: dirs
  stdout_format: text

+call_dig:
  for_each>:
    dir: "${dirs.split(',')}"
  _parallel: true
  _do:
    +loop_dig:
      call>: ${dir}/unload.dig

cat */*.dig
+tasks1:
  echo>: foo1/unload.dig
+tasks1:
  echo>: foo2/unload.dig
+tasks1:
  echo>: foo3/unload.dig
+tasks1:
  echo>: hoge/unload.dig

digdag run -a test.dig
2023-09-22 00:09:24 +0900: Digdag v0.10.5
2023-09-22 00:09:24 +0900 [WARN] (main): Reusing the last session time 2023-09-21T00:00:00+00:00.
2023-09-22 00:09:24 +0900 [INFO] (main): Using session /private/tmp/hoge/.digdag/status/20230921T000000+0000.
2023-09-22 00:09:24 +0900 [INFO] (main): Starting a new session project id=1 workflow name=test session_time=2023-09-21T00:00:00+00:00
2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+find_dirs): sh_result>: find . -maxdepth 1 -type d -exec basename {} \; | grep -v "^.$" | grep -E '^\[' | sort -u | tr '\n' ',' | sed 's/,$//'

2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+call_dig): for_each>: {dir=["[foo1]","[foo2]","[foo3]"]}
2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+call_dig^sub+for-0=dir=0=%5Bfoo1%5D+loop_dig): call>: [foo1]/unload.dig
2023-09-22 00:09:25 +0900 [INFO] (0020@[0:default:1:1]+test+call_dig^sub+for-0=dir=1=%5Bfoo2%5D+loop_dig): call>: [foo2]/unload.dig
2023-09-22 00:09:25 +0900 [INFO] (0021@[0:default:1:1]+test+call_dig^sub+for-0=dir=2=%5Bfoo3%5D+loop_dig): call>: [foo3]/unload.dig
2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+call_dig^sub+for-0=dir=1=%5Bfoo2%5D+loop_dig^sub+tasks1): echo>: foo2/unload.dig
foo2/unload.dig
2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+call_dig^sub+for-0=dir=2=%5Bfoo3%5D+loop_dig^sub+tasks1): echo>: foo3/unload.dig
foo3/unload.dig
2023-09-22 00:09:25 +0900 [INFO] (0018@[0:default:1:1]+test+call_dig^sub+for-0=dir=0=%5Bfoo1%5D+loop_dig^sub+tasks1): echo>: foo1/unload.dig
foo1/unload.dig
Success. Task state is saved at /private/tmp/hoge/.digdag/status/20230921T000000+0000 directory.
  * Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
  * Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.
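
The directory layout used in this reproduction can be recreated with a small script. The sketch below is an assumption about how to set it up, not something taken from the original run: make_repro is a hypothetical helper, the target directory is whatever path is passed in, and test.dig still has to be copied in separately.

```shell
#!/bin/sh
# make_repro: hypothetical helper that recreates the layout listed above.
# $1 is the directory to create the layout in.
make_repro() {
  root=$1
  mkdir -p "$root" || return 1
  for d in '[foo1]' '[foo2]' '[foo3]' hoge; do
    mkdir -p "$root/$d" || return 1
    # The echoed label omits the brackets, matching the unload.dig files above.
    label=${d#\[}
    label=${label%\]}
    printf '+tasks1:\n  echo>: %s/unload.dig\n' "$label" > "$root/$d/unload.dig"
  done
}
```

After make_repro ./repro, placing test.dig in ./repro and running digdag run -a test.dig there should exercise the same workflow.
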

ishizuka-mihoko commented 9 months ago

Hello, @hiroyuki-sato. Thank you for your comment. Here is my environment:

digdag: 0.10.4
macOS: 12.6

> I tried to reproduce the problem, but it seems to work well in my environment.

It generally works well in my environment too, but occasionally it extracts incorrect information.

> The sh_result> operator is not part of the Digdag project. It is a third-party plugin.

The issue may be related to the plugin itself. Depending on what I find, I may reach out to the plugin's support for assistance.
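
Until the cause is found, one way to keep a polluted value from reaching call> is to check ‘dirs’ before the loop runs. The sketch below assumes that valid directory names start with ‘[’ and contain no whitespace or ‘/’; validate_dirs is a hypothetical helper, not something Digdag or the plugin provides, and how it would be wired into the workflow is left open.

```shell
#!/bin/sh
# validate_dirs: hypothetical helper, not part of Digdag or the sh_result plugin.
# Succeeds only if $1 is a non-empty comma-separated list in which every entry
# starts with '[' and contains no whitespace or '/'.
validate_dirs() (
  [ -n "$1" ] || exit 1
  IFS=,      # split the argument on commas only
  set -f     # disable globbing so '[' and '*' in entries are not expanded
  for entry in $1; do
    case $entry in
      *[[:space:]/]*) exit 1 ;;   # log lines contain spaces; paths contain '/'
      \[*) ;;                     # looks like an expected directory name
      *) exit 1 ;;                # anything else, including an empty entry
    esac
  done
  exit 0
)
```

A value like the log line in Example 2 fails this check, because its first entry starts with a timestamp rather than ‘[’, so the workflow could stop instead of calling an unintended path.
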