rabix / bunny

[Legacy] Executor for CWL workflows. Executes sbg:draft-2 and CWL 1.0
http://rabix.io
Apache License 2.0

bunny with TES backend does not handle scatter #382

Open kmhernan opened 6 years ago

kmhernan commented 6 years ago

When running with the TES backend (funnel) and a workflow that uses scatter, nothing actually seems to be submitted to the TES server. I have provided the DEBUG log of the run below; after the last line, bunny just hangs.

[2017-11-09 17:28:03.804] [DEBUG] DAGCache initialized with size=16
[2017-11-09 17:28:03.816] [DEBUG] ApplicationCache initialized with size=16
[2017-11-09 17:28:04.268] [INFO] Backend f5ed244a-da69-4067-bcb7-859d7e70e613 registered.
[2017-11-09 17:28:04.291] [DEBUG] Start Job Job [id=null, parentId=null, rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, name=null, status=PENDING, message=null, config=null, inputs={terms=[hello01, hello02, hello03, hello04]}, outputs=null]
[2017-11-09 17:28:04.449] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.457] [INFO] Job null rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 started
[2017-11-09 17:28:04.459] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.524] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.524] [DEBUG] Cache size=1
[2017-11-09 17:28:04.527] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.528] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.529] [DEBUG] Cache size=1
[2017-11-09 17:28:04.529] [DEBUG] JobRecord root. Decrementing port terms.
[2017-11-09 17:28:04.529] [DEBUG]
Job root input counters:
 -- Input port terms, counter=0

[2017-11-09 17:28:04.529] [DEBUG]
Job root output counters:

[2017-11-09 17:28:04.529] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.530] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.530] [DEBUG] Cache size=1
[2017-11-09 17:28:04.530] [DEBUG]  --- JobRecord root is ready. Job isBlocking=false
 --- Input port terms, isScatter=false, isBlocking false
 --- All scatter ports []

[2017-11-09 17:28:04.531] [DEBUG]
 -- JobRecord root.run_load_contents, isBlocking false
 -- Input port term, isScatter true
 -- Output port json_file, isScatter false
 -- Output port out_value, isScatter false

[2017-11-09 17:28:04.534] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.534] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.534] [DEBUG] Cache size=1
[2017-11-09 17:28:04.534] [DEBUG] JobRecord root.run_load_contents. Decrementing port term.
[2017-11-09 17:28:04.534] [DEBUG]
Job root.run_load_contents input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.534] [DEBUG]
Job root.run_load_contents output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.535] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.535] [DEBUG] Cache size=1
[2017-11-09 17:28:04.537] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.537] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.537] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.537] [DEBUG] Cache size=1
[2017-11-09 17:28:04.537] [DEBUG] JobRecord root.run_load_contents.1. Decrementing port term.
[2017-11-09 17:28:04.538] [DEBUG]
Job root.run_load_contents.1 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.538] [DEBUG]
Job root.run_load_contents.1 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.538] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.538] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.538] [DEBUG] Cache size=1
[2017-11-09 17:28:04.538] [DEBUG]  --- JobRecord root.run_load_contents.1 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.538] [DEBUG]
 -- JobRecord root.run_load_contents.1.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.539] [DEBUG]
 -- JobRecord root.run_load_contents.1.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.540] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.540] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.541] [DEBUG] Cache size=1
[2017-11-09 17:28:04.541] [DEBUG] JobRecord root.run_load_contents.1.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.541] [DEBUG]
Job root.run_load_contents.1.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.541] [DEBUG]
Job root.run_load_contents.1.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.542] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.542] [DEBUG] Cache size=1
[2017-11-09 17:28:04.568] [DEBUG]
 ---- JobRecord root.run_load_contents.1

[2017-11-09 17:28:04.568] [INFO] Container job root.run_load_contents.1 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.568] [DEBUG] onJobRootReady(jobId=18b4980c-72e8-40aa-9c3a-7e41bfd73a06)
[2017-11-09 17:28:04.569] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.569] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.569] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.569] [DEBUG] Cache size=1
[2017-11-09 17:28:04.569] [DEBUG] JobRecord root.run_load_contents.2. Decrementing port term.
[2017-11-09 17:28:04.569] [DEBUG]
Job root.run_load_contents.2 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.569] [DEBUG]
Job root.run_load_contents.2 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.569] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.570] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.570] [DEBUG] Cache size=1
[2017-11-09 17:28:04.570] [DEBUG]  --- JobRecord root.run_load_contents.2 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.570] [DEBUG]
 -- JobRecord root.run_load_contents.2.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.571] [DEBUG]
 -- JobRecord root.run_load_contents.2.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.572] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.572] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.572] [DEBUG] Cache size=1
[2017-11-09 17:28:04.572] [DEBUG] JobRecord root.run_load_contents.2.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.572] [DEBUG]
Job root.run_load_contents.2.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.572] [DEBUG]
Job root.run_load_contents.2.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.572] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.573] [DEBUG] Cache size=1
[2017-11-09 17:28:04.574] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

[2017-11-09 17:28:04.574] [INFO] Container job root.run_load_contents.2 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.575] [DEBUG] onJobRootReady(jobId=e297562f-5384-44f4-b47a-045a84e70283)
[2017-11-09 17:28:04.575] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.575] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.576] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.576] [DEBUG] Cache size=1
[2017-11-09 17:28:04.576] [DEBUG] JobRecord root.run_load_contents.3. Decrementing port term.
[2017-11-09 17:28:04.576] [DEBUG]
Job root.run_load_contents.3 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.576] [DEBUG]
Job root.run_load_contents.3 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.576] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.576] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.576] [DEBUG] Cache size=1
[2017-11-09 17:28:04.576] [DEBUG]  --- JobRecord root.run_load_contents.3 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.577] [DEBUG]
 -- JobRecord root.run_load_contents.3.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.577] [DEBUG]
 -- JobRecord root.run_load_contents.3.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.578] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.579] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.579] [DEBUG] Cache size=1
[2017-11-09 17:28:04.579] [DEBUG] JobRecord root.run_load_contents.3.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.579] [DEBUG]
Job root.run_load_contents.3.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.579] [DEBUG]
Job root.run_load_contents.3.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.579] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.579] [DEBUG] Cache size=1
[2017-11-09 17:28:04.582] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

[2017-11-09 17:28:04.583] [INFO] Container job root.run_load_contents.3 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.583] [DEBUG] onJobRootReady(jobId=d93de3c5-5568-4f2a-9423-0ebdebdd5486)
[2017-11-09 17:28:04.584] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.584] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.584] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.584] [DEBUG] Cache size=1
[2017-11-09 17:28:04.585] [DEBUG] JobRecord root.run_load_contents.4. Decrementing port term.
[2017-11-09 17:28:04.585] [DEBUG]
Job root.run_load_contents.4 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.585] [DEBUG]
Job root.run_load_contents.4 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.585] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.586] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.586] [DEBUG] Cache size=1
[2017-11-09 17:28:04.586] [DEBUG]  --- JobRecord root.run_load_contents.4 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.586] [DEBUG]
 -- JobRecord root.run_load_contents.4.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.587] [DEBUG]
 -- JobRecord root.run_load_contents.4.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.589] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.589] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.589] [DEBUG] Cache size=1
[2017-11-09 17:28:04.590] [DEBUG] JobRecord root.run_load_contents.4.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.590] [DEBUG]
Job root.run_load_contents.4.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.590] [DEBUG]
Job root.run_load_contents.4.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.591] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.591] [DEBUG] Cache size=1
[2017-11-09 17:28:04.592] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

 ---- JobRecord root.run_load_contents.4

[2017-11-09 17:28:04.593] [INFO] Container job root.run_load_contents.4 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.593] [DEBUG] onJobRootReady(jobId=bea60dc2-be00-456d-9fb5-8dd00f7b95f9)
[2017-11-09 17:28:04.593] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.593] [DEBUG] Cache size=1
[2017-11-09 17:28:04.613] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

 ---- JobRecord root.run_load_contents.4

 ---- JobRecord root

[2017-11-09 17:28:04.614] [INFO] Container job root rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.614] [DEBUG] onJobRootReady(jobId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173)
[2017-11-09 17:28:04.614] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryJobRecordRepository@6052b831 flushed 14 item(s). Cache hits 53.
[2017-11-09 17:28:04.615] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryVariableRecordRepository@45de06b0 flushed 38 item(s). Cache hits 25.
[2017-11-09 17:28:04.615] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryLinkRecordRepository@75c73d5e flushed 31 item(s). Cache hits 5.

milos-ljubinkovic commented 6 years ago

Can you upload the workflow file you tried? Thanks

kmhernan commented 6 years ago

Here is the packed version. The input JSON looks like:

{
    "terms": ["hello01", "hello02", "hello03", "hello04"]
}

scatter_test.packed.cwl.gz
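
For readers following along, here is a rough, hand-reconstructed sketch of what a workflow of this shape might look like, based only on the port names in the DEBUG log above; the actual attached scatter_test.packed.cwl.gz may differ:

cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  terms: string[]
outputs:
  json_files:                      # illustrative output name
    type: File[]
    outputSource: run_load_contents/json_file
steps:
  run_load_contents:
    run: contents_workflow.cwl     # file name taken from a later comment; a subworkflow with steps run_get_uuid and run_load_contents
    scatter: term
    in:
      term: terms
    out: [json_file, out_value]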

milos-ljubinkovic commented 6 years ago

This is an issue we noticed recently: rabix has trouble starting jobs that have no defined input ports. I've created a branch with a quick fix and will be releasing 1.0.4 prerelease binaries soon that contain this and some other fixes.

(Also, rabix had some issues unpacking your workflow; I will investigate and create a separate issue.)

kmhernan commented 6 years ago

That's interesting, because I am pretty sure I ran the standalone tool with bunny, but maybe I was using a jar from that branch. Thanks for the info. Let me know if the issue with the workflow is my own error and I can patch it. I do know that creating a file in the working directory and loading its contents works with rabix/funnel; it was only once I added the scatter that I ran into problems.

kmhernan commented 6 years ago

OK @milos-ljubinkovic, I tried running the unpacked workflow with some slight adjustments so that there are no null inputs or baseCommands; however, I get a Jackson-based exception (full logs are attached as bugfix.noinputs.orig_jackson.logs.gz):

java.util.concurrent.ExecutionException: java.lang.IllegalStateException: com.fasterxml.jackson.databin
 at [Source: N/A; line: -1, column: -1]

So I rebuilt bunny with an updated Jackson (2.9.2) and I no longer get that error; instead there is an error about the engine not being able to glob the output file (my current use case doesn't have a shared fs but does use SLURM). I have attached these logs as bugfix.noinputs.updated_jackson.logs.gz. What is interesting is that the same workflow without the scatter runs fine. To be helpful, I have attached a tar.gz containing the workflows: scatter_test.cwl is the scattered one, with an example input JSON in scatter.inputs.json; contents_workflow.cwl is the workflow that scatter_test.cwl scatters, and it completes successfully using the basic.inputs.json file.

bugfix.noinputs.orig_jackson.logs.gz bugfix.noinputs.updated_jackson.logs.gz test_cwl.tar.gz

Edit: just noting that I added some trivial placeholder inputs and a baseCommand (e.g., ls) to see whether that was the issue, since I had them as more or less null before.

milos-ljubinkovic commented 6 years ago

error about the engine not being able to glob the output file

Could you check the funnel logs after this collection fails? I'm interested in where the output files are uploaded. This might be a bug where rabix is looking for them in the wrong folder or something.

There should be something like this in the funnel logs (it might depend on the log level in your funnel config):

local                Starting upload
attempt              0
index                0
taskID               b84p40blt51hgl189heg
time                 2017-11-13T13:37:28+01:00
timestamp            2017-11-13T12:37:28.042344Z
url                  file:///tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid

....

local                TASK_OUTPUTS
attempt              0
index                0
outputs              []*tes.OutputFileLog{
    &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", SizeBytes:0},
    &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", SizeBytes:36},
}

It might be that the url param in the "Starting upload" log entry is, for some reason, not inside the directory defined as rabix.tes.storage.base in the rabix config, or that the "uuid.txt" in the "TASK_OUTPUTS" log was never created.
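
(For context, rabix.tes.storage.base is a bunny configuration key; a minimal sketch, assuming bunny's properties-style config files and an illustrative path:)

# shared directory between the rabix executor and the funnel server
rabix.tes.storage.base=/tmp/tes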

buchanae commented 6 years ago

I haven't fully grokked the issue here, but I wanted to note that we recently had a bug with the output file log in Funnel: https://github.com/ohsu-comp-bio/funnel/issues/290. Not sure it applies here.

kmhernan commented 6 years ago

@milos-ljubinkovic could this be due to the fact that our SLURM cluster is running on nodes without a shared fs? Within each node rabix.tes.storage.base exists, but across nodes it is obviously not the same directory, since it isn't on a shared fs.

milos-ljubinkovic commented 6 years ago

I don't know how funnel handles that situation. We just declare a shared directory (rabix.tes.storage.base) between the funnel server and the rabix executor, because rabix has to do some postprocessing on the outputs after execution; I have no idea how data transfer between SLURM nodes and the funnel server works.

If funnel skipped the upload or hit an error during it, then this is the issue; but if it succeeded in uploading the outputs (anywhere, maybe a different directory than rabix.tes.storage.base) after each SLURM node completed, then it is a different one.

buchanae commented 6 years ago

If you set rabix.tes.storage.base to a bucket storage location, e.g. s3://rabix-bucket/shared, is rabix able to use that?

kmhernan commented 6 years ago

I'm having all sorts of issues with the S3 backend using the branch of bunny that I have been working on. I did go through and add the META-INF directory to get rid of the provider errors, but it still isn't able to communicate. It is important to note that this is a Cleversafe S3 store and not AWS.

milos-ljubinkovic commented 6 years ago

I think I never got the S3 URLs to work in funnel when they contain access points, so rabix just defaults to AWS's default S3 host.

@buchanae Have you tested S3 support with other hosts, and what should the URLs look like? It might be that I just used the wrong form.

kmhernan commented 6 years ago

@milos-ljubinkovic it seems like bunny is adding a null to the S3 URI:

local                System error
attempt              0
error                Output upload not supported by storage: name:"run_get_uuid" url:"s3://null/bioinformatics_scratch/kmhernan/bunny/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" path:"/home/ubuntu/funnel-work-dir/b85ocllv48vjfgd4hsjg/home/ubuntu/kmhernan/custom_rabix/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" type:DIRECTORY
index                0
taskID               b85ocllv48vjfgd4hsjg
time                 2017-11-15T00:12:07Z
timestamp            2017-11-15T00:12:07.019349748Z

buchanae commented 6 years ago

Funnel doesn't support endpoints for S3 yet, and we're not sure Funnel's S3 client will work outside of AWS. https://github.com/ohsu-comp-bio/funnel/issues/338

kmhernan commented 6 years ago

@buchanae do you know if the null added to the URI comes from the funnel side? e.g., s3://null/

adamstruck commented 6 years ago

@kmhernan funnel does not modify the URIs that are given to it. That null must be coming from bunny.

milos-ljubinkovic commented 6 years ago

Bunny is adding that because I hastily edited the S3 lib I use so that it generates URLs without endpoints, which is what funnel supports. When funnel gets support for endpoints I'll revert it.

In the meantime I guess I could remove the null, but funnel would still look for the file on the default Amazon AWS endpoint, so it wouldn't matter in this case.

If you have files that are on Amazon AWS and declare them in the task inputs without the endpoint, they should be sent to funnel without the null.

adamstruck commented 6 years ago

@milos-ljubinkovic @kmhernan I am working on extending our s3 support. I am hoping to get a PR up by tomorrow that you could test.

adamstruck commented 6 years ago

@kmhernan @milos-ljubinkovic -> https://github.com/ohsu-comp-bio/funnel/pull/356

As I note in the PR there are still some outstanding issues I need to address. But this should unblock further testing / development.

kmhernan commented 6 years ago

Great, @adamstruck. @milos-ljubinkovic, just to let you know, feel free to tag me in something to test on our Ceph/Cleversafe S3 stores when you get a chance.

adamstruck commented 6 years ago

@milos-ljubinkovic funnel was updated to use the new TES v0.3 spec recently. To use these new features we will need to update the TES model in bunny. Let me know if you would like help with that.

kmhernan commented 6 years ago

@milos-ljubinkovic I made some changes locally to test the newest funnel release and attached the diffs here. Pretty basic, but not heavily tested. tes_update_diffs.txt

milos-ljubinkovic commented 6 years ago

I've updated the bunny TES models to v0.3 and have been trying to test the generic S3 support from your pull request branch, but I can't get the URLs with endpoints to work.

Is the config and file URL supposed to look like this, or do I have the wrong format?

Worker:
  Storage:
    S3:
      AWS:
        Endpoint: s3.amazonaws.com
        Key: key
        Secret: secret

"url": "s3://s3.amazonaws.com/testbunny/input.txt"

I can generate file URLs without the endpoint and it works, but is this a valid solution?

adamstruck commented 6 years ago

The config isn't correct. Please look at the new docs:

https://github.com/adamstruck/funnel/blob/3370b1df2271886e56a9a883535aea196592f408/website/content/docs/storage/aws-s3.md

Sorry about the confusing config structure. It's something we are working on.
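
For reference, a hedged sketch of the AmazonS3/GenericS3 storage sections as they appear in later funnel configs; the exact layout in this PR branch may differ, so defer to the linked aws-s3.md:

AmazonS3:
  Disabled: false
  Key: ""
  Secret: ""

GenericS3:
  - Disabled: false
    Endpoint: "s3.example.com"   # non-AWS endpoint, e.g. Cleversafe or Ceph
    Key: ""
    Secret: ""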

kmhernan commented 6 years ago

@adamstruck It seems like the S3 settings in the config file aren't propagated to the worker node...

This is the S3 section I see on the worker node, which does not match my input config:

    S3:
      FromEnv: false
      Key: ""
      Secret: ""

adamstruck commented 6 years ago

The config isn't propagated to the worker node unless there is a shared file system. You can put the config on the worker and modify the submit template to point to that file.
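
A hedged sketch of the kind of submit-template change being described; template variables and worker flags vary across funnel versions, so check the default SLURM template that ships with your funnel build:

#!/bin/bash
#SBATCH --job-name {{.TaskId}}
#SBATCH --ntasks 1
# point the worker at a config file that exists locally on the node
funnel worker run --config /opt/funnel/worker.config.yml --taskID {{.TaskId}}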

I'll see what I can do to make this more straightforward when I get back next week.

Also, I'll need to investigate how your worker got that config structure, since none of the storage backends in this branch use the 'fromEnv' keyword.

kmhernan commented 6 years ago

@adamstruck my test of funnel from your fork with my Cleversafe endpoint was successful as long as I didn't include the http:// prefix on the endpoint. @milos-ljubinkovic for the file URL I did not include the endpoint; e.g., my URL was "s3://bioinformatics_scratch/kmhernan/meta.txt". I hope this helps.
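
In other words, the combination that worked here looks roughly like this (key layout assumed as in the docs linked above):

GenericS3:
  - Endpoint: "cleversafe.example.org"   # no http:// prefix
    Key: "key"
    Secret: "secret"

"url": "s3://bioinformatics_scratch/kmhernan/meta.txt"   # bucket/key only, no endpoint in the URL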

milos-ljubinkovic commented 6 years ago

Added support for S3 endpoints in the latest bunny release. The lib I used had some issues with Google Storage's S3 compatibility due to differences in the request parameters, so I'm not sure it will work with IBM's implementation.

adamstruck commented 6 years ago

Scatter seems to be working for me. Whether or not all tasks are started simultaneously is determined by the pool size, which can now be configured (rabix.tes.task-thread-pool-size).
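
For example, in bunny's properties-style config (the key name is taken from the comment above; the value is illustrative):

# run up to 8 TES tasks concurrently
rabix.tes.task-thread-pool-size=8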