Open kmhernan opened 6 years ago
Can you upload the workflow file you tried? Thanks
Here is the packed version. The input JSON would look like:

```json
{
  "terms": ["hello01", "hello02", "hello03", "hello04"]
}
```
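For context, a workflow that scatters over an input array like this has roughly the following shape. This is a minimal hypothetical sketch, not the attached files: the step name, sub-workflow port id, and empty outputs are assumptions made for illustration.

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}

inputs:
  terms: string[]        # matches the "terms" array in the input JSON

outputs: []              # outputs omitted in this sketch

steps:
  run_each_term:         # hypothetical step name
    run: contents_workflow.cwl
    scatter: term        # one sub-workflow invocation per array element
    in:
      term: terms
    out: []
```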
This is an issue we noticed recently: rabix has trouble starting jobs that have no defined input ports. I've created a branch with a quick fix and will release 1.0.4 prerelease binaries soon that contain this and some other fixes.

(Also, rabix had some issues unpacking your wf; I will investigate and create a separate issue.)
That's interesting, because I'm pretty sure I ran the standalone tool with bunny, though maybe I was using a jar from that branch. Thanks for the info. Let me know if the issue with the wf is my own error and I can patch it. I do know that creating a file with workdir and loading its contents works with rabix/funnel, but once I added in the scatter I ran into problems.
Ok @milos-ljubinkovic, so I tried running the unpacked workflow with some slight adjustments to not have any null inputs or baseCommands; however, I get a Jackson-based exception (full logs are attached as `bugfix.noinputs.orig_jackson.logs.gz`):

```
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: com.fasterxml.jackson.databin
 at [Source: N/A; line: -1, column: -1]
```
So I rebuilt bunny with an updated Jackson (2.9.2) and no longer get that error; instead it is an error about the engine not being able to glob the output file (my current use case doesn't have a shared fs, but does use SLURM). I have attached these logs as `bugfix.noinputs.updated_jackson.logs.gz`. Now, what is interesting is that I can run the same workflow without scatter and it will work. To be helpful, I have attached a tar.gz containing the workflows: `scatter_test.cwl` is the scattered one, and an example input JSON is `scatter.inputs.json`. `contents_workflow.cwl` is the workflow that `scatter_test.cwl` scatters, and it completes successfully using the `basic.inputs.json` file.

Attachments: `bugfix.noinputs.orig_jackson.logs.gz`, `bugfix.noinputs.updated_jackson.logs.gz`, `test_cwl.tar.gz`
Edit: Just noting that I added some silly, meaningless inputs and a baseCommand (e.g., `ls`) to see if that was the issue, since I previously had them as null.
more or less

> error about the engine not being able to glob the output file

Could you check the funnel logs after this collection fails? I'm interested in where the output files are uploaded. This might be a bug where rabix is looking for them in the wrong folder or something.

There should be something like this in the funnel logs (it might depend on the log-level in your funnel config):
```
local     Starting upload
attempt   0
index     0
taskID    b84p40blt51hgl189heg
time      2017-11-13T13:37:28+01:00
timestamp 2017-11-13T12:37:28.042344Z
url       file:///tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid
....
local     TASK_OUTPUTS
attempt   0
index     0
outputs   []*tes.OutputFileLog{
  &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", SizeBytes:0},
  &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", SizeBytes:36},
}
```
It might be that, for some reason, the url param in the "Starting upload" log entry isn't inside the directory defined as `rabix.tes.storage.base` in the rabix config, or that "uuid.txt" in the "TASK_OUTPUTS" log hasn't been created.
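That first check can be scripted. Below is a small sketch (not part of rabix or funnel; the function name and the `/opt/storage` base are made up for illustration) that tests whether a `file://` upload url from the funnel log sits under the directory configured as `rabix.tes.storage.base`:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def upload_under_base(upload_url: str, storage_base: str) -> bool:
    """Return True if a funnel 'Starting upload' file:// URL points
    inside the directory configured as rabix.tes.storage.base."""
    parsed = urlparse(upload_url)
    if parsed.scheme != "file":
        return False  # only local file:// URLs can be checked this way
    path = PurePosixPath(parsed.path)
    base = PurePosixPath(storage_base)
    return base == path or base in path.parents

# The url from the log excerpt above, checked against two candidate bases:
url = ("file:///tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2"
       "/root/run_load_contents/3/run_get_uuid")
print(upload_under_base(url, "/tmp/tes"))      # True: under the base
print(upload_under_base(url, "/opt/storage"))  # False: outside the base
```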
I haven't fully grokked the issue here, but wanted to note that we recently had a bug with the output file log in Funnel, https://github.com/ohsu-comp-bio/funnel/issues/290 . Not sure it applies here.
@milos-ljubinkovic could this be due to the fact that our SLURM cluster is running on nodes without a shared fs? Within each node the `rabix.tes.storage.base` directory exists, but across nodes it is obviously not the same, since it isn't on a shared fs.
I don't know how funnel handles that situation. We just declare a shared directory (`rabix.tes.storage.base`) between the funnel server and the rabix executor, because rabix has to do some postprocessing on the outputs after execution; I have no idea how data transfer between the SLURM nodes and the funnel server works.

If funnel skipped over, or had an error during, the upload, then this is the issue; but if it succeeded in uploading the outputs (anywhere, maybe a different directory than `rabix.tes.storage.base`) after each SLURM node completed, then it's a different one.
If you set `rabix.tes.storage.base` to a bucket storage location, e.g. `s3://rabix-bucket/shared`, is rabix able to use that?
I'm having all sorts of issues with the s3 backend using the branch of bunny that I have been working on. I did go through and add the `META-INF` directory to get rid of the provider errors, but it still isn't able to communicate. It is important to note that this is a Cleversafe s3 store, not AWS.
I don't think I ever got s3 urls containing custom endpoints to work in funnel, so rabix just defaults to AWS's default s3 host.

@buchanae Have you tested s3 support with other hosts, and what should the urls look like? It might be that I just used the wrong form.
@milos-ljubinkovic it seems like bunny is adding a `null` to the s3 URI:

```
local     System error
attempt   0
error     Output upload not supported by storage: name:"run_get_uuid" url:"s3://null/bioinformatics_scratch/kmhernan/bunny/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" path:"/home/ubuntu/funnel-work-dir/b85ocllv48vjfgd4hsjg/home/ubuntu/kmhernan/custom_rabix/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" type:DIRECTORY
index     0
taskID    b85ocllv48vjfgd4hsjg
time      2017-11-15T00:12:07Z
timestamp 2017-11-15T00:12:07.019349748Z
```
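Why this breaks the upload: in an `s3://` url the first path component is treated as the bucket, so funnel would see a bucket literally named `null`, with the real bucket name pushed into the object key. A quick illustration using Python's standard `urlparse` (a demonstration only, not funnel's actual parsing code):

```python
from urllib.parse import urlparse

# The url from the error log above: the netloc (bucket) is literally "null",
# and the real bucket name has been pushed into the object key.
u = urlparse("s3://null/bioinformatics_scratch/kmhernan/bunny/"
             "28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid")
print(u.netloc)                              # null
print(u.path.lstrip("/").split("/", 1)[0])   # bioinformatics_scratch
```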
Funnel doesn't support endpoints for S3 yet, and we're not sure Funnel's S3 client will work outside of AWS. https://github.com/ohsu-comp-bio/funnel/issues/338
@buchanae do you know if the `null` is being added to the URI on the funnel side? e.g., `s3://null/`
@kmhernan funnel does not modify the URIs that are given to it. That `null` must be coming from bunny.
Bunny is adding that because I hastily edited the s3 lib I use so that it generates urls without endpoints, which is what funnel supports. When funnel gets support for endpoints I'll revert it.

In the meantime I guess I could remove the `null`, but funnel would still look for the file on the default Amazon AWS endpoint, so it wouldn't matter in this case. If you have files that are on Amazon AWS and declare them in task inputs without the endpoint, they should be sent to funnel without the `null`.
@milos-ljubinkovic @kmhernan I am working on extending our s3 support. I am hoping to get a PR up by tomorrow that you could test.
@kmhernan @milos-ljubinkovic -> https://github.com/ohsu-comp-bio/funnel/pull/356
As I note in the PR, there are still some outstanding issues I need to address, but this should unblock further testing and development.
Great, @adamstruck. @milos-ljubinkovic, just letting you know: feel free to tag me in something to test on our Ceph/Cleversafe s3 stores when you get a chance.
@milos-ljubinkovic funnel was updated to use the new TES v0.3 spec recently. To use these new features we will need to update the TES model in bunny. Let me know if you would like help with that.
@milos-ljubinkovic I made some changes locally to test the newest funnel release and attached the diffs here. Pretty basic, no hardcore testing: tes_update_diffs.txt
I've updated the bunny TES models to v0.3 and have been trying to test the generic s3 support from your pull request branch, but I can't get urls with endpoints to work. Are the config and file url supposed to look like this, or do I have the wrong format?

```yaml
Worker:
  Storage:
    S3:
      AWS:
        Endpoint: s3.amazonaws.com
        Key: key
        Secret: secret
```

```
"url": "s3://s3.amazonaws.com/testbunny/input.txt"
```

I can generate file urls without the endpoint and that works, but is this a valid solution?
The config isn't correct. Please look at the new docs:
Sorry about the confusing config structure. It's something we are working on.
@adamstruck It seems like the S3 settings in the config file aren't propagated to the worker node. This is the S3 section I see on the worker node, which does not match my input config:

```yaml
S3:
  FromEnv: false
  Key: ""
  Secret: ""
```
The config isn't propagated to the worker node unless there is a shared file system. You can put the config on the worker and modify the submit template to point to that file. I'll see what I can do to make this more straightforward when I get back next week.

Also, I'll need to investigate how your worker got that config structure, since none of the storage backends in this branch use the `FromEnv` keyword.
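For anyone trying the workaround above: the idea is to point the worker at a config file that already exists on each node. A rough sketch of a modified SLURM submit template follows; the template variables and `funnel worker run` flags vary between funnel versions, so treat every name here as an assumption and check `funnel worker run --help` for your build.

```
#!/bin/bash
#SBATCH --job-name {{.TaskId}}

# /etc/funnel/worker.config.yml must already exist on every worker node
# (e.g. distributed by your provisioning tooling), since without a shared
# file system it is not copied over automatically.
{{.Executable}} worker run --config /etc/funnel/worker.config.yml --taskID {{.TaskId}}
```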
@adamstruck my test of funnel from your fork against my Cleversafe endpoint was successful, as long as I didn't have the `http://` prefix on the endpoint. @milos-ljubinkovic for the file url I did not include the endpoint; e.g., my url was `"s3://bioinformatics_scratch/kmhernan/meta.txt"`. I hope this helps.
Added support for s3 endpoints in the latest bunny release. The lib I used had some issues with Google Storage's s3 interface due to differences in the request parameters, so I'm not sure it will work with IBM's implementation.
Scatter seems to be working for me. Whether or not all tasks are started simultaneously is determined by the pool size, which can now be configured (`rabix.tes.task-thread-pool-size`).
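Presumably this is set like any other key in rabix's properties-style config (the value 8 below is an arbitrary example, and the exact config file name depends on your rabix installation):

```
rabix.tes.task-thread-pool-size=8
```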
When running with the TES backend (funnel) using a workflow that has `scatter`, nothing seems to actually be submitted to the TES. I have provided the DEBUG log of the run. After the last line, bunny just seems to hang.