ohsu-comp-bio / cwl-tes

cwl-tes submits your tasks to a TES server. Task submission is parallelized when possible.
Apache License 2.0
18 stars 28 forks source link

AWS s3 support #34

Open kmavrommatis opened 4 years ago

kmavrommatis commented 4 years ago

I have created a draft cwl-tes version that is accepting as inputs s3 URIs and creates outputs to s3 as well. This is based on the existing tool with an s3 oriented stdFsAccess class (according to #6)

However, although the process proceeds and ends with success the final output of the tool is referencing local files e.g.

{
    "whirpoolOutput": {
        "location": "file:///opt/cwl-tes/whirpool",
        "basename": "whirpool",
        "class": "File",
        "size": 200,
        "path": "/opt/cwl-tes/whirpool"
    }
}

although the location should be an s3 URI, and presumably path should be either non existent or empty.

I have not been able to find the place where this change happens. Is there any suggestion how to address this problem? Thanks in advance for any help.

kellrott commented 4 years ago

@mr-c wrote the FTP driver. He might have a better idea where that happens.

@kmavrommatis have you pushed this new S3 branch onto Github yet?

adamstruck commented 4 years ago

The default output path is set via a command line arg: https://github.com/ohsu-comp-bio/cwl-tes/blob/master/cwl_tes/main.py#L406

Maybe that is the issue?

kmavrommatis commented 4 years ago

Thanks for the suggestions.

@kellrott . I have not pushed it yet, I am still testing it.

@adamstruck. When I set the output directory to an s3 location the object location is taking the form /object

e.g. setting the location --outdir s3://mytes-test/ the output is:

"whirpoolOutput": {
        "location": "file:///opt/cwl-tes/s3%3A/mytes-test/funnel_1/whirpool",
        "basename": "whirpool",
        "class": "File",
        "size": 200,
        "path": "/opt/cwl-tes/s3:/mytes-test/funnel_1/whirpool"
    },

I will keep looking .

kmavrommatis commented 4 years ago

Update: The problem seems to come from the way cwltool handles files and file URIs. I have made a pull request with cwltool to fix this. As soon this is in place I will make a pull request for cwl-tes with the additional support for s3.

kellrott commented 4 years ago

It looks like @jvkersch has been working on an S3 implementation in parallel, in #38 Is there any way to combine effort?

jvkersch commented 4 years ago

I would be happy to combine efforts, if possible. @kmavrommatis Probably the first step is to compare my and your approach and identify commonalities/differences. If you push up your PR, we can take turns reviewing each other's code.

While working on #38 I found that there were a couple of places in the main code (i.e. outside of the backend class) that needed touching up to deal with S3. I did not make any changes to cwltool, but I noticed that it sometimes is particular with what it accepts.

kmavrommatis commented 4 years ago

Absolutely. I will make a pull request in the next couple days.

kmavrommatis commented 4 years ago

@jvkersch, @kellrott Apologies for the long delay. I have made a PR (https://github.com/ohsu-comp-bio/cwl-tes/pull/41) with the initial working code for the use of cwl-tes in an AWS environment. This is still work in progress and more a baseline for discussion than a final solution.