googlegenomics / dockerflow

Dockerflow is a workflow runner that uses Dataflow to run a series of tasks in Docker with the Pipelines API
Apache License 2.0
97 stars 17 forks

Finding files created by executable #16

Open seandavi opened 7 years ago

seandavi commented 7 years ago

I am working with a software package that takes the name of an output directory as a parameter; the tool then writes a number of files into that directory. I cannot see a way to specify those files as outputs, since Dockerflow appears to ignore the directory prefix when I specify the outputFile. Here is what the output directory structure looks like:

index/
├── hash.bin
├── header.json
├── indexing.log
├── quasi_index.log
├── rsd.bin
├── sa.bin
├── txpInfo.bin
└── versionInfo.json

Here is the code I am trying to use (with just one of the files above as an example):

    static Task salmonIndex = TaskBuilder.named("salmonIndex")
                .inputFile("fasta")
                .outputFile("indexVersion","index/versionInfo.json")
                .docker("seandavi/salmon")
                .preemptible(true)
                .diskSize("20")
                .memory(14)
                .cpu(2)
                .script("salmon index --index=index --transcripts=${fasta}")
                .build();

    static WorkflowArgs workflowArgs = ArgsBuilder.of()
            .input("fasta", "${fasta}") 
            .output("indexVersion", "${salmonIndex.indexVersion}")
            .build();

And here is the error I am getting. Note that the gsutil cp fails because the index/ prefix in the path appears to be ignored.

(d81d2f3a5be0ea0c): java.lang.RuntimeException: 
com.google.cloud.dataflow.sdk.util.UserCodeException: java.lang.RuntimeException: 
com.google.cloud.dataflow.sdk.util.UserCodeException: java.lang.RuntimeException: 
com.google.cloud.dataflow.sdk.util.UserCodeException: 
com.google.cloud.genomics.dockerflow.runner.TaskException: Operation 
operations/ENbjmKaCKxj6h6-zxaC7zokBIL3n-N_FEioPcHJvZHVjdGlvblF1ZXVl failed. Details: 10: 
Failed to delocalize files: failed to copy the following files: "/mnt/data/100346066-versionInfo.json 
-> gs://gbseqdata/ockerflow_example/ch1/salmonIndex/index/versionInfo.json (cp failed: gsutil -q -
m cp -L /var/log/google-genomics/out.log /mnt/data/100346066-versionInfo.json 
gs://gbseqdata/ockerflow_example/ch1/salmonIndex/index/versionInfo.json, command failed: 
CommandException: No URLs matched: /mnt/data/100346066-
versionInfo.json\nCommandException: 1 file/object could not be transferred.\n)" at 
com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:162

I am likely just misunderstanding some pieces here, but I thought I would just go ahead and ask.

jbingham commented 7 years ago

Great that you're testing this! In theory Dockerflow can handle it, but I don't have a test case for it.

You should be able to define the outputFile's local path explicitly:

.outputFile("indexVersion", "index/versionInfo.json", "/mnt/data/index/versionInfo.json")

Does that work?
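Applied to the task above, that would look something like the following. This is only a sketch based on the three-argument outputFile call suggested here, not a tested configuration; note that the script must actually write the index under /mnt/data/index so the tool's output path and the declared local path line up (an assumption about how the paths are resolved):

```java
static Task salmonIndex = TaskBuilder.named("salmonIndex")
        .inputFile("fasta")
        // Third argument pins the file's local path inside the container,
        // so the index/ prefix is preserved when the file is delocalized.
        .outputFile("indexVersion", "index/versionInfo.json",
                "/mnt/data/index/versionInfo.json")
        .docker("seandavi/salmon")
        .preemptible(true)
        .diskSize("20")
        .memory(14)
        .cpu(2)
        // Write the index under /mnt/data so it matches the explicit path above.
        .script("salmon index --index=/mnt/data/index --transcripts=${fasta}")
        .build();
```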

FYI, the meaningless number in the path is a workaround to avoid naming collisions and non-existent directories by keeping everything in a flat directory on local disk. We have some ideas for how to fix that in the Pipelines API (or in Dockerflow), but haven't gotten around to it. One idea is to lay out files like this:

/mnt/data/gs/bucket/path/inputFile
/mnt/data/gs/bucket/path/outputFile

Or we could split into separate inputs and outputs:

/mnt/data/inputs/gs/bucket/path/inputFile
/mnt/data/outputs/gs/bucket/path/outputFile

Either way, we'd need to run "mkdir -p" so that tools can write to the folder even if they don't check that it exists. Do you have a preference between these two approaches?
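The first layout amounts to a simple URI-to-path translation. Here is a minimal, self-contained sketch of the idea; the class and method names are invented for illustration and this is not Dockerflow's actual code:

```java
// Sketch of the proposed mapping: mirror the GCS URI under /mnt/data/gs/
// so each file keeps its bucket-relative directory structure on local disk.
public class PathMapper {

    /** Map gs://bucket/path/file to /mnt/data/gs/bucket/path/file. */
    public static String toLocalPath(String gcsUri) {
        if (!gcsUri.startsWith("gs://")) {
            throw new IllegalArgumentException("Not a GCS URI: " + gcsUri);
        }
        return "/mnt/data/gs/" + gcsUri.substring("gs://".length());
    }

    public static void main(String[] args) {
        // Before the tool runs, Dockerflow would also need a "mkdir -p"
        // equivalent on the parent directory, e.g.
        // Files.createDirectories(Paths.get(local).getParent());
        System.out.println(toLocalPath("gs://bucket/path/outputFile"));
    }
}
```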

seandavi commented 7 years ago

I think the former is a little less complicated, and perhaps less magical. I believe that is also how Cromwell handles this. There is no need for you to do the same thing, but following the precedent may not be a bad idea.

I'll try to test this for the time being, but I like your idea of a more permanent solution; if you think you'll get to it soon, let me know and I will avoid hacks in the meantime.

jbingham commented 7 years ago

Thanks for the feedback. We're unlikely to have it fixed for you very soon -- I'd suggest using the explicit path for now. Let me know if it doesn't work and I can turn around a quick fix.


seandavi commented 7 years ago

Thanks, @jbingham. Sounds like a plan.

jbingham commented 7 years ago

@Sean, did you ever get this working? Need anything from me?


jbingham commented 7 years ago

@Sean, any updates on this? I'm looking into adding recursive directory copy, and I could change this behavior at the same time.

jbingham commented 7 years ago

PR for recursive folder copy is in now. The change to folder paths isn't yet.

@seandavi FYI!