googlegenomics / pipelines-api-examples

Examples for the Google Genomics Pipelines API.
BSD 3-Clause "New" or "Revised" License

What infrastructure is in place for handling dependencies between pipelines? #35

Open seandavi opened 7 years ago

seandavi commented 7 years ago

Complex bioinformatics workflows usually have dependencies from one step in a workflow to the next. Is there functionality in the pipelines API to handle this? If not, any suggestions on implementation of such workflow dependencies, ideally with caching of previously computed results?

pgrosu commented 7 years ago

Hi Sean,

It can be done in batches: run one pipeline that saves its output to Google Storage, then pick that output up with a second pipeline. This works well because the I/O throughput with Google Storage is very fast. It is basically a variation of what WDL does via Cromwell with JES (the Pipeline API) as a backend.
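As a rough illustration of that batch idea, here is a minimal sketch that only plans the hand-off: each step's output location becomes the next step's input. The function name, the dictionary keys, and the `gs://example-bucket` paths are all made up for the example; nothing here calls the Pipelines API.

```python
def chain_steps(tools, first_input, work_dir):
    """Plan a linear workflow where each step reads what the previous one wrote.

    All names and paths are hypothetical; this only builds the plan and
    does not submit anything.
    """
    steps = []
    current_input = first_input
    for index, tool in enumerate(tools):
        # Give every step its own output prefix under the working directory.
        output = f"{work_dir}/{index:02d}-{tool}/output"
        steps.append({"tool": tool, "input": current_input, "output": output})
        # The next step picks up this step's output from storage.
        current_input = output
    return steps
```

Each planned step could then be submitted as its own pipeline run, one after the other.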

Like you, I come from the same background, and I suggested a possible approach a while ago here:

https://github.com/googlegenomics/pipelines-api-examples/issues/13#issuecomment-194038122

But the purpose of Pipeline API in Google Genomics is different, and it's easier to just quote Matt:

The idea of the pipelines API as it stands is to provide this very simple, but powerful, building block. It will enable many different pipeline runners (like cromwell) to add a "run in cloud" feature to their existing workflow definition files without the need to explicitly provision a fixed cluster (like Grid Engine or Mesos).

There are other connected-pipeline implementations one can use with Google, like Dataflow, which come closer to what you are looking for, but that requires that the data be loaded into the Google Genomics API first. Below is a link to the repository with examples:

https://github.com/googlegenomics/dataflow-java

If you have not used the Dataflow API before, below is how to construct a Dataflow pipeline:

https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline

I still think that connected pipelines is a critical feature of the Pipeline API.

Hope it helps, Paul

jbingham commented 7 years ago

In addition to the Broad Institute's Cromwell runner for WDL, a new open source project called Funnel builds on top of the Pipelines API to support complex workflows defined using Common Workflow Language (CWL). Funnel is being developed by the folks who run the DREAM challenges. The authors presented progress at a workshop last week at the Institute for Systems Biology.

Another idea is to write a tiny python wrapper that makes each call to the Pipelines API blocking. Then you can call it multiple times in a row, with different bioinformatics tools, to build a simple pipeline.
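A minimal sketch of such a wrapper, under some assumptions: `submit` and `get_operation` are hypothetical callables that you would implement with whatever client you use for the Pipelines API, and the operation is assumed to be a dictionary with `name`, `done`, and optional `error` fields, in the style of a long-running operation. Treat it as a starting point rather than the API's actual client code.

```python
import time


def run_blocking(submit, get_operation, request, poll_seconds=30.0, sleep=time.sleep):
    """Submit one pipeline request and block until its operation finishes.

    submit(request) and get_operation(name) are hypothetical callables
    supplied by the caller; they stand in for real API client calls.
    """
    operation = submit(request)
    # Poll until the service reports that the operation is done.
    while not operation.get("done"):
        sleep(poll_seconds)
        operation = get_operation(operation["name"])
    # Surface a failed step as an exception so later steps do not run.
    if "error" in operation:
        raise RuntimeError(f"pipeline failed: {operation['error']}")
    return operation
```

Calling `run_blocking` once per tool, in order, gives a simple sequential pipeline: the second call does not start until the first has finished or raised.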

About Paul's suggestion of using Cloud Dataflow (and the Apache Beam python SDK), that's also a possibility. You can imagine using Pipelines API to run individual steps, and using Dataflow for orchestration.

Features like caching of intermediate results, and retries, and preemptible VMs to reduce cost, are all things that can be added, and are definitely desirable. They're outside of the current scope of the Pipelines API and are probably best built on top, along the lines of Cromwell and Funnel.
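To make "built on top" concrete, here is a small sketch of caching and retries layered over a step runner. Everything in it is an assumption for illustration: `output_exists` and `run_step` are hypothetical callables (for example, a storage existence check and a blocking pipeline call), and a step is just a dictionary with an `output` key.

```python
def run_with_cache(step, output_exists, run_step, max_attempts=3):
    """Skip a step whose output already exists; otherwise run it with retries.

    output_exists(uri) and run_step(step) are hypothetical callables
    supplied by the caller. Returns "cached" or "ran".
    """
    # Caching: a previously computed result means there is nothing to do.
    if output_exists(step["output"]):
        return "cached"
    last_error = None
    for _ in range(max_attempts):
        try:
            run_step(step)
            return "ran"
        except RuntimeError as error:
            # Retry on failure, e.g. after a preempted VM.
            last_error = error
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

Checking for the output file is the simplest possible cache key; a real workflow engine would usually also account for changed inputs or parameters.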

Cheers, Jonathan

seandavi commented 7 years ago

Thanks, Jonathan, for the clarification of scope for the pipeline API. I agree that building on top of it, treating the pipeline API as a "raw executor", makes a lot of sense. There are a number of really good workflow engines out there already. Adapting them to the Google Genomics pipeline API is probably just a matter of time for at least some of them.

jbingham commented 7 years ago

Do you have any particular workflow engines in mind that you'd like to see adapted to support the Pipelines API? Just curious which you like best.


seandavi commented 7 years ago

Toil and nextflow are both getting a fair amount of love-and-care and have at least some support for CWL, abstract executors, and cloud files. Snakemake also has a good following, but I see this in a slightly different space. I also like the approach that https://github.com/GoogleCloudPlatform/appengine-pipelines uses (returns promises). A pretty full list is available here: https://github.com/pditommaso/awesome-pipeline
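For anyone unfamiliar with the promise style, here is a rough sketch of the general idea using Python's standard `concurrent.futures`. It is not the appengine-pipelines API; the `align` and `call_variants` functions are placeholders that only rename files.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sample):
    # Placeholder step: pretend to align reads and return the output name.
    return f"{sample}.bam"


def call_variants(bam):
    # Placeholder step: pretend to call variants on an aligned file.
    return bam.replace(".bam", ".vcf")


def call_variants_when_ready(bam_future):
    # The dependency is expressed by waiting on the upstream promise.
    return call_variants(bam_future.result())


def run_workflow(samples):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each submit returns a Future, i.e. a promise of the step's result.
        aligned = [pool.submit(align, sample) for sample in samples]
        # Upstream tasks are queued first, so the waiting tasks cannot starve them.
        called = [pool.submit(call_variants_when_ready, future) for future in aligned]
        return [future.result() for future in called]
```

The workflow code reads top to bottom like a script, while independent samples still run concurrently.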

jbingham commented 7 years ago

Thanks! We're definitely keen to help the Toil folks support Pipelines API and Google Cloud Platform generally. What I really like about nextflow is that it's python.


seandavi commented 7 years ago

I assume that you mean toil is python? Nextflow is groovy-based, though it is really a mini language in a sense.


jbingham commented 7 years ago

Yes, typo! I meant Toil is python.
