naturalis / bio-cipres

Phylogenomic analysis on the CIPRES REST portal
MIT License

Rerunnable workflow as CWL #5

Open rvosa opened 4 years ago

rvosa commented 4 years ago

The goal of the basic workflow is to be able to consume unaligned FASTA, align this (i.e. solve #3) and build a tree with it (by addressing #4). These steps are implemented with tools, scripts, and web service calls that are all provisioned inside a Docker container (whose Dockerfile is in the root of the repo, and whose tag will be the same as the repo name).

Subsequently, these steps will be chained together using CWL, most of which is already scaffolded in PR #1. The essential test is therefore that we should be able to run the whole thing on a clean computer using something like cwl-runner. We will then submit this to covid19.workflowhub.eu.
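The chained version described above can be sketched as a two-step CWL workflow. This is only an illustration: the tool file names (mafft.cwl, iqtree.cwl) and port names are placeholders, not the actual files scaffolded in PR #1.

```cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

inputs:
  unaligned_fasta: File        # raw input sequences

outputs:
  tree:
    type: File
    outputSource: build_tree/tree

steps:
  align:                       # the alignment step (issue #3)
    run: mafft.cwl
    in:
      sequences: unaligned_fasta
    out: [alignment]
  build_tree:                  # the tree-building step (issue #4)
    run: iqtree.cwl
    in:
      alignment: align/alignment
    out: [tree]
```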

rvosa commented 4 years ago

(For consulting on how to finalize this, we might talk to Tazro and/or Michael Crusoe)

rvosa commented 4 years ago

(For consulting on workflow hub registration, we can consult Carole Goble)

mr-c commented 4 years ago

@rvosa Happy to help!

rvosa commented 4 years ago

Hi @mr-c, thanks! Here's something I'm wondering about. In this repo they built a little workflow that does the alignment and tree building locally, for the purpose of then doing a tree shape analysis that assigns clade identifiers to the different sequences. People running that pipeline might experience some performance issues especially with the alignment step, because MAFFT is kind of expensive.

To address that, I would like to be able to provide our pipeline to them so that the compute steps are done on the CIPRES server instead.

Could you sketch out the steps it would take to make our project portable enough that this would be as painless as possible? I'm thinking something like:

  1. our docker container is on docker hub
  2. the CWL orchestrates the interaction with the container to do our pipeline
  3. the CWL workflow ends up on workflow hub ...
  4. the conda environment.yml that they're running pulls in our workflow
rvosa commented 4 years ago

i.e. what are the ... steps that would need to happen?

mr-c commented 4 years ago

Hello @rvosa !

I think it's a great idea to both run the analysis and provide a portable "take home" version.

  1. Make a CWL workflow. Ensure that each application has its own Docker container, preferably from biocontainers.pro.
  2. Distribute this workflow. Users can run it from any CWL-compatible system. The workflow should also be registered with the Workflow Hub.
  3. No need for a conda environment.yml; their CWL runners will automatically use the Docker containers. If you'd like a non-Docker version, we can add SoftwareRequirement hints, which some CWL runners will translate into conda packages.
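As a sketch of points 1 and 3 combined, each per-tool CWL file can carry both a Docker hint and a SoftwareRequirement hint; the container tag and package version below are illustrative examples, not pinned choices from this repo:

```cwl
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/mafft:7.471--h516909a_0
  SoftwareRequirement:
    packages:
      mafft:
        specs: [ https://anaconda.org/bioconda/mafft ]
        version: [ "7.471" ]
```

Runners with Docker available use the container; cwltool can alternatively resolve the SoftwareRequirement through conda via its --beta-conda-dependencies option.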
rvosa commented 4 years ago

Hi @mr-c,

well, for step 3 the issue is not so much that we need an environment.yml (we don't); the issue is that these guys distribute their pipeline with an environment.yml. What I would like to accomplish is that we can contribute our work as a drop-in replacement for some of the steps they're taking. How would that work?

mr-c commented 4 years ago

While I've never packaged a CWL workflow as a single Conda tool, it should be possible. A CWL workflow can start with #!/usr/bin/env cwl-runner and be marked executable. The Conda package could recommend or depend on the CWL reference runner, so everything would be invisible to the user. When using cwltool, they would even get --help output derived from the workflow inputs and the doc property.
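A minimal sketch of how that packaging could look as a conda recipe (meta.yaml); the package name, version, and workflow file name are all placeholders:

```yaml
package:
  name: bio-cipres-workflow
  version: "0.1.0"

build:
  number: 0
  noarch: generic
  script: |
    mkdir -p $PREFIX/bin
    cp workflow.cwl $PREFIX/bin/bio-cipres-workflow
    chmod +x $PREFIX/bin/bio-cipres-workflow

requirements:
  run:
    - cwltool
```

With the #!/usr/bin/env cwl-runner shebang in place, installing the package would put a runnable bio-cipres-workflow command on the user's PATH.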

rvosa commented 4 years ago

How would it work the other way around? Like, I make conda recipes for the reusable tools developed here, and now I want to invoke those from CWL. Is there some facility that wraps that?

mr-c commented 4 years ago

There is a basic CWL workflow

https://view.commonwl.org/workflows/github.com/common-workflow-lab/2020-covid-19-bh/blob/8fd2d9814a5641a55efd8e63fa65a652b66f9d0b/msa/msa.cwl

[Workflow diagram]

It can be run locally:

cwltool https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl  \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml

or via the Arvados instance at biohackathon.curii.com

arvados-cwl-runner https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl  \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml
mr-c commented 4 years ago

Throughout this repository I found conflicting command line arguments in use, so please tell me the preferred options.

There are two options for the XSEDE version of IQ-TREE that I was unable to decipher:

  1. vparam.specify_runtype_=2 ("Specify the run type"; 2 = Tree Inference)
  2. vparam.specify_numparts_=1 ("How many partitions does your data set have?")

Is there a source file that shows how http://www.phylo.org/index.php/rest/iqtree_xsede.html is turned into a command line?
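For reference, the two vparams above travel to CIPRES as plain form fields on a job-submission POST; the portal assembles the actual command line server-side. The field names and endpoint in this sketch follow the public CIPRES REST API description, but the helpers themselves are hypothetical and should be checked against a real account:

```python
def iqtree_job_fields(runtype: int = 2, numparts: int = 1) -> dict:
    """Form fields for a basic IQTREE_XSEDE tree inference.

    specify_runtype_ = 2 selects "Tree Inference"; specify_numparts_
    is the number of data partitions (1 for unpartitioned data).
    """
    return {
        "tool": "IQTREE_XSEDE",
        "vparam.specify_runtype_": str(runtype),
        "vparam.specify_numparts_": str(numparts),
    }


def as_curl(fields: dict, username: str, infile: str) -> str:
    """Render the fields as a curl job submission (v1 REST endpoint)."""
    parts = [f"curl -u {username} -H 'cipres-appkey: $KEY'"]
    parts += [f"-F {key}={value}" for key, value in sorted(fields.items())]
    parts.append(f"-F input.infile_=@{infile}")
    parts.append(f"https://cipresrest.sdsc.edu/cipresrest/v1/job/{username}")
    return " \\\n  ".join(parts)


print(as_curl(iqtree_job_fields(), "alice", "unaligned.fasta"))
```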