rvosa opened this issue 4 years ago
(For consulting on how to finalize this, we might talk to Tazro and/or Michael Crusoe)
(For consulting on workflow hub registration, we can consult Carole Goble)
@rvosa Happy to help!
Hi @mr-c, thanks! Here's something I'm wondering about. In this repo they built a little workflow that does the alignment and tree building locally, for the purpose of then doing a tree shape analysis that assigns clade identifiers to the different sequences. People running that pipeline might experience some performance issues especially with the alignment step, because MAFFT is kind of expensive.
To address that, I would like to be able to provide our pipeline to them so that the compute steps are done on the CIPRES server instead.
Could you sketch out the steps of what it would take for our project to be portable enough so that this would be as painless as possible? I'm thinking something like:
i.e. what are the ... steps that would need to happen?
Hello @rvosa !
I think that is a great idea to both run the analysis and also provide a portable "take home" version.
`SoftwareRequirement` hints, which some CWL runners will translate into conda packages.

Hi @mr-c,
Well, for step 3 the issue is not so much that we need an environment.yml (we don't); the issue is that they distribute their pipeline with an environment.yml. What I would like to accomplish is that we can contribute our work as a drop-in replacement for some of the steps they've been taking. How would that work?
While I've never packaged a CWL workflow as a single Conda tool, it should be possible. A CWL workflow can start with `#!/usr/bin/env cwl-runner` and be marked executable. The Conda package could recommend or depend on the CWL reference runner, so everything would be invisible to the user. When using `cwltool`, they would even get `--help` output derived from the workflow inputs and `doc` property.
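A minimal sketch of that idea, with the file marked executable; all file and input names below are illustrative, not the actual workflow in this repo:

```cwl
#!/usr/bin/env cwl-runner
# Hypothetical entry point that a Conda package could ship on PATH.
cwlVersion: v1.2
class: Workflow
doc: Align unaligned FASTA sequences and build a tree (sketch only).
inputs:
  sequences:
    type: File
    doc: Unaligned FASTA input; this text appears in the --help output.
outputs:
  alignment:
    type: File
    outputSource: align/alignment
steps:
  align:
    run: mafft.cwl   # hypothetical tool wrapper
    in:
      sequences: sequences
    out: [alignment]
```

After `chmod +x`, running the file with `--help` would print the `doc` strings and input parameters via the reference runner.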
How would it work the other way around? Like, I make conda recipes for the reusable tools developed here, and now I want to invoke those from CWL. Is there some facility that wraps that?
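One way this could work, as a sketch: wrap each conda-packaged tool in a `CommandLineTool` and declare the package in a `SoftwareRequirement` hint, which `cwltool` can resolve with its `--beta-conda-dependencies` flag. The tool name, version, and spec URL below are assumptions for illustration:

```cwl
# Sketch: a CWL wrapper for a hypothetical conda-packaged mafft.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: mafft
hints:
  SoftwareRequirement:
    packages:
      mafft:
        specs: [ "https://anaconda.org/bioconda/mafft" ]
        version: [ "7.471" ]
inputs:
  sequences:
    type: File
    inputBinding: { position: 1 }
stdout: aligned.fasta
outputs:
  alignment:
    type: stdout
```

With such hints in place, the same wrapper also runs unchanged on runners that ignore hints, provided the tool is already on the PATH.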
There is a basic CWL workflow (msa/msa.cwl).
It can be run locally:
```shell
cwltool https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml
```
or via the Arvados instance at biohackathon.curii.com:

```shell
arvados-cwl-runner https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml
```
Throughout this repository I found conflicting command line arguments in use, so please tell me the preferred options.
There are two options for the XSEDE version of IQTree that I was unable to decipher:
vparam.specify_runtype_=2 - Specify the run type - 2 for Tree Inference.
and
vparam.specify_numparts_=1 - How many partitions does your data set have.
Is there a source file that shows how http://www.phylo.org/index.php/rest/iqtree_xsede.html is turned into a command line?
The goal of the basic workflow is to be able to consume unaligned FASTA, align this (i.e. solve #3) and build a tree with it (by addressing #4). These steps are implemented with tools, scripts, and web service calls that are all provisioned inside a Docker container (whose Dockerfile is in the root of the repo, and whose tag will be the same as the repo name).
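The container provisioning described above would typically surface in the CWL files as a `DockerRequirement` hint on each step; the image tag below is a placeholder, since the actual tag will match the repo name:

```cwl
hints:
  DockerRequirement:
    # Placeholder image; the real tag will be the same as the repo name.
    dockerPull: example-org/example-repo:latest
```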
Subsequently, these steps will be chained together using CWL, most of which is already scaffolded in PR #1. The essential test is therefore that we should be able to run the whole thing on a clean computer using something like `cwl-runner`. We will then submit this to covid19.workflowhub.eu.