Open huddlej opened 2 years ago
For folks who are interested in this approach to using GISAID data with the modern mpox repo layout, you should run the commands above from inside the phylogenetic directory of this repository. I have updated the nextstrain build command in the example above to reflect updates in the Nextstrain ecosystem.
Note that the workflow is currently broken for GISAID data until #273 is resolved.
Locally resolving #273 by adding the missing column to the metadata did not fix the workflow, because several other rules and scripts in the workflow still hardcode columns that the metadata doesn't have. The bigger issue is that the workflow expects the data to have been passed through the ingest workflow, which hints that the better solution may be to pass GISAID data through ingest first.
I'd like us to avoid this need. I wrote "I don't think we expect (m)any users to write ingest pipelines, I see them as a framework for how the nextstrain team separates concerns for production builds." Spiking data into a nextstrain workflow is something we should support, with certain constraints. I think a good entry point to (any) workflow is to ask users to provide a merged metadata / sequences file (leveraging augur merge, augur curate, whatever they are comfortable with) and then assert at the start of the workflow any requirements of that data (e.g. column X is needed). I think Cornelius' comment is a sensible guideline when building phylo workflows to avoid needing so many specific columns: "check for presence of that column and make the filter dependent on whether it's present or not."
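To make the "assert at the start, then make the filter dependent on what's present" idea concrete, here is a minimal Python sketch. It is not code from this workflow: the function names, the required columns (strain, date), and the optional grouping column (country) are all assumptions chosen for illustration.

```python
import csv
import io


def read_columns(handle):
    """Return the column names from the header row of a metadata TSV."""
    reader = csv.reader(handle, delimiter="\t")
    return next(reader, [])


def usable_group_by(columns, required, optional):
    """Fail early on missing required columns, then keep only the
    optional grouping columns that the metadata actually has.

    `required` and `optional` are hypothetical lists supplied by the
    workflow author; nothing here is an existing Augur or workflow API.
    """
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(
            "Metadata is missing required column(s): " + ", ".join(missing)
        )
    return [name for name in optional if name in columns]


# Hypothetical metadata header without a "country" column.
header = io.StringIO("strain\tdate\tlocation\n")
columns = read_columns(header)
group_by = usable_group_by(columns, required=["strain", "date"], optional=["country"])
print(group_by)
```

With this header the check passes and the optional country grouping is dropped, so a rule could build its group-by arguments from the returned list instead of hardcoding them. A header without strain or date would raise a ValueError before any other rule runs.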
Mpox may be a great repo to push on this vision - there's lots of non-NCBI data and the workflow is relatively complex.
For users who want to use GISAID data with this workflow, the following steps work nearly as expected.
These steps assume you have downloaded:
Note that the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field to the standard Nextstrain geographic columns (region, country, division, and location). This means the default Augur filter logic, which groups by country and year, prints a warning that it cannot find a "country" column and only groups by year. In Augur 16.0.0, this missing group-by column will produce an error instead, so we should consider implementing the transform for GISAID locations.
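As a rough illustration of what such a transform could look like, the sketch below splits a location string into the four Nextstrain geographic columns. It assumes the GISAID field is a slash-separated hierarchy such as "Europe / Germany / Berlin"; that format, the function name, and the handling of extra or missing levels are assumptions, not an existing Augur command.

```python
GEO_COLUMNS = ("region", "country", "division", "location")


def split_gisaid_location(value, separator="/"):
    """Split a slash-separated location string into Nextstrain geo columns.

    Hypothetical helper: missing levels become empty strings, and any
    levels beyond the third are joined back together as `location`.
    """
    parts = [part.strip() for part in value.split(separator)]
    head, tail = parts[:3], parts[3:]
    head += [""] * (3 - len(head))  # pad region/country/division
    location = " / ".join(part for part in tail if part)
    return dict(zip(GEO_COLUMNS, head + [location]))


record = split_gisaid_location("Europe / Germany / Berlin")
print(record)
```

Applying this to every metadata row before filtering would give the workflow a country column to group by. Real GISAID records may need extra cleanup (inconsistent spacing, alternate country spellings) that this sketch does not attempt.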
Given the commands above, however, I get the following tree from the workflow:
The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.