nextstrain / mpox

Nextstrain build for mpox virus
https://nextstrain.org/mpox
MIT License

Support for GISAID data #63

Open huddlej opened 2 years ago

huddlej commented 2 years ago

For users who want to use GISAID data with this workflow, the following steps work nearly as expected.

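These steps assume you have downloaded two files from GISAID, the sequences (FASTA) and the patient metadata (TSV), and saved them under the names shown in the comments below: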

# Change into phylogenetic workflow directory.
cd phylogenetic/

# Create a data directory to download files into.
mkdir -p data/

# Download sequences: data/gisaid_pox_2022_06_16_19.fasta
# Download patient metadata: data/gisaid_pox_2022_06_16_19.tsv
# Note: patient metadata lacks submitting/originating lab.

# Parse out metadata from sequence deflines.
augur parse \
  --sequences data/gisaid_pox_2022_06_16_19.fasta \
  --fields strain gisaid_epi_isl date \
  --output-sequences data/sequences.fasta \
  --output-metadata data/sequence_metadata.tsv

# Join sequence metadata with patient metadata.
csvtk --tabs join -f 1 \
  data/sequence_metadata.tsv \
  data/gisaid_pox_2022_06_16_19.tsv > data/metadata.tsv

# TODO: Need a transform for GISAID locations like the one we have for GenBank.

# Run workflow.
nextstrain build \
  --docker \
  --cpus 1 \
  . \
  --configfile defaults/mpxv/config.yaml \
  --config strain_id_field=strain display_strain_field=strain

Note, the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field to the standard Nextstrain geographic columns (region, country, division, and location). As a result, the default Augur filter logic that groups by country and year prints a warning that it cannot find a "country" column and groups only by year. In Augur 16.0.0, this missing group-by column will produce an error message instead of a warning, so we should consider implementing the transform for GISAID locations.
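A minimal sketch of what such a transform might look like, assuming the GISAID patient metadata has a column named Location whose values are slash-delimited from region down to location (for example, "Europe / Germany / Bavaria"). The function name split_gisaid_location and the column name are assumptions for illustration, not part of Augur or this workflow, and real GISAID values would likely need name normalization that this does not attempt.

```shell
# Sketch: append region, country, division, and location columns to a TSV
# read from stdin, derived from a GISAID-style location column.
# Assumptions: the column is named "Location" (override with $1) and its
# values are separated by "/" with optional surrounding spaces.
split_gisaid_location() {
  awk -v col="${1:-Location}" '
    BEGIN { FS = OFS = "\t" }
    NR == 1 {
      # Find the index of the location column in the header row.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "missing column: " col > "/dev/stderr"; exit 1 }
      print $0, "region", "country", "division", "location"
      next
    }
    {
      loc = $idx
      gsub(/^ +| +$/, "", loc)            # trim outer spaces
      n = split(loc, parts, / *\/ */)     # split on "/" and drop padding
      out = $0
      # Missing levels (e.g. no division) are written as empty fields.
      for (i = 1; i <= 4; i++) out = out OFS (i <= n ? parts[i] : "")
      print out
    }
  '
}

# Example usage (paths from the steps above; the output name is hypothetical):
#   split_gisaid_location < data/metadata.tsv > data/metadata_with_geo.tsv
```

Rows with fewer than four levels keep empty trailing columns, so later steps that expect all four geographic columns still find them in the header.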

Given the commands above, however, I get the following tree from the workflow:

[Image: phylogenetic tree produced by the workflow]

The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.

huddlej commented 2 months ago

For folks who are interested in this approach to using GISAID data with the modern mpox repo layout, you should run the commands above from inside the phylogenetic directory of this repository. I have updated the nextstrain build command in the example above to reflect updates in the Nextstrain ecosystem.

Note that the workflow is currently broken for GISAID data until #273 is resolved.

huddlej commented 2 months ago

Locally resolving #273 by adding the missing column to the metadata did not fix the workflow because there are still several hardcoded columns in other rules or scripts of the workflow that the metadata doesn't have. The bigger issue is that the workflow expects the data to have been passed through the ingest workflow which hints that maybe the better solution to the problem is to pass GISAID data through ingest first.

jameshadfield commented 2 months ago

"The bigger issue is that the workflow expects the data to have been passed through the ingest workflow which hints that maybe the better solution to the problem is to pass GISAID data through ingest first."

I'd like us to avoid this need. I wrote "I don't think we expect (m)any users to write ingest pipelines, I see them as a framework for how the nextstrain team separates concerns for production builds." Spiking data into a nextstrain workflow is something we should support, with certain constraints. I think a good entry point to (any) workflow is to ask users to provide a merged metadata / sequences file (leveraging augur merge, augur curate, whatever they are comfortable with) and then assert at the start of the workflow any requirements of that data (e.g. column X is needed). I think Cornelius' comment is a sensible guideline when building phylo workflows to avoid needing so many specific columns: "check for presence of that column and make the filter dependent on whether it's present or not."

Mpox may be a great repo to push on this vision - there's lots of non-NCBI data and the workflow is relatively complex.
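
The two ideas in the comment above (asserting required columns at the start of the workflow, and making a filter depend on whether a column is present) could be sketched in shell roughly as follows. The helper names has_column, require_columns, and available_group_by are hypothetical, as are the example column lists; none of this is an existing Augur or workflow feature.

```shell
# Sketch: inspect the header row of a metadata TSV before running a workflow.
# All function names and column lists here are illustrative assumptions.

# Succeed if the header row of TSV file $1 contains a column named exactly $2.
has_column() {
  head -n 1 "$1" | tr '\t' '\n' | grep -Fxq -- "$2"
}

# Fail early, listing every required column that is missing from file $1.
require_columns() {
  metadata="$1"
  missing=""
  shift
  for column in "$@"; do
    has_column "$metadata" "$column" || missing="$missing $column"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: $metadata is missing required column(s):$missing" >&2
    return 1
  fi
}

# Print the subset of candidate columns that file $1 actually has,
# so a filter step can group only by what is present.
available_group_by() {
  metadata="$1"
  present=""
  shift
  for column in "$@"; do
    if has_column "$metadata" "$column"; then present="$present $column"; fi
  done
  echo "${present# }"
}

# Example usage (path from the steps earlier in this thread):
#   require_columns data/metadata.tsv strain date || exit 1
#   group_by="$(available_group_by data/metadata.tsv country) year"
```

In the usage comment, year is appended unconditionally because augur filter derives it from the date column rather than reading it from the metadata.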