Ingest from NCBI Virus + NCBI Datasets

joverlee521 commented 1 month ago

Description of proposed changes

NCBI Datasets does not include the strain, serotype, and segment in the metadata, so this PR uses the NCBI Virus vvsearch2 API to pull down the metadata and joins it with the sequences downloaded via NCBI Datasets.
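As a rough illustration of the join described above (not the actual ingest rules in this PR), the sketch below matches metadata rows to FASTA records by accession. The column name `accession`, the function names, the placeholder sequences, and the `XX000000` record are all assumptions made for the example; it also assumes both sources use the same accession form (with or without a version suffix).

```python
import csv
import io


def read_fasta(handle):
    """Yield (accession, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the accession.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def join_metadata_with_sequences(metadata_tsv, fasta, id_column="accession"):
    """Attach each sequence to its metadata row; report sequences with no metadata."""
    metadata = {
        row[id_column]: row
        for row in csv.DictReader(metadata_tsv, delimiter="\t")
    }
    joined, missing = [], []
    for accession, sequence in read_fasta(fasta):
        record = metadata.get(accession)
        if record is None:
            missing.append(accession)
            continue
        joined.append({**record, "sequence": sequence})
    return joined, missing


# Hypothetical inputs: the sequences are placeholders, not real data.
example_metadata = io.StringIO(
    "accession\tstrain\n"
    "PP748223\tA/turkey/Missouri/24-005369-001/2024\n"
    "PP599465\tA/Bovine/texas/24-029328-01/2024\n"
)
example_fasta = io.StringIO(
    ">PP748223 placeholder description\nATGGAG\nAGAATA\n"
    ">PP599465\nATGAAT\n"
    ">XX000000\nATGCCC\n"
)
joined, missing = join_metadata_with_sequences(example_metadata, example_fasta)
```

Collecting the unmatched accessions separately, rather than dropping them silently, makes it easier to notice when the two NCBI sources drift out of sync.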

On this branch, I can run the ingest workflow with

nextstrain build ingest ingest_ncbi --configfile build-configs/ncbi/defaults/config.yaml

then successfully run the genome build (~using tmp changes in 4b511fc25620ea1fd6ef24846372ffc124344134~)

nextstrain build . --snakefile Snakefile.genome --config local_ingest=True ingest_source=ncbi

~Still needs some clean up but at least wanted to get this out for people to try out.~

Major change

This PR switches the h5n1-cattle-outbreak build over entirely to NCBI data. See the latest build at https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome

Related issue(s)

Related to #37

Checklist

[x] Checks pass
[x] Add description.md for h5n1-cattle-outbreak
[x] phylo include/exclude strain names
[x] Add uploads to S3 -> tracking in https://github.com/nextstrain/avian-flu/issues/41
[x] join NCBI w/ Anderson lab data -> tracking in https://github.com/nextstrain/avian-flu/issues/42
[x] add ingest automation -> tracking in https://github.com/nextstrain/avian-flu/issues/43
[x] nextclade run as part of ingest -> tracking in https://github.com/nextstrain/avian-flu/issues/44

joverlee521 commented 1 month ago

The h5n1-cattle-outbreak build is still using the shared config/description.md. I think we should have a separate description.md that points to the public data sources.

trvrb commented 1 month ago

Nice work here @joverlee521! I was able to run this locally and everything worked in a clean fashion. The revised exclude list looks appropriate. After revision, the only blocking issue that I see is that division isn't coming through for most of the sequences. For example A/turkey/Missouri/24-005369-001/2024 lists division as USA. You can see this issue here: https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome?c=division.

I don't see "Missouri" listed anywhere on accession PP748223. But conversely, A/Bovine/texas/24-029328-01/2024 lists division as Texas, while the accession PP599465 lists country as USA: texas.

Is there a way to collect division information more thoroughly via NCBI? Or do we need to scrape this from the strain name somewhere during ingest?

joverlee521 commented 1 month ago

> I don't see "Missouri" listed anywhere on accession PP748223. But conversely, A/Bovine/texas/24-029328-01/2024 lists division as Texas, while the accession PP599465 lists country as USA: texas.

Hmm, I see that A/Bovine/texas/24-029328-01/2024 has the correct division in the build.

> Is there a way to collect division information more thoroughly via NCBI? Or do we need to scrape this from the strain name somewhere during ingest?

This is all the data that's available through the usual metadata from NCBI, so I'll just have to scrape the strain name for additional data.
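A minimal sketch of what scraping the division from the strain name could look like, assuming strain names are "/"-separated and that a US state appears as one of the fields. The function names and the `strain`, `country`, and `division` record keys are hypothetical, and this is not necessarily the logic used in the ingest workflow.

```python
US_STATES = {
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
}


def division_from_strain(strain):
    """Return the first '/'-separated field that names a US state, else None."""
    for field in strain.split("/"):
        # Normalize case and underscores so "texas" and "new_york" both match.
        candidate = field.strip().replace("_", " ").title()
        if candidate in US_STATES:
            return candidate
    return None


def fill_division(record):
    """Fill a missing or country-level division from the strain name.

    Records that already carry a division distinct from the country are
    returned unchanged.
    """
    division = record.get("division", "")
    if division and division != record.get("country"):
        return record
    parsed = division_from_strain(record.get("strain", ""))
    return {**record, "division": parsed} if parsed else record


# Example using a strain name mentioned in this thread.
example = fill_division({
    "strain": "A/turkey/Missouri/24-005369-001/2024",
    "country": "USA",
    "division": "USA",
})
```

Only overwriting divisions that are empty or equal to the country keeps any division NCBI does provide, such as the "USA: texas" case, as the primary source.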

trvrb commented 1 month ago

> This is all the data that's available through the usual metadata from NCBI, so I'll just have to scrape the strain name for additional data.

This makes sense. I don't see a way around this. Sorry that this is so difficult.

joverlee521 commented 1 month ago

Division is now filled in more completely using metadata parsed from the strain name, as of https://github.com/nextstrain/avian-flu/pull/40/commits/19bc00eed3ac64178f3844628befd1816e27dc87:

[Image: Screenshot 2024-05-29 at 3.19.13 PM]

I'll plan to merge this tomorrow if there are no other comments.
