Closed joverlee521 closed 1 month ago
The h5n1-cattle-outbreak build is still using the shared config/description.md. I think we should have a separate description.md that points to the public data sources.
Nice work here @joverlee521! I was able to run this locally and everything worked in a clean fashion. The revised exclude list looks appropriate. After revision, the only blocking issue that I see is that `division` isn't coming through for most of the sequences. For example `A/turkey/Missouri/24-005369-001/2024` lists `division` as `USA`. You can see this issue here: https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome?c=division.
I don't see "Missouri" listed anywhere on accession PP748223. But conversely, `A/Bovine/texas/24-029328-01/2024` lists `division` as `Texas`, while the accession PP599465 lists country as `USA: texas`.
Is there a way to more thoroughly collect division information via NCBI? Or do we need to somewhere scrape this from strain name during ingest?
> I don't see "Missouri" listed anywhere on accession PP748223. But conversely, `A/Bovine/texas/24-029328-01/2024` lists `division` as `Texas`, while the accession PP599465 lists country as `USA: texas`.

Hmm, I see that `A/Bovine/texas/24-029328-01/2024` has the correct division in the build.
> Is there a way to more thoroughly collect division information via NCBI? Or do we need to somewhere scrape this from strain name during ingest?

This is all the data that's available through the usual metadata from NCBI, I'll just have to scrape the strain name for additional data.
> This is all the data that's available through the usual metadata from NCBI, I'll just have to scrape the strain name for additional data.

This makes sense. I don't see a way around this. Sorry that this is so difficult.
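
For readers following along, here is a minimal sketch of what scraping a division out of a strain name could look like. It is not the code used in this PR: the function name `division_from_strain` and the `KNOWN_DIVISIONS` lookup are hypothetical, the lookup is deliberately incomplete, and the sketch assumes strain names follow the slash-separated pattern seen in the examples above.

```python
from typing import Optional

# Hypothetical, deliberately incomplete lookup from lowercase names to divisions.
# A real ingest would need a complete list and probably manual annotations too.
KNOWN_DIVISIONS = {
    "texas": "Texas",
    "missouri": "Missouri",
    "kansas": "Kansas",
    "new mexico": "New Mexico",
}


def division_from_strain(strain: str) -> Optional[str]:
    """Return a division guessed from a strain name, or None if nothing matches.

    Assumes slash-separated names such as A/<host>/<location>/<id>/<year>.
    """
    fields = strain.split("/")
    # Skip the leading type ("A") and the trailing year, then check the rest.
    for field in fields[1:-1]:
        key = field.strip().lower().replace("_", " ")
        if key in KNOWN_DIVISIONS:
            return KNOWN_DIVISIONS[key]
    return None
```

Checking every middle field, rather than a fixed position, also tolerates names without a host field, at the cost of relying entirely on the lookup table.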
Division is more filled out with metadata from the strain name using https://github.com/nextstrain/avian-flu/pull/40/commits/19bc00eed3ac64178f3844628befd1816e27dc87:
I'll plan to merge this tomorrow if there are no other comments.
Description of proposed changes
NCBI Datasets does not include the `strain`, `serotype`, and `segment` in the metadata, so this PR uses the NCBI Virus `vvsearch2` API to pull down the metadata and joins it with the sequences downloaded via NCBI Datasets.

On this branch, I can run the ingest workflow with
then successfully run the genome build (~using tmp changes in 4b511fc25620ea1fd6ef24846372ffc124344134~)
~Still needs some clean up but at least wanted to get this out for people to try out.~
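
As a rough illustration of the join described above, the sketch below matches metadata rows to sequences by accession using only the Python standard library. It is not the workflow's actual implementation: the function names, the `accession` column name, and the sample records are all hypothetical, and it assumes the FASTA header starts with the same accession string that appears in the metadata.

```python
import csv
import io


def read_fasta(handle):
    """Yield (accession, sequence) pairs; the accession is the first word of each header."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def join_metadata(metadata_handle, fasta_handle, key="accession"):
    """Inner-join TSV metadata rows with FASTA records on the accession column."""
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    metadata = {row[key]: row for row in reader}
    joined = []
    for accession, sequence in read_fasta(fasta_handle):
        row = metadata.get(accession)
        if row is None:
            continue  # sequence without metadata is dropped in this sketch
        joined.append({**row, "sequence": sequence})
    return joined


# Made-up records, for illustration only.
metadata_tsv = io.StringIO(
    "accession\tstrain\tsegment\n"
    "ACC0001\tA/example/one/2024\tHA\n"
    "ACC0002\tA/example/two/2024\tNA\n"
)
fasta = io.StringIO(">ACC0001 example record\nACGT\nAC\n>ACC0003\nGG\n")
records = join_metadata(metadata_tsv, fasta)
```

An inner join keeps only records present in both sources; whether unmatched sequences should instead be reported is a design choice this sketch leaves open.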
Major change
This PR completely changes over the h5n1-cattle-outbreak build to the NCBI data. See the latest build at https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome
Related issue(s)
Related to #37
Checklist
`nextclade run` as part of ingest -> tracking in https://github.com/nextstrain/avian-flu/issues/44