nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue
8 stars 10 forks source link

Bug: Update dropped strains file to list accession instead of strain #24

Closed j23414 closed 6 months ago

j23414 commented 7 months ago

Current Behavior

Currently, strains listed in phylogenetic/config/dropped_strains.txt are not being dropped since https://github.com/nextstrain/dengue/commit/8ab810f10cc64e45a9c2baf1bf6df9d6561578bc

Expected behavior

Strains listed in dropped_strains.txt are not in the final phylogenetic tree.

How to reproduce

Possible solution

Perhaps cherry pick a commit like:

Your environment: if browsing Nextstrain online

Your environment: if running Nextstrain locally

Additional context

Add any other context about the problem here.

victorlin commented 7 months ago

Good catch! If I understand correctly, phylogenetic/config/dropped_strains.txt is used for augur filter --exclude so it should be updated alongside 8ab810f10cc64e45a9c2baf1bf6df9d6561578bc. Looking more carefully at #12, the --sequences input should also be updated to match the new ID values but I don't see that it was changed. Does it still work?

j23414 commented 7 months ago

Correct, and yes the --sequences input still works :D

git clone https://github.com/nextstrain/dengue.git
cd dengue/phylogenetic
nextstrain build . data/sequences_all.fasta
grep  ">" data/sequences_all.fasta | head -n5

Which shows the sequences are ID'd by accession, not strain name:

>NC_075403
>NC_075435
>OQ919688
>ON123563
>ON123564
victorlin commented 7 months ago

Good to know! But how did it work before #12 if the sequences were ID'd by accession and augur filter was using strain as the ID column?

j23414 commented 7 months ago

augur filter was using strain as the ID column?

This worked when we were still pulling strains from fauna (which also took advantage of the deduplication of strain names of fauna).

However, we shifted to using the ingest folder and pulling data using ncbi datasets. After ingest was merged, the data.nextstrain.org/files were updated so we could have a smooth transition.

victorlin commented 6 months ago

Closed by #26