shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Possible issue with `filter --save-predictable-norank` and merged taxids #80

Closed. standage closed this issue 1 year ago.

standage commented 1 year ago

Describe your issue

I've had the equivalent of the following code in pytaxonkit's test suite for a while now.

echo -e "131567\n2\n1224\n1236\n91347\n543\n561\n562\n2605619\n10239\n2731341\n2731360\n2731618\n2731619\n28883\n10699\n196894\n1327037\n" \
    | taxonkit filter --threads 1 --equal-to species --lower-than species --save-predictable-norank

In recent weeks this test started causing CI failures: the command would just hang for hours. Only today have I had a chance to track down the issue. After a bit of trial and error, I discovered (with the `taxonkit lineage` command) that a few of these taxids had been merged.

echo -e "131567\n2\n1224\n1236\n91347\n543\n561\n562\n2605619\n10239\n2731341\n2731360\n2731618\n2731619\n28883\n10699\n196894\n1327037\n" \
    | taxonkit lineage
14:50:33.260 [WARN] taxid 28883 was merged into 2731619
14:50:33.260 [WARN] taxid 10699 was merged into 2731619
14:50:33.260 [WARN] taxid 196894 was merged into 2788787
131567  cellular organisms
2       cellular organisms;Bacteria
1224    cellular organisms;Bacteria;Pseudomonadota
1236    cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria
91347   cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales
543     cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae
561     cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
562     cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2605619 cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli O16:H48
10239   Viruses
2731341 Viruses;Duplodnaviria
2731360 Viruses;Duplodnaviria;Heunggongvirae
2731618 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota
2731619 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
28883   Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
10699   Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
196894  Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;unclassified Caudoviricetes
1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;unclassified Caudoviricetes;Croceibacter phage P2559Y

When I dropped or replaced the merged taxids, the problem went away and I got the expected answer.

echo -e "131567\n2\n1224\n1236\n91347\n543\n561\n562\n2605619\n10239\n2731341\n2731360\n2731618\n2731619\n2788787\n1327037" \
    | taxonkit filter --threads 1 --equal-to species --lower-than species --save-predictable-norank
562
2605619
1327037
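
For anyone who hits the same hang before upgrading, the replacement step can be scripted rather than done by hand. What follows is only a sketch: `resolve_merged` is a hypothetical helper, not part of taxonkit, and it assumes the standard NCBI taxdump layout in which each line of merged.dmp reads `old_taxid<TAB>|<TAB>new_taxid<TAB>|`. The path `~/.taxonkit/merged.dmp` in the usage comment assumes taxonkit's default data directory.

```shell
# Hypothetical helper (not part of taxonkit): rewrite merged taxids to the
# taxids they were merged into, using merged.dmp from the NCBI taxdump.
#   $1     path to merged.dmp (assumed non-empty)
#   stdin  taxids, one per line
resolve_merged() {
    awk -F'\t' '
        NR == FNR { new_id[$1] = $3; next }   # first file: build old -> new map
        $1 == ""  { next }                    # drop blank input lines
        { print (($1 in new_id) ? new_id[$1] : $1) }
    ' "$1" -
}

# Usage sketch (requires taxonkit and its taxdump files):
#   echo -e "562\n28883\n196894" \
#       | resolve_merged ~/.taxonkit/merged.dmp \
#       | taxonkit filter --threads 1 --equal-to species --lower-than species --save-predictable-norank
```

Several old taxids can map to the same new taxid (here both 28883 and 10699 became 2731619), so the output may contain duplicates. Deleted taxids are not handled by this sketch.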

Can you confirm that taxonkit filter is choking on merged taxids here?

shenwei356 commented 1 year ago

It's a bug. The filter command did not check for merged/deleted taxids ...

standage commented 1 year ago

That fixes it. Thanks!