shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

no taxonomy for "1458427" #19

Closed nick-youngblut closed 5 years ago

nick-youngblut commented 5 years ago

For some reason, taxonkit lineage does not return a taxonomy for taxonID 1458427 (Comamonadaceae bacterium H1). I got taxonomies for all other taxa in my table (n ≈ 2000), so the issue appears to be specific to taxonID 1458427. No warning is printed.

example table

name    taxonomy_id taxonomy_lvl    kraken_assigned_reads   added_reads new_est_reads   fraction_total_reads
Calditerrivibrio nitroreducens  477976  S   53  0   53  0.00000
Streptococcus sp. oral taxon 071    712630  S   22  16  38  0.00000
Halothece sp. PCC 7418  65093   S   26  9   35  0.00000
Acinetobacter beijerinckii  262668  S   17  3   20  0.00000
Cycloclasticus zancles  1329899 S   11  0   11  0.00000
Bacillus velezensis 492670  S   86  14  100 0.00000
Atopobium rimae 1383    S   91  5   96  0.00000
Tatumella morbirosei    642227  S   66  0   66  0.00000
Paenibacillus sp. MAEPY2    1395587 S   196 10  206 0.00001
Comamonadaceae bacterium H1 1458427 S   0   0   0   0
Sulfolobales archaeon AZ1   1326980 S   10  0   10  0.00000

output

name    taxonomy_id taxonomy_lvl    kraken_assigned_reads   added_reads new_est_reads   fraction_total_reads
Calditerrivibrio nitroreducens  477976  S   53  0   53  0.00000 cellular organisms;Bacteria;Deferribacteres;Deferribacteres;Deferribacterales;Deferribacteraceae;Calditerrivibrio;Calditerrivibrio nitroreducens    131567;2;200930;68337;191393;191394;545865;477976
Streptococcus sp. oral taxon 071    712630  S   22  16  38  0.00000 cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus sp. oral taxon 071  131567;2;1783272;1239;91061;186826;1300;1301;712630
Halothece sp. PCC 7418  65093   S   26  9   35  0.00000 cellular organisms;Bacteria;Terrabacteria group;Cyanobacteria/Melainabacteria group;Cyanobacteria;Oscillatoriophycideae;Chroococcales;Aphanothecaceae;Halothece cluster;Halothece;Halothece sp. PCC 7418    131567;2;1783272;1798711;1117;1301283;1118;1890450;92682;76023;65093
Acinetobacter beijerinckii  262668  S   17  3   20  0.00000 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter beijerinckii   131567;2;1224;1236;72274;468;469;262668
Cycloclasticus zancles  1329899 S   11  0   11  0.00000 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Piscirickettsiaceae;Cycloclasticus;Cycloclasticus zancles  131567;2;1224;1236;72273;135616;34067;1329899
Bacillus velezensis 492670  S   86  14  100 0.00000 cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis group;Bacillus amyloliquefaciens group;Bacillus velezensis 131567;2;1783272;1239;91061;1385;186817;1386;653685;1938374;492670
Atopobium rimae 1383    S   91  5   96  0.00000 cellular organisms;Bacteria;Terrabacteria group;Actinobacteria;Coriobacteriia;Coriobacteriales;Atopobiaceae;Atopobium;Atopobium rimae   131567;2;1783272;201174;84998;84999;1643824;1380;1383
Tatumella morbirosei    642227  S   66  0   66  0.00000 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Tatumella;Tatumella morbirosei  131567;2;1224;1236;91347;1903409;82986;642227
Paenibacillus sp. MAEPY2    1395587 S   196 10  206 0.00001 cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Paenibacillaceae;Paenibacillus;Paenibacillus sp. MAEPY2   131567;2;1783272;1239;91061;1385;186822;44249;1395587
Comamonadaceae bacterium H1 1458427 S   0   0   0   0
Sulfolobales archaeon AZ1   1326980 S   10  0   10  0.00000 cellular organisms;Archaea;TACK group;Crenarchaeota;Thermoprotei;Sulfolobales;Sulfolobaceae;Candidatus Aramenus;Candidatus Aramenus sulfurataquae   131567;2157;1783275;28889;183924;2281;118883;2489210;1326980

command

cat TEST.tsv | taxonkit lineage --threads 12 -i 2 -t --data-dir /path/to/taxonkit/taxdump/ > TEST_tax.tsv
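
A quick way to spot rows like this one is to look for an empty lineage column. The sketch below assumes the output keeps the seven input columns and that -t appends the lineage and the lineage taxids as columns 8 and 9; rows_without_lineage is a hypothetical helper name, not part of taxonkit.

```shell
# Hypothetical helper (not part of taxonkit): list rows whose lineage is empty.
# Assumes a tab-separated file with one header line and the lineage in column 8.
# Prints the taxid (column 2) and the name (column 1) of each affected row.
rows_without_lineage() {
    awk -F '\t' 'NR > 1 && $8 == "" { print $2 "\t" $1 }' "$1"
}

# Example (file name taken from the command above):
# rows_without_lineage TEST_tax.tsv
```

For the table above this should list only the row for 1458427.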

conda-env

# packages in environment at /ebio/abt3_projects/software/dev/llmgp/.snakemake/conda/e0ee16ae:
#
# Name                    Version                   Build  Channel
blast                     2.5.0                hc0b0e79_3    bioconda
boost                     1.63.0                   py27_2    conda-forge
bracken                   2.2              py27h2d50403_1    bioconda
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
certifi                   2018.11.29            py27_1000    conda-forge
icu                       56.1                          4    conda-forge
jellyfish                 1.1.12               h2d50403_0    bioconda
kraken                    1.1                  h470a237_2    bioconda
kraken2                   2.0.7_beta      pl526h2d50403_0    bioconda
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
ncurses                   6.1                  hfc679d8_2    conda-forge
openssl                   1.0.2p               h470a237_2    conda-forge
perl                      5.26.2               h470a237_0    conda-forge
pigz                      2.3.4                         0    conda-forge
pip                       18.1                  py27_1000    conda-forge
python                    2.7.15               h33da82c_6    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
setuptools                40.6.3                   py27_0    conda-forge
sqlite                    3.26.0               hb1c47c0_0    conda-forge
taxonkit                  0.3.0                         1    bioconda
tk                        8.6.9                ha92aebf_0    conda-forge
wheel                     0.32.3                   py27_0    conda-forge
zlib                      1.2.11               h470a237_4    conda-forge

shenwei356 commented 5 years ago

Sorry I haven't responded to this issue for several months. It's likely that some taxids were merged (merged.dmp) or deleted (delnodes.dmp) in a newer version of the NCBI taxonomy database.
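
A minimal sketch of how such a taxid could be checked by hand against the taxdump files. It assumes the standard NCBI layout, where merged.dmp rows hold the old and new taxid and delnodes.dmp rows hold a single taxid, with fields separated by a tab, a pipe and a tab. taxid_status is a hypothetical helper name, not a taxonkit command.

```shell
# Hypothetical helper: report whether a taxid was merged or deleted.
# Usage: taxid_status TAXID TAXDUMP_DIR
taxid_status() {
    taxid=$1
    dir=$2
    # merged.dmp: old_taxid <TAB>|<TAB> new_taxid <TAB>|
    # With whitespace splitting, $1 is the old taxid and $3 the new one.
    new=$(awk -v id="$taxid" '$1 == id { print $3; exit }' "$dir/merged.dmp")
    if [ -n "$new" ]; then
        echo "merged into $new"
    # delnodes.dmp: taxid <TAB>|
    elif awk -v id="$taxid" '$1 == id { found = 1; exit } END { exit !found }' "$dir/delnodes.dmp"; then
        echo "deleted"
    else
        echo "not merged or deleted"
    fi
}

# Example (the directory is an assumption; use your own --data-dir):
# taxid_status 1458427 ~/.taxonkit
```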

shenwei356 commented 5 years ago

It was merged into 1458425 in 2018-12. I'll fix this soon.

$ pigz -cd taxid-changelog.csv.gz  \
    | csvtk grep -f taxid -p 1458427 \
    | csvtk cut -F -f '-lineage*' \
    | csvtk pretty
taxid     version      change   change-value   name                          rank
1458427   2014-08-01   NEW                     Comamonadaceae bacterium H1   species
1458427   2018-12-01   MERGE    1458425        Comamonadaceae bacterium H1   species

$ echo 1458425 | taxonkit lineage 
1458425 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei

shenwei356 commented 5 years ago

We check deleted and merged taxids now.

$ echo 123124124,3,92489,1458427,562 | rush -k -D , \
    | taxonkit lineage --verbose
13:26:21.451 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp
13:26:21.573 [INFO] 415424 delnodes parsed
13:26:21.573 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp
13:26:21.596 [INFO] 54478 merged nodes parsed
13:26:21.596 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
13:26:23.585 [INFO] 2121511 names parsed
13:26:23.585 [INFO] parsing nodes file: /home/shenwei/.taxonkit/nodes.dmp
13:26:25.649 [INFO] 2121511 nodes parsed
13:26:25.649 [WARN] taxid 123124124 not found
13:26:25.649 [WARN] taxid 3 was deleted
13:26:25.649 [WARN] taxid 92489 was merged into 796334
13:26:25.649 [WARN] taxid 1458427 was merged into 1458425
123124124
3
92489   cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
562     cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

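Switch on the flag -c/--show-status-code if you want to check which taxids were deleted or merged. The status code is printed in the second column: -1 for a taxid that was not found, 0 for a deleted taxid, the new taxid for a merged one, and the taxid itself when nothing changed.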

$ go build && echo 123124124,3,92489,1458427,562 | rush -k -D , | ./taxonkit lineage -c 
13:29:37.845 [WARN] taxid 123124124 not found
13:29:37.845 [WARN] taxid 3 was deleted
13:29:37.845 [WARN] taxid 92489 was merged into 796334
13:29:37.845 [WARN] taxid 1458427 was merged into 1458425
123124124       -1
3       0
92489   796334  cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 1458425 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
562     562     cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

Then you can filter the result (saved here as result.txt):

# not found
awk '$2<0' result.txt

# deleted
awk '$2==0' result.txt

# merged
awk '$2 > 0 && $1 != $2' result.txt
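
The three filters above can also be combined into a single pass that counts the taxids in each category. This is only a sketch: summarize_status is a hypothetical helper name, and it assumes the first two columns of the file are the query taxid and the status code, as in the -c output shown earlier.

```shell
# Hypothetical helper: count taxids per status in `taxonkit lineage -c` output.
# Assumes column 1 is the query taxid and column 2 the status code.
summarize_status() {
    awk '
        NF < 2   { next }                      # skip blank or malformed lines
        $2 < 0   { n["not_found"]++; next }    # code -1
        $2 == 0  { n["deleted"]++;   next }    # code 0
        $1 != $2 { n["merged"]++;    next }    # code is the new taxid
                 { n["unchanged"]++ }          # code equals the query taxid
        END {
            printf "not_found\t%d\n", n["not_found"]
            printf "deleted\t%d\n",   n["deleted"]
            printf "merged\t%d\n",    n["merged"]
            printf "unchanged\t%d\n", n["unchanged"]
        }
    ' "$1"
}

# Example:
# summarize_status result.txt
```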