Closed nick-youngblut closed 5 years ago
Sorry I haven't responded to this issue for several months. It's likely that some taxids were merged (merged.dmp) or deleted (delnodes.dmp) in newer versions of the NCBI taxonomy database.
Taxid 1458427 was merged into 1458425 in 2018-12. I'll fix this soon.
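For anyone who wants to check a taxid by hand, here is a minimal sketch of looking it up in the NCBI dump files. It assumes the standard taxdump layout, where merged.dmp rows look like `old_taxid<TAB>|<TAB>new_taxid<TAB>|` and delnodes.dmp rows look like `taxid<TAB>|`. The function name `taxid_status` and the tiny sample files are made up for illustration, and this is not necessarily how TaxonKit does it internally.

```shell
# Hypothetical helper: report whether a taxid was merged or deleted.
# Usage: taxid_status TAXID DUMP_DIR
taxid_status() {
    taxid="$1"
    dir="$2"
    # merged.dmp columns (tab-separated): old_taxid, "|", new_taxid, "|"
    new=$(awk -F'\t' -v id="$taxid" '$1 == id { print $3; exit }' "$dir/merged.dmp")
    if [ -n "$new" ]; then
        echo "merged into $new"
    # delnodes.dmp columns (tab-separated): taxid, "|"
    elif awk -F'\t' -v id="$taxid" '$1 == id { found = 1; exit } END { exit !found }' "$dir/delnodes.dmp"; then
        echo "deleted"
    else
        echo "not merged or deleted"
    fi
}

# Tiny sample files in the taxdump layout (taxids taken from this thread).
demo_dir=$(mktemp -d)
printf '92489\t|\t796334\t|\n1458427\t|\t1458425\t|\n' > "$demo_dir/merged.dmp"
printf '3\t|\n' > "$demo_dir/delnodes.dmp"

taxid_status 1458427 "$demo_dir"
taxid_status 3 "$demo_dir"
taxid_status 562 "$demo_dir"
```

With a real taxdump, point the second argument at the directory holding merged.dmp and delnodes.dmp instead of the sample directory.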
$ pigz -cd taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 1458427 \
| csvtk cut -F -f -lineage* \
| csvtk pretty
taxid version change change-value name rank
1458427 2014-08-01 NEW Comamonadaceae bacterium H1 species
1458427 2018-12-01 MERGE 1458425 Comamonadaceae bacterium H1 species
$ echo 1458425 | taxonkit lineage
1458425 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
We now check for deleted and merged taxids.
$ echo 123124124,3,92489,1458427,562 | rush -k -D , \
| taxonkit lineage --verbose
13:26:21.451 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp
13:26:21.573 [INFO] 415424 delnodes parsed
13:26:21.573 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp
13:26:21.596 [INFO] 54478 merged nodes parsed
13:26:21.596 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
13:26:23.585 [INFO] 2121511 names parsed
13:26:23.585 [INFO] parsing nodes file: /home/shenwei/.taxonkit/nodes.dmp
13:26:25.649 [INFO] 2121511 nodes parsed
13:29:25.649 [WARN ] taxid 123124124 not found
13:26:25.649 [WARN] taxid 3 was deleted
13:26:25.649 [WARN] taxid 92489 was merged into 796334
13:26:25.649 [WARN] taxid 1458427 was merged into 1458425
123124124
3
92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
562 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
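Switch on the flag -c/--show-status-code if you want to check which taxids were deleted or merged. Codes: -1 for not found, 0 for deleted, the new taxid for merged, and the taxid itself for unchanged.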
$ go build && echo 123124124,3,92489,1458427,562 | rush -k -D , | ./taxonkit lineage -c
13:29:37.845 [WARN] taxid 123124124 not found
13:29:37.845 [WARN] taxid 3 was deleted
13:29:37.845 [WARN] taxid 92489 was merged into 796334
13:29:37.845 [WARN] taxid 1458427 was merged into 1458425
123124124 -1
3 0
92489 796334 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 1458425 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
562 562 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
Then you can filter the result:
# not found
awk '$2<0' result.txt
# deleted
awk '$2==0' result.txt
# merged
awk '$2 > 0 && $1 != $2' result.txt
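If it helps, the three filters can be folded into one pass that labels every row. This is a rough sketch, assuming result.txt is the tab-separated output of `taxonkit lineage -c` (taxid, status code, lineage). The `label_status` name and the sample file are made up for illustration, with the lineage column shortened to the species name.

```shell
# Hypothetical helper: label each row of `taxonkit lineage -c` output.
# Column 1 is the query taxid, column 2 is the status code.
label_status() {
    awk -F'\t' '{
        if ($2 < 0)        status = "not_found"   # code -1
        else if ($2 == 0)  status = "deleted"     # code 0
        else if ($1 != $2) status = "merged"      # code is the new taxid
        else               status = "ok"          # code equals the query taxid
        print $1 "\t" status
    }' "$1"
}

# Sample rows based on the output above (lineage shortened to the species name).
demo_result=$(mktemp)
printf '123124124\t-1\t\n3\t0\t\n92489\t796334\tErwinia oleae\n1458427\t1458425\tSerpentinomonas raichei\n562\t562\tEscherichia coli\n' > "$demo_result"

label_status "$demo_result"
```

Piping the labels through `cut -f 2 | sort | uniq -c` gives a quick count of how many taxids fall into each group.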
For some reason, taxonkit lineage does not return a taxonomy for taxonID 1458427, which is Comamonadaceae bacterium H1. I got taxonomies for all other taxa in my table (n =~ 2000), so it just appears to be an issue with taxonID 1458427. There is no warning.