shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License
369 stars, 29 forks

Filter duplicate taxIDs by order/class? #84

Closed. philippbayer closed this issue 1 year ago

philippbayer commented 1 year ago

Describe your issue

As described in the manual for name2taxid, some names like 'Drosophila' return several taxonomy IDs. Would it be possible to add a flag (either in name2taxid or filter) that retains taxonomy IDs if they are in a specific class or order? In my case I'm only interested in fishes, so if the other duplicates are from plants etc. I could just remove them.

For example,

echo Rondeletia bicolor | taxonkit name2taxid | taxonkit filter --tax_subset 7898

would keep only 1311492 and discard 1368176 (as 1311492 is 'within' Actinopterygii, 7898). I have yet to encounter duplicate taxonomy IDs within the same class/superclass.

shenwei356 commented 1 year ago

Currently, you can filter the result using a whitelist of taxIDs generated by taxonkit list.

$ echo Rondeletia bicolor \
    | taxonkit name2taxid \
    | csvtk grep -Ht -f 2 -P <(taxonkit list --ids 7898 -I "")
Rondeletia bicolor      1311492
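
When the same clade is used for many queries, it may be convenient to write the whitelist to a file once and reuse it. The following is a minimal sketch of that idea, not part of taxonkit: the function name `filter_by_whitelist` and the file name `fish_taxids.txt` are hypothetical, and plain awk stands in for `csvtk grep`. It assumes tab-separated input with the taxID in the second column, as in the name2taxid output above.

```shell
# Hypothetical helper (not a taxonkit command): keep only the rows of
# tab-separated name2taxid output whose second column is a whitelisted taxID.
# Usage: filter_by_whitelist WHITELIST_FILE < name2taxid_output
filter_by_whitelist() {
    # First input (the whitelist file): remember every non-empty taxID.
    # Second input (stdin, "-"): print rows whose column 2 was remembered.
    awk -F'\t' '
        NR == FNR { if ($1 != "") keep[$1]; next }
        $2 in keep
    ' "$1" -
}

# Assumed workflow, left commented out because it needs taxonkit and the
# NCBI taxdump files:
#   taxonkit list --ids 7898 -I "" > fish_taxids.txt
#   echo Rondeletia bicolor | taxonkit name2taxid | filter_by_whitelist fish_taxids.txt
```

Skipping empty whitelist lines means that names with no taxID (an empty second column) are dropped rather than matched by accident.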

philippbayer commented 1 year ago

Thank you!!! That is good enough for me; I only work with fish, so it's usually 7898, sometimes 7777. I'll close the issue.

philippbayer commented 1 year ago

After implementing this filter I found 6 fish with duplicate taxonomy IDs where both IDs are within the fishes.

It seems that for those, taxonkit returns the 'good' taxonomy ID first, followed by the outdated one. So I got rid of the second hit by using awk to keep only the first row taxonkit returns per query ID:

grep '>' BOLD_chordata_28_07_2023.namesOnly.COI_species_only.fasta | sed 's/|/\t/g' | taxonkit name2taxid -i 2 | awk -F'\t' '!a[$1]++' > Taxids.txt

These are the affected species in my dataset (I'm sure there are others): Bryconamericus caucanus, Centropogon australis, Oreonectes furcocaudalis, Pelvicachromis signatus, Trichomycterus variegatus.
But it works, so I'm keeping this closed and leaving it here for others in the future :)
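
The awk step in the pipeline above keeps the first row seen for each value in column 1 and drops later rows with the same value. A minimal sketch of that idiom on its own follows. The function name `keep_first_hit` is hypothetical, and the approach relies on the observation reported in this thread, not on any documented guarantee, that taxonkit lists the current taxID first.

```shell
# Hypothetical helper: keep only the first row for each distinct value of
# column 1 in tab-separated input read from stdin.
keep_first_hit() {
    # seen[$1]++ evaluates to 0 (false) the first time a key appears, so
    # !seen[$1]++ is true only for the first row carrying that key.
    awk -F'\t' '!seen[$1]++'
}

# Assumed usage, left commented out because it needs taxonkit; names.tsv is a
# hypothetical file with a query ID in column 1 and a species name in column 2:
#   taxonkit name2taxid -i 2 < names.tsv | keep_first_hit > Taxids.txt
```

One design note: prefixing the condition with `NR==1 ||` (an idiom for passing a header row through) short-circuits the counter on the first row, so that row's key is never recorded and one later duplicate of it would slip through. For headerless input the bare `!seen[$1]++` form is enough.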