shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

duplicate taxid from name2taxid #29

Closed slaperriere closed 4 years ago

slaperriere commented 4 years ago

Hello,

I am getting duplicate values from name2taxid when running

taxonkit name2taxid -i 2 filename

My input:

ESP_3 Bacteria
ESP_84 Bacteria
ESP_136 Bacteria
ESP_149 Bacteria
ESP_166 Bacteria
ESP_169 Bacteria
ESP_181 Bacteria
ESP_187 Bacteria
ESP_196 Bacteria

Output:

ESP_3 Bacteria 2
ESP_3 Bacteria 629395
ESP_84 Bacteria 2
ESP_84 Bacteria 629395
ESP_136 Bacteria 2
ESP_136 Bacteria 629395
ESP_149 Bacteria 2
ESP_149 Bacteria 629395
ESP_166 Bacteria 2
ESP_166 Bacteria 629395

Some lines as seen above are duplicated with a different taxid. There are no duplicates in the input.

Do you know what could be causing this?

Thank you!

shenwei356 commented 4 years ago

Thanks for reporting this.

taxonkit name2taxid searches both scientific names and synonyms, and taxid 629395 has Bacteria as a synonym:

629395  |       Bacteria        |       Bacteria <stick insect> |       synonym |
629395  |       Bacteria Latreille et al. 1825  |               |       scientific name |
629395  |       Bacteria Latreille, Peletier de Saint Fargeau, Serville & Guerin, 1825  |               |       authority       |
629395  |       Bacteria stick insect   |               |       common name     |
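To see which taxids a given name resolves to, and under which name class, one can scan names.dmp directly. The following is only a rough sketch and not how taxonkit itself is implemented: `lookup_name` is a hypothetical helper, and it assumes the standard NCBI taxdump layout, where fields are separated by tab, pipe, tab and each row ends with tab, pipe.

```shell
# Hypothetical helper (not part of taxonkit): list every taxid whose
# name field in names.dmp equals the query exactly, with its name class.
#   $1 = name to look up
#   $2 = path to names.dmp (assumed NCBI taxdump format)
lookup_name() {
    awk -F '\t[|]\t' -v name="$1" '
        $2 == name {
            sub(/\t[|]$/, "", $4)   # strip the trailing "\t|" row terminator
            print $1 "\t" $4        # taxid, name class
        }' "$2"
}
```

For example, `lookup_name Bacteria path/to/names.dmp` prints one line per matching row, so a name that is a scientific name for one taxid and a synonym for another shows up twice.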

I have added a new flag, -s/--sci-name, that searches scientific names only.

slaperriere commented 4 years ago

Great, thank you! It looks like it solved most of the problem.

However, I am still getting some duplicates. Some examples are:

ESP_48538 Paracoccus 265
ESP_48538 Paracoccus 249411
ESP_764 Actinobacteria 1760
ESP_764 Actinobacteria 201174
ESP_17204 Vertebrata 7742
ESP_17204 Vertebrata 1261581

shenwei356 commented 4 years ago

This is not a bug if you have switched on -s. Some taxids really do share the same scientific name; you can check their lineages to tell them apart. For such names, the input line is output once for each matching taxid. You can deduplicate these lines using awk or csvtk, or I can add a new flag.
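As a rough illustration of the awk route, here are two possible approaches. Both function names are hypothetical, and both assume tab-delimited output with three columns (ID, name, taxid), as in the examples in this thread. The first keeps whichever taxid is printed first, which may not be the one you want; the second keeps every candidate in a single line.

```shell
# Hypothetical helper: keep only the first line seen for each
# (ID, name) pair; later taxids for the same pair are dropped.
dedup_keep_first() {
    awk -F '\t' '!seen[$1 FS $2]++'
}

# Hypothetical helper: merge all taxids for the same (ID, name) pair
# into one comma-separated field, preserving input order.
collapse_taxids() {
    awk -F '\t' -v OFS='\t' '
        {
            key = $1 FS $2
            if (!(key in ids)) { order[++n] = key; ids[key] = $3 }
            else               { ids[key] = ids[key] "," $3 }
        }
        END { for (i = 1; i <= n; i++) print order[i], ids[order[i]] }'
}
```

Either function reads from standard input, for example `taxonkit name2taxid -i 2 -s filename | collapse_taxids`. Collapsing is the safer default when it is not obvious which taxid is intended, since the lineage of each candidate can still be checked afterwards.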

shenwei356 commented 4 years ago

@slaperriere Can I close this issue?

slaperriere commented 4 years ago

Yes. Thank you for your help!