shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

taxids created with `create-taxdump` skip numbers #59

Open apcamargo opened 2 years ago

apcamargo commented 2 years ago

When you create a taxdump using create-taxdump (ICTV taxonomy, for example), the taxids "skip" some numbers. For example:

$ head ictv-taxdump/names.dmp
1   |   root    |       |   scientific name |
287205  |   Hoswirudivirus MRV1 |       |   scientific name |
287935  |   Shomudavirus limadaptatum   |       |   scientific name |
1096518 |   Sclerotimonavirus betaclarireediae  |       |   scientific name |
1138752 |   Potato virus H  |       |   scientific name |
1536674 |   Rhopapillomavirus 1 |       |   scientific name |
1845995 |   Monomorium pharaonis virus 1    |       |   scientific name |
1890985 |   Aquamavirus A   |       |   scientific name |
2079526 |   Hylipavirus |       |   scientific name |
2290567 |   Fattrevirus |       |   scientific name |

This is not a problem in itself, as the nodes are still connected. However, it causes a bug when you try to create an MMseqs2 taxonomy database from the custom taxonomy, as MMseqs2 apparently assumes that no numbers are skipped (unless they are listed in delnodes.dmp or merged.dmp, I guess).

I wrote a script that remaps the taxids so that no number is skipped, and it solved the issue:

$ head ictv-taxdump/names.dmp
1   |   root    |       |   scientific name |
2   |   Hoswirudivirus MRV1 |       |   scientific name |
3   |   Shomudavirus limadaptatum   |       |   scientific name |
4   |   Sclerotimonavirus betaclarireediae  |       |   scientific name |
5   |   Potato virus H  |       |   scientific name |
6   |   Rhopapillomavirus 1 |       |   scientific name |
7   |   Monomorium pharaonis virus 1    |       |   scientific name |
8   |   Aquamavirus A   |       |   scientific name |
9   |   Hylipavirus |       |   scientific name |
10  |   Fattrevirus |       |   scientific name |

This is not a TaxonKit bug in any way. But because MMseqs2 is pretty popular, I thought it was best to report this here in case anyone else faces the same issue.

shenwei356 commented 2 years ago

Yes, NCBI taxonomy uses consecutive numbers too. I guess they have a mapping table to maintain these relationships.

apcamargo commented 2 years ago

For reference, this is the script I used to make the taxids sequential: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py
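
For readers who just want the idea, here is a minimal sketch of this kind of remapping. It is not the linked script. It assumes the standard NCBI-style `.dmp` layout (fields separated by `\t|\t`, rows ending in `\t|`), with the taxid in the first column of both files and the parent taxid in the second column of nodes.dmp. The function names are hypothetical, chosen only for illustration.

```python
def parse_dmp_line(line):
    """Split one NCBI-style .dmp row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def format_dmp_line(fields):
    """Join fields back into an NCBI-style .dmp row."""
    return "\t|\t".join(fields) + "\t|\n"


def build_mapping(node_lines):
    """Map each old taxid (as a string) to a sequential one, starting at 1.

    Taxids are sorted numerically, so a root with taxid 1 stays 1
    (assumption: the root has the smallest taxid).
    """
    taxids = sorted(
        {parse_dmp_line(line)[0] for line in node_lines if line.strip()},
        key=int,
    )
    return {old: str(new) for new, old in enumerate(taxids, start=1)}


def renumber(lines, mapping, columns):
    """Rewrite the taxid fields found in `columns` using `mapping`."""
    out = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = parse_dmp_line(line)
        for col in columns:
            fields[col] = mapping[fields[col]]
        out.append(format_dmp_line(fields))
    return out
```

To use it, read the lines of nodes.dmp and names.dmp, call `build_mapping` on the nodes, then `renumber(nodes, mapping, (0, 1))` and `renumber(names, mapping, (0,))`, and write the results back. Any sequence-to-taxid mapping file used alongside the taxdump would presumably need the same mapping applied.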