Open apcamargo opened 2 years ago
Yes, NCBI taxonomy uses consecutive numbers too. I guess they have a mapping table to maintain these relationships.
For reference, this is the script I used to make the taxids sequential: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py
When you create a taxdump using `create-taxdump` (ICTV taxonomy, for example), the taxids "skip" some numbers. For example:

This is not a problem in itself, as the nodes are still connected. However, it causes a bug when you try to create an MMSeqs2 taxonomy database from the custom taxonomy, as MMSeqs2 apparently assumes that numbers are not skipped (unless they are in delnodes.dmp and merged.dmp, I guess).
I wrote a script that remaps the taxids so that no number is skipped, and that solved the issue.
This is not a TaxonKit bug in any way. But because MMSeqs2 is pretty popular, I thought it was best to report this here in case anyone else faces the same issue.
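
For anyone who wants to see the idea behind the remapping, here is a minimal Python sketch. It is not the linked `fix_taxdump.py`, just an illustration under a few assumptions: the files use the usual taxdump layout (fields separated by `\t|\t`, lines ending in `\t|`), the root node is the one whose parent is itself, and it is acceptable for the root to become taxid 1. The function names (`split_dmp`, `join_dmp`, `build_mapping`, `remap_lines`) are made up for this example.

```python
def split_dmp(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing "\t|" terminator
    return line.split("\t|\t")


def join_dmp(fields):
    """Rebuild a taxdump line from its fields."""
    return "\t|\t".join(fields) + "\t|\n"


def build_mapping(node_lines):
    """Map every old taxid in nodes.dmp to a consecutive number starting at 1.

    Assumption: the root is the node that is its own parent; it is numbered
    first so that it ends up as taxid 1.
    """
    pairs = [(int(f[0]), int(f[1])) for f in map(split_dmp, node_lines)]
    roots = sorted(taxid for taxid, parent in pairs if taxid == parent)
    others = sorted(taxid for taxid, parent in pairs if taxid != parent)
    return {old: new for new, old in enumerate(roots + others, start=1)}


def remap_lines(lines, mapping, taxid_columns):
    """Rewrite the given columns of each line using the old -> new mapping."""
    remapped = []
    for line in lines:
        fields = split_dmp(line)
        for col in taxid_columns:
            fields[col] = str(mapping[int(fields[col])])
        remapped.append(join_dmp(fields))
    return remapped
```

In nodes.dmp both the taxid and the parent taxid need remapping (columns 0 and 1), while names.dmp only has the taxid in column 0, so usage would look like `remap_lines(node_lines, mapping, (0, 1))` and `remap_lines(name_lines, mapping, (0,))`. The sketch does not touch delnodes.dmp or merged.dmp, and any sequence-to-taxid mapping used when building the database would need the same old-to-new translation.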