shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License
357 stars 29 forks source link

"Error: Invalid taxonomic rank: realm" when making Diamond db of IMG/VR dataset #76

Closed poursalavati closed 1 year ago

poursalavati commented 1 year ago

Hello and thanks for developing this great tool,

I am currently in the process of preparing a Diamond database from IMG/VR data. I have already prepared the necessary data and successfully created the name and node files, as well as the taxid.map file.

However, I encountered an error during the diamond makedb process: "Error: Invalid taxonomic rank: realm".

I have attached the name and node files for your reference. The taxid.map file is quite large, but I have included the head of it in the links below:

nodes.dmp.txt names.dmp.txt head_id2tax.txt

For taxonkit step, I have attached the head of the input file in the link below: head_id2lin.txt

./taxonkit create-taxdump -A 1 id2lin -O example/taxdump -j 50

Diamond making db:

diamond makedb --in ../IMGVR_all_proteins.faa --db IMGVR_Taxoned --taxonmap example/taxdump/taxid-h.map --taxonnodes example/taxdump/nodes.dmp --taxonnames example/taxdump/names.dmp -p 40

Thank you in advance for any help you can provide regarding this issue. If you have any suggestions or solutions, I would greatly appreciate your assistance.
Naser,

shenwei356 commented 1 year ago

Hi Naser, DIAMOND seems not to recognize the rank of realm, so you can simply remove the column realm from the id2lin file with csvtk:

csvtk  cut -t -f -realm id2lin -o id2lin.no-realm

If DIAMOND does not support kingdom either, you may need to rename it superkingdom.

poursalavati commented 1 year ago

Thanks a lot for the suggestion. I've checked the diamond source code and instead of removing the realm, renamed it to superkingdom based on this:

const char* Rank::names[] = {
    "no rank", "superkingdom", "kingdom", "subkingdom", "superphylum", "phylum", "subphylum", "superclass", "class", "subclass", "infraclass", "cohort", "subcohort", "superorder",
    "order", "suborder", "infraorder", "parvorder", "superfamily", "family", "subfamily", "tribe", "subtribe", "genus", "subgenus", "section", "subsection", "series", "species group",
    "species subgroup", "species", "subspecies", "varietas", "forma", "strain", "biotype", "clade", "forma specialis", "genotype", "isolate", "morph", "pathogroup", "serogroup", "serotype", "subvariety"
};

But now there is another issue after making db, almost all sequences have no rank! seems an issue in node file.

Its end of diamond stdout:

Loading taxonomy names...  [0.021s]
**Loaded taxonomy names for 0 taxon ids.**
Loading taxonomy mapping file...  [355.924s]
Joining accession mapping...  [82.928s]
Writing taxon id list...  [2.292s]
Building taxonomy nodes...  [11.996s]
2147258865 taxonomy nodes processed.
Number of nodes assigned to rank:
**no rank           2147251888**
superkingdom      6
kingdom           10
subkingdom        0
superphylum       0
phylum            17
subphylum         0
superclass        39
class             0
subclass          0
infraclass        0
cohort            0
subcohort         0
superorder        0
order             64
suborder          0
infraorder        0
parvorder         0
superfamily       0
family            206
subfamily         0
tribe             0
subtribe          0
genus             2116
subgenus          0
section           0
subsection        0
series            0
species group     0
species subgroup  0
species           4519
subspecies        0
varietas          0
forma             0
strain            0
biotype           0
clade             0
forma specialis   0
genotype          0
isolate           0
morph             0
pathogroup        0
serogroup         0
serotype          0
subvariety        0

Closing the input file...  [0s]
Closing the database file...  [0.015s]

Database sequences                   220799163
Database letters                     49459660621
Accessions in database               220799163
Entries in accession to taxid file   216984561
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0
Database hash                        58748cfe915c91e69a43a88c27aa3e8b
Total time                           2155s

I already sent the node and name file. I would greatly appreciate your assistance if you have any suggestions or solutions.

poursalavati commented 1 year ago

I think the point is that I have a large number of sequences for each taxonomy id. And it seems that only one of each has been identified (or indexed) by Diamond. for example:

IMGVR_UViG_638276111_000001|638276111|638297712 541518477
IMGVR_UViG_638276111_000001|638276111|638297713 541518477
IMGVR_UViG_638276111_000001|638276111|638297714 541518477
IMGVR_UViG_638276111_000001|638276111|638297715 541518477
IMGVR_UViG_638276111_000001|638276111|638297716 541518477
IMGVR_UViG_638276111_000001|638276111|638297717 541518477
IMGVR_UViG_638276111_000001|638276111|638297718 541518477
IMGVR_UViG_638276111_000001|638276111|638297719 541518477
IMGVR_UViG_638276111_000001|638276111|638297720 541518477
IMGVR_UViG_638276111_000001|638276111|638297721 541518477
IMGVR_UViG_638276111_000001|638276111|638297722 541518477
IMGVR_UViG_638276111_000001|638276111|638297723 541518477

Hope there is some way to fix it!

shenwei356 commented 1 year ago

Seems you're using an old taxonkit version (<0.14.0), please update to 0.14.1

poursalavati commented 1 year ago

Seems you're using an old taxonkit version (<0.14.0), please update to 0.14.1

  • taxonkit create-taxdump:

    • save taxIds in int32 instead of uint32, as BLAST and DIAMOND do. #70

Thanks dear Wei, Im using Version: 0.14.1 and checked int32.

I will write an update on this so maybe help other similar cases:

I was reading Diamond code and found this part in taxonomy.ccp, seems Diamond edited the seq title in these conditions, and one matched my seq ids!

string get_accession(const string &title)
{
    size_t i;
    string t(title);
    if (t.compare(0, 6, "UniRef") == 0)
        t.erase(0, t.find('_', 0) + 1);
    else if ((i = t.find_first_of('|', 0)) != string::npos) {
        if (t.compare(0, 3, "gi|") == 0) {
            t.erase(0, t.find_first_of('|', i + 1) + 1);
            i = t.find_first_of('|', 0);
        }
        t.erase(0, i + 1);
        i = t.find_first_of('|', 0);
        if (i != string::npos)
            t.erase(i);
    }
    i = t.find_last_of('.');
    if (i != string::npos)
        t.erase(i);
    return t;
}

So what I did was replace pipes | to _ in both taxid.map and also fasta headers!

Thanks again for your support and suggestions, Naser