standage closed this issue 3 years ago
It looks like these may be duplicated, unmerged taxids.
Yes, they are. They should be merged.
taxonkit reformat parses complete lineages instead of reading TaxIds and querying lineages in real time, so that it also works when TaxIds are not available.
It retrieves the TaxId of every taxon node from the combination of child and parent names, to eliminate name ambiguity.
However, 2507530 and 2516889 have exactly the same lineage :( so reformat fails to distinguish them.
One solution is giving an option to specify the TaxId field for cases where TaxIds are available. Meanwhile, cases of TaxIds with the same complete lineages should be detected while parsing taxdump files.
There are 52 more cases.
child, parent                                  taxid1, taxid2
---------------------------------------------  ----------------
Russula sp. 12 KA-2019, unclassified Russula   2507523, 2516885
Russula sp. 14 KA-2019, unclassified Russula   2507524, 2516886
Russula sp. 15 KA-2019, unclassified Russula   2516887, 2507525
Russula sp. 1 KA-2019, unclassified Russula    2516884, 2507521
Russula sp. 5 KA-2019, unclassified Russula    2516888, 2507527
Russula sp. 8 KA-2019, unclassified Russula    2516889, 2507530
One solution is giving an option to specify the TaxId field for cases where TaxIds are available. Meanwhile, cases of TaxIds with the same complete lineages should be detected while parsing taxdump files.
Done.
Now, for these cases, warning messages are shown and no data is returned.
But you can use -a/--output-ambiguous-result to return one possible result, like the old version did.
echo 2507530 \
| taxonkit lineage --show-lineage-taxids --show-rank --show-status-code --show-name --show-lineage-ranks \
| taxonkit reformat --lineage-field 3 --show-lineage-taxids
19:27:53.478 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result
2507530 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 131567;2759;33154;4751;451864;5204;5302;155619;355688;452342;5401;5402;2602424;2507530 Russula sp. 8 KA-2019 species no rank;superkingdom;clade;kingdom;subkingdom;phylum;subphylum;class;no rank;order;family;genus;no rank;species
echo 2507530 \
| taxonkit lineage --show-lineage-taxids --show-rank --show-status-code --show-name --show-lineage-ranks \
| taxonkit reformat --lineage-field 3 --show-lineage-taxids -a
19:30:23.031 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result
2507530 2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 131567;2759;33154;4751;451864;5204;5302;155619;355688;452342;5401;5402;2602424;2507530 Russula sp. 8 KA-2019 species no rank;superkingdom;clade;kingdom;subkingdom;phylum;subphylum;class;no rank;order;family;genus;no rank;species Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530
If TaxIds are available, use -I/--taxid-field to specify the field that contains the TaxIds. :champagne:
$ echo -ne "2507530\n2516889\n" | TAXONKIT_DB=. taxonkit reformat -I 1 -t
2507530 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530
2516889 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2516889
Tremendous. Thank you!
By the way, I submitted Russula sp. 12 KA-2019, unclassified Russula 2507523, 2516885 to the NCBI help desk yesterday, before your response. Maybe we should just point them to this thread for all the others. 😀
Hi @standage, any response from NCBI?
Have you run into any other issues, or do you have any suggestions? I'd like to release a new version with this improved reformat.
I haven't had any other issues, thanks!
NCBI responded with the following.
Thank you very much for the notice. We have merged several such erroneous duplicates.
I didn't point them to this thread; I only mentioned Russula sp. 12 KA-2019, unclassified Russula 2507523, 2516885 in my ticket, and I haven't checked whether the latest update fixes the cases you found. So I'm not sure what the status is.
I checked the latest taxdump files: some were merged, while others were not.
09:29:49.752 [WARN] taxid 2516885 was merged into 2507523
09:29:49.752 [WARN] taxid 2516886 was merged into 2507524
09:29:49.752 [WARN] taxid 2516887 was merged into 2507525
09:29:49.752 [WARN] taxid 2516884 was merged into 2507521
09:29:49.752 [WARN] taxid 2516888 was merged into 2507527
09:29:49.752 [WARN] taxid 2516889 was merged into 2507530
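If you keep old TaxId lists around, one way to bring them up to date before processing is to map them through merged.dmp. This is only a sketch: resolve_merged is a made-up name, and it assumes the usual merged.dmp layout (old TaxId, then new TaxId, separated by tab-pipe-tab) and one TaxId per input line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of taxonkit): replace merged TaxIds read
# from stdin (one per line) with the TaxId they were merged into.
# Usage: resolve_merged merged.dmp < taxids.txt
resolve_merged() {
    local merged=$1
    awk -F '\t[|]\t?' '
        # First file (merged.dmp): old TaxId -> new TaxId.
        FNR == NR { new[$1] = $2; next }
        # Stdin: print the replacement if there is one, else the TaxId itself.
        { print (($1 in new) ? new[$1] : $1) }
    ' "$merged" -
}
```

TaxIds that are not listed in merged.dmp pass through unchanged. An empty merged.dmp is not handled by this sketch.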
$ echo -ne "1105130\n2718636" | TAXONKIT_DB=. taxonkit lineage | TAXONKIT_DB=. taxonkit reformat -t
09:31:27.603 [WARN] we can't distinguish the TaxIds (1105130, 2777044) for lineage: cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Cnidaria;Cubozoa;Chirodropida;Chiropsalmidae;Chiropsoides. But you can use -a/--output-ambiguous-result to return one possible result
09:31:27.603 [WARN] we can't distinguish the TaxIds (2713500, 2718636) for lineage: cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Listeriaceae;Listeria;unclassified Listeria;Listeria sp. FSL_L7-0091. But you can use -a/--output-ambiguous-result to return one possible result
1105130 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Cnidaria;Cubozoa;Chirodropida;Chiropsalmidae;Chiropsoides
2718636 cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Listeriaceae;Listeria;unclassified Listeria;Listeria sp. FSL_L7-0091
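To collect the remaining ambiguous pairs from a run, the warnings can be saved (for example by redirecting stderr to a file) and filtered. The sketch below assumes the warning text looks exactly like the messages above; ambiguous_taxids_from_log is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read taxonkit warnings on stdin and print each
# distinct group of indistinguishable TaxIds once.
ambiguous_taxids_from_log() {
    sed -n "s/.*we can't distinguish the TaxIds (\([0-9, ]*\)) for lineage.*/\1/p" \
        | sort -u
}
```

Lines that do not contain the warning are ignored, so the whole stderr log can be piped in as is.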
@shenwei356 First of all, thank you very much for creating this great tool! It has been very helpful in my research.
If I understood correctly, the warning should only appear if two lineages are completely identical. However, I also get this warning for two species with the same name but different lineages. I am using taxonkit 0.80 and the taxdump downloaded today.
echo -ne "46515\n" | taxonkit lineage | taxonkit reformat
produces
[WARN] we can't distinguish the TaxIds (46515, 1276929)
But the lineages of the two taxa are not identical:
46515 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Echinodermata;Eleutherozoa;Asterozoa;Asteroidea;Valvatacea;Valvatida;Asterinidae;Asterina;Asterina gibbosa
1276929 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Ascomycota;saccharomyceta;Pezizomycotina;leotiomyceta;dothideomyceta;Dothideomycetes;Dothideomycetes incertae sedis;Asterinales;Asterinaceae;Asterina;Asterina gibbosa
Is this expected behavior? Have a nice day, Felix
By default, taxonkit reformat finds the TaxId from the taxon name and the name of its parent taxon. Here, that is "Asterina;Asterina gibbosa".
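To illustrate why the two lineages above collide, here is a small sketch that reduces a lineage to its last two names, i.e. the parent and the taxon itself. It only mimics the lookup key described above and is not taxonkit's actual code; parent_child_key is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "parent;child" (the last two names) for each
# semicolon-separated lineage read from stdin.
parent_child_key() {
    awk -F ';' 'NF >= 2 { print $(NF - 1) ";" $NF }'
}
```

Feeding it the lineages of 46515 (Metazoa) and 1276929 (Fungi) gives "Asterina;Asterina gibbosa" for both, even though the rest of the lineages differ.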
If TaxIds are available, use -I/--taxid-field to specify the field that contains the TaxIds. :champagne:
$ echo -ne "2507530\n2516889\n" | TAXONKIT_DB=. taxonkit reformat -I 1 -t
2507530 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530
2516889 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2516889
Thank you for your swift reply! That makes sense. Actually, I wasn't aware of that option, but it makes life easier for me.
Hello, I noticed some unexpected behavior today. When I query and reformat the lineage for taxid 2507530, taxonkit reformat re-assigns 2516889 as the taxid in the output (the last taxid in the line). It looks like these may be duplicated, unmerged taxids.
Obviously, we should hope NCBI fixes this in the taxdump soon. But I'm assuming this is not the intended taxonkit behavior?