Closed tolot27 closed 6 years ago
It's a good suggestion. But in some cases, e.g. viruses, the lineage does not have that many rank levels. There will be some missing/"unclassified xxx" ranks, which do not actually refer to any taxid.
For example, for taxid 1327037
$ echo 1327037 | taxonkit lineage | taxonkit reformat -f "{k};{p};{c};{o};{f};{g};{s}"
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y
Even though there are fewer rank levels, the lowest rank is still species (see https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?name=1327037).
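To make the empty fields in that output easier to follow, here is a minimal Python sketch of how a format string like the one above can be filled. It is not taxonkit's implementation. The placeholder-to-rank mapping and the "no rank" labels are assumptions inferred from where each name lands in the output; the names themselves come from the lineage shown above.

```python
import re

# Assumed mapping from placeholder letter to rank name (inferred, not taken from taxonkit's code).
PLACEHOLDER_RANKS = {
    "k": "superkingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}

# Lineage of taxid 1327037, root first. Ranks are inferred from the reformatted output;
# "no rank" marks the nodes that did not land in any placeholder.
LINEAGE_1327037 = [
    ("superkingdom", "Viruses"),
    ("no rank", "dsDNA viruses, no RNA stage"),
    ("order", "Caudovirales"),
    ("family", "Siphoviridae"),
    ("no rank", "unclassified Siphoviridae"),
    ("species", "Croceibacter phage P2559Y"),
]


def reformat(lineage, fmt):
    """Replace each {x} placeholder with the name at that rank, or '' if the rank is absent."""
    wanted = set(PLACEHOLDER_RANKS.values())
    by_rank = {rank: name for rank, name in lineage if rank in wanted}

    def substitute(match):
        rank = PLACEHOLDER_RANKS.get(match.group(1))
        return by_rank.get(rank, "")

    return re.sub(r"\{(\w)\}", substitute, fmt)


print(reformat(LINEAGE_1327037, "{k};{p};{c};{o};{f};{g};{s}"))
```

Nodes without one of the requested ranks are simply dropped, which is why phylum, class and genus come out empty for this phage.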
Implemented in v0.2.4-dev3 or later versions.
Note: many taxids share the same taxon name, such as "diastema", "solieria" and "environmental samples", so this is not simply a matter of mapping a taxon name to a taxid. I use both the taxon name and the name of its parent taxon to get the right taxid. This also gives more accurate results when using the flag -F/--fill-miss-rank to estimate and fill missing ranks. But it's slower because names.dmp is read twice.
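A rough Python sketch of the idea of keying on (name, parent name) instead of name alone. This is only an illustration, not the actual code: the records are a toy in-memory table, the helper names are hypothetical, and the taxids are made up (900001 and up) so they cannot be mistaken for real NCBI taxids.

```python
def build_index(records):
    """Index taxids by (taxon name, parent taxon name). The root uses None as parent."""
    index = {}
    for taxid, name, parent_name in records:
        index[(name, parent_name)] = taxid
    return index


def taxids_for_lineage(lineage, index):
    """Resolve every name in a ';'-separated lineage, using the previous name as parent."""
    taxids = []
    parent = None
    for name in lineage.split(";"):
        taxids.append(index.get((name, parent)))  # None if the pair is unknown
        parent = name
    return taxids


# Toy records: (taxid, name, parent name). All taxids here are MADE UP for illustration.
RECORDS = [
    (900001, "Bacteria", None),
    (900002, "environmental samples", "Bacteria"),
    (900003, "Viruses", None),
    (900004, "environmental samples", "Viruses"),
]

index = build_index(RECORDS)
print(taxids_for_lineage("Bacteria;environmental samples", index))
print(taxids_for_lineage("Viruses;environmental samples", index))
```

With a plain name-to-taxid map, the two "environmental samples" entries would overwrite each other; the pair key keeps them apart.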
PS: I spent 7 hours :tired_face:
$ cat lineage.txt
9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
11932 Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
$ cat lineage.txt | taxonkit reformat -t
9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 2759;7711;40674;9443;9604;9605;9606
349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 2;74201;203494;48461;1647988;239934;239935
239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 2;74201;203494;48461;1647988;239934;239935
11932 Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;;;;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;;;;11632;11749;11932
314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B 2;;;;;;314101
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y 10239;;;28883;10699;;1327037
$ cat lineage.txt | taxonkit reformat -t -F
9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens 2759;7711;40674;9443;9604;9605;9606
349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 2;74201;203494;48461;1647988;239934;239935
239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 2;74201;203494;48461;1647988;239934;239935
11932 Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;;;;11632;11749;11932
314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B 2;;;;;;314101
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y 10239;;;28883;10699;;1327037
$ cat lineage.txt | taxonkit reformat -t -F | cut -f 2,3 | perl -pe 's/^/Original: /; s/\n/\n\n/; s/\t/\nReformat: /'
Original: cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
Reformat: Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
Original: cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Original: cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Original: Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Original: cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
Reformat: Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B
Original: Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y
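Judging from the output above, each missing rank seems to be filled as "unclassified", then the nearest known higher-rank name, then the rank name. A small Python sketch of that rule, written from the observed output only and not from taxonkit's source; the function name and the rank list are assumptions.

```python
# Ranks in the order of the reformatted columns (assumed from the output above).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def fill_missing(names, prefix="unclassified"):
    """names: one entry per rank in RANKS, '' where the rank is missing.

    A missing rank becomes '<prefix> <nearest known higher-rank name> <rank>'.
    If nothing above it is known, it is left empty.
    """
    filled = []
    last_known = ""
    for rank, name in zip(RANKS, names):
        if name:
            last_known = name
            filled.append(name)
        elif last_known:
            filled.append(f"{prefix} {last_known} {rank}")
        else:
            filled.append("")
    return filled


unfilled = "Viruses;;;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y"
print(";".join(fill_missing(unfilled.split(";"))))
```

The taxid column is left alone here, which matches the output above: filled-in names have no taxid of their own.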
Thanks for your great work. Currently, I don't understand why you have to read names.dmp twice. The only case I can imagine is when only a taxon name is provided, without any taxid.
I suggest taxid 376619 (Francisella tularensis subsp. holarctica LVS) as an additional test case because it also includes a subspecies.
Thanks for your suggestions. I was being foolish. I'll improve it soon.
The code has been optimized for speed and memory usage. You may re-download the binaries.
Taxid 376619 has also been added to the test data set, in which 349741 is also a subspecies.
$ cat lineage.txt
9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
11932 Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
$ cat lineage.txt | ./taxonkit reformat -t -F > lineage.txt.reformat.fill
$ cat lineage.txt.reformat.fill \
| perl -pe 's/^/Taxid : /; \
s/\t/\nLineage : /; \
s/\t/\nReformat: /; \
s/\t/\nTaxids : /; \
print "\n";'
Taxid : 9606
Lineage : cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
Reformat: Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
Taxids : 2759;7711;40674;9443;9604;9605;9606
Taxid : 9913
Lineage : cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
Reformat: Eukaryota;Chordata;Mammalia;unclassified Mammalia order;Bovidae;Bos;Bos taurus
Taxids : 2759;7711;40674;;9895;9903;9913
Taxid : 376619
Lineage : cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
Reformat: Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
Taxids : 2;1224;1236;72273;34064;262;263
Taxid : 349741
Lineage : cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Taxids : 2;74201;203494;48461;1647988;239934;239935
Taxid : 239935
Lineage : cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Reformat: Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Taxids : 2;74201;203494;48461;1647988;239934;239935
Taxid : 314101
Lineage : cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
Reformat: Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;uncultured murine large bowel bacterium BAC 54B
Taxids : 2;;;;;;314101
Taxid : 11932
Lineage : Viruses;Retro-transcribing viruses;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
Taxids : 10239;;;;11632;11749;11932
Taxid : 1327037
Lineage : Viruses;dsDNA viruses, no RNA stage;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
Reformat: Viruses;unclassified Viruses phylum;unclassified Viruses class;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y
Taxids : 10239;;;28883;10699;;1327037
@tolot27 Tell me if you have more suggestions or feedback, I would like to release a new version soon.
Two points I thought about: First, with the parameter -r, --miss-rank-repl string it is possible to choose a different prefix than "unclassified" for missing ranks, but it is not possible to omit the prefix altogether because the empty string "" triggers the default.
Second, there is no fallback for the subspecies taxid. Only the parameter -R, --miss-taxid-repl string exists, which is just a static string, like the prefix, for the subspecies rank. Because my main aim is to filter out strain-level resolution but keep subspecies resolution where applicable, I currently use awk to copy the species taxid to the subspecies taxid if the latter is empty. Then I can use the subspecies taxid as the taxon column with KronaTools. Maybe I should stick with awk rather than enhancing taxonkit with such special cases.
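For anyone reading along, the fallback described here can be sketched in a few lines of Python. This is only an illustration of the logic, not the awk command actually used: the tab-separated layout (query taxid, species taxid, subspecies taxid) and the function name are assumptions, and the subspecies taxid 999999 in the second row is made up.

```python
def add_merged_column(lines, species_col, subspecies_col):
    """Append a column holding the subspecies taxid, or the species taxid if that is empty."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        merged = fields[subspecies_col] or fields[species_col]
        out.append("\t".join(fields + [merged]))
    return out


# Assumed layout: query taxid, species taxid, subspecies taxid (may be empty).
rows = [
    "239935\t239935\t",        # no subspecies: falls back to the species taxid
    "376619\t263\t999999",     # 999999 is a MADE-UP subspecies taxid for illustration
]

for row in add_merged_column(rows, species_col=1, subspecies_col=2):
    print(row)
```

The last column can then serve as the single taxon column for a downstream tool, with strain-level taxids collapsed away.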
For the first issue, you could use a special symbol as the replacement and remove it afterwards.
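A minimal sketch of that workaround in Python, assuming the special symbol ends up at the start of each filled-in name. The sentinel "@@" and the function name are arbitrary choices for illustration; how the symbol is passed to taxonkit is not shown here.

```python
def strip_sentinel(lineage, sentinel):
    """Remove a leading sentinel (and the space after it) from every name in the lineage."""
    cleaned = []
    for name in lineage.split(";"):
        if name.startswith(sentinel):
            name = name[len(sentinel):].lstrip()
        cleaned.append(name)
    return ";".join(cleaned)


# Hypothetical reformatted lineage where "@@" was used in place of "unclassified".
example = "Viruses;@@ Viruses phylum;@@ Viruses class;Caudovirales"
print(strip_sentinel(example, "@@"))
```

Pick a symbol that cannot occur in a real taxon name, otherwise genuine names could be altered.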
Subspecies is supported using the placeholder {S} in the flag -f. 😄 http://bioinf.shenwei.me/taxonkit/usage/#reformat
subspecies is supported using placeholder {S}.
I know, but that was not my point. I was talking about a column containing the taxid of the subspecies if one is defined, or the taxid of the species if no subspecies is defined. Currently, the subspecies taxid column is empty if no subspecies is defined, which is indeed correct. But a merged column is required, for instance for KronaTools. That is probably out of scope for taxonkit, though.
Yes, that can be done using awk.
v0.2.4 is out.
You can now set the prefix "unclassified" with the flag -p/--miss-rank-repl-prefix.
Since no "deepest" taxid of the supported/requested ranks exists, you can close this issue.
Like the format string in reformat, it would be interesting to have placeholders for the taxids of the available ranks. Most useful for downstream analyses and visualization, e.g. of metagenomic data, would be the taxids of the species and subspecies ranks. Taxonomic classifiers often choose a certain strain, or many different strains of the same species/subspecies, randomly or incorrectly. That clutters the output. Such placeholders could easily be used to filter the dataset during visualization rather than during processing.