Closed morgan-sparks closed 4 years ago
thx for your question @morgan-sparks - @trvinh can you give some details?
It'd be worth adding a more detailed description to the function documentation too as it's pretty thin on what is being done within the function.
Hi @morgan-sparks ,
yes, you were right: the classification is based on the taxonomy hierarchy string. Given a list of species, the function first collects all available taxonomy ranks and their corresponding IDs for each species. It then aligns the rank vectors, much like a sequence alignment (each position is a taxonomy rank instead of a nucleotide or an amino acid), and replaces the ranks with their taxonomy IDs. Finally, this aligned ID matrix is used to cluster species with similar taxonomy strings together (using the hierarchical clustering function `hclust`).
For example, these are the rank vectors for 3 species:
SpecA: strain, species, family, phylum, kingdom, superkingdom
SpecB: strain, species, genus, family, kingdom, superkingdom
SpecC: strain, species, phylum, kingdom, superkingdom
Then our aligned rank matrix will look like:
| Name | strain | species | genus | family | phylum | kingdom | superkingdom |
|---|---|---|---|---|---|---|---|
| SpecA | strain | species | NA | family | phylum | kingdom | superkingdom |
| SpecB | strain | species | genus | family | NA | kingdom | superkingdom |
| SpecC | strain | species | NA | NA | phylum | kingdom | superkingdom |
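It will be converted into this ID matrix (note: any missing rank gets a pseudo ID copied from the previous rank, i.e. the column to its left):

| Name | strain | species | genus | family | phylum | kingdom | superkingdom |
|---|---|---|---|---|---|---|---|
| SpecA | 11 | 21 | 21 | 41 | 51 | 61 | 71 |
| SpecB | 12 | 22 | 31 | 41 | 41 | 61 | 71 |
| SpecC | 13 | 23 | 23 | 23 | 52 | 61 | 71 |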
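The alignment and fill step can be sketched in base R as follows. This is only an illustration of the rule described in this comment, not the code inside `class2tree`: the IDs are the made-up ones from the example, the full rank order is supplied by hand (the real function derives it by aligning the rank vectors), and `fill_from_previous` is a hypothetical helper name.

```r
# Rank vectors with the made-up taxonomy IDs from the example.
ranks <- list(
  SpecA = c(strain = 11, species = 21, family = 41, phylum = 51,
            kingdom = 61, superkingdom = 71),
  SpecB = c(strain = 12, species = 22, genus = 31, family = 41,
            kingdom = 61, superkingdom = 71),
  SpecC = c(strain = 13, species = 23, phylum = 52,
            kingdom = 61, superkingdom = 71)
)

# Assumption: the complete rank order is known in advance.
all_ranks <- c("strain", "species", "genus", "family",
               "phylum", "kingdom", "superkingdom")

# Aligned ID matrix: one row per species, NA where a rank is missing.
id_mat <- t(sapply(ranks, function(x) unname(x[all_ranks])))
colnames(id_mat) <- all_ranks

# Hypothetical helper: a missing rank takes the ID of the rank to its left.
# A missing first rank is left as NA; the example never hits that case.
fill_from_previous <- function(row) {
  for (i in seq_along(row)[-1]) {
    if (is.na(row[i])) row[i] <- row[i - 1]
  }
  row
}

id_filled <- t(apply(id_mat, 1, fill_from_previous))
print(id_filled)
```

Under this rule the missing phylum of SpecB receives its family ID, and the missing genus and family of SpecC both receive its species ID.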
That aligned ID matrix will be used for clustering. As you can see, our tree will be:
((SpecA, SpecB), SpecC),
since SpecA and SpecB both belong to family 41, and all three share the same kingdom 61.
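To make the clustering step concrete, here is a minimal base R sketch that runs `hclust` on the example ID matrix. The distance measure (the number of ranks at which two species carry different IDs) and the `"average"` linkage are assumptions chosen for illustration; the thread does not say which distance or linkage `class2tree` uses. The missing phylum of SpecB is filled with its family ID, following the fill rule stated above.

```r
# Example ID matrix with missing ranks already filled from the previous rank.
id_mat <- rbind(
  SpecA = c(11, 21, 21, 41, 51, 61, 71),
  SpecB = c(12, 22, 31, 41, 41, 61, 71),
  SpecC = c(13, 23, 23, 23, 52, 61, 71)
)
colnames(id_mat) <- c("strain", "species", "genus", "family",
                      "phylum", "kingdom", "superkingdom")

# Assumed distance: count of ranks where the two species have different IDs.
n <- nrow(id_mat)
d <- matrix(0, n, n, dimnames = list(rownames(id_mat), rownames(id_mat)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(id_mat[i, ] != id_mat[j, ])
  }
}

# Hierarchical clustering on the pairwise distances.
hc <- hclust(as.dist(d), method = "average")

# Negative entries in hc$merge are indices of single observations,
# so the first row tells us which two species are joined first.
first_merge <- rownames(id_mat)[-hc$merge[1, ]]
print(first_merge)
# plot(hc) draws the dendrogram in an interactive session.
```

With these IDs, SpecA and SpecB differ at four ranks while each differs from SpecC at five, so the first merge joins SpecA and SpecB, matching the tree given above.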
So the classification clusters taxa based not only on the ranks (e.g. species, family, or phylum) but also on the actual taxonomic clade (specified by the IDs). This means it can place Arabidopsis within the plant clade and snakes within reptiles. The more information the aligned ID matrix contains (taxa with detailed taxonomy strings, and enough taxa to cover all possible ranks), the higher the resolution of the taxonomy tree. For example, here are the first few lines of a real aligned ID matrix I am working with (not exactly the same as the one `class2tree` delivers; I just want to give you an idea of how much data this matrix can/should have):
abbrName ncbiID fullName strain norank_491 norank_57678 norank_56615 norank_1086053 isolate forma varietas formaspecialis subspecies species norank_600669 norank_44542 speciessubgroup norank_2642122 norank_2619919 norank_2627626 norank_1919231 norank_38063 norank_2622230 norank_2642239 norank_2629106 norank_2618147 norank_2625095 norank_2641301 norank_2621923 norank_256003 norank_2627676 norank_2621901 norank_2644672 norank_1104572 norank_2642439 norank_1234666 norank_2619537 norank_2646925 norank_2643926 norank_2635519 norank_2638681 norank_2624677 norank_2685009 norank_2644212 norank_2631227 norank_2144190 norank_2622732 norank_69483 norank_2643768 norank_2641396 norank_2668073 norank_2648758 norank_1655640 norank_2639520 norank_2634974 norank_2620651 norank_2641160 norank_2634096 norank_2648975 norank_2684913 norank_1700837 norank_132567 norank_2626337 norank_2634179 norank_2629569 norank_2623058 norank_1727652 norank_2684908 norank_2636082 norank_886737 norank_2638413 norank_2638438 norank_2632030 norank_2630403 norank_1679063 norank_2648658 norank_2642017 norank_2631942 norank_2630810 norank_2625419 norank_2565780 norank_2629969 norank_2648942 norank_2648404 norank_2631116 norank_2624956 norank_2610901 norank_2499238 norank_2752537 norank_857194 norank_220671 norank_254878 norank_2641196 norank_2633939 norank_2678336 norank_2640012 speciesgroup norank_44537 subgenus genus norank_1535325 subtribe norank_1304792 norank_651142 norank_1593277 norank_2116545 norank_114584 norank_2683658 norank_218105 norank_1113537 norank_1142503 norank_588816 norank_1803510 family norank_469895 norank_715341 tribe norank_588815 norank_1648033 subfamily norank_359160 norank_147370 norank_2601530 norank_337687 norank_1912919 superfamily norank_2601529 norank_74971 norank_104431 norank_43741 norank_37567 norank_43738 parvorder norank_480117 norank_480118 infraorder norank_33351 norank_33349 norank_33347 suborder norank_33343 order norank_159987 norank_33083 norank_355688 norank_9263 
norank_91836 norank_314147 norank_6970 norank_4734 superorder norank_71275 norank_186626 norank_1437201 norank_1489908 norank_1437010 subcohort norank_1489872 norank_91827 norank_9347 norank_123369 norank_123368 norank_123367 norank_123366 norank_123365 cohort norank_186625 norank_1489341 infraclass norank_32525 norank_71240 norank_4447 subclass norank_1437183 norank_1520881 norank_85512 class norank_2283796 norank_2290931 norank_2692248 norank_2283794 norank_2683659 norank_2683660 norank_715962 norank_715989 norank_404260 norank_436492 norank_58024 norank_436491 norank_436489 norank_436486 norank_8492 norank_1329799 norank_32561 norank_8457 norank_32524 norank_78536 norank_58023 norank_716546 norank_3193 norank_32523 norank_1338369 superclass norank_117571 norank_117570 norank_7776 norank_7742 subphylum norank_197562 norank_197563 norank_716545 phylum norank_1783276 norank_112252 norank_2611352 norank_2696291 norank_554915 norank_33630 norank_1206795 norank_33634 norank_2697495 norank_2698737 norank_3208 norank_1935183 norank_1783275 norank_2611341 norank_1798711 norank_68336 norank_1783257 norank_33511 norank_88770 norank_1783272 norank_1783270 norank_1206794 norank_33317 subkingdom norank_33213 norank_6072 kingdom norank_999999005 norank_999999001 norank_999999002 norank_999999003 norank_999999004 norank_33154 superkingdom norank_131567 root
ncbi100226 100226 Streptomyces coelicolor A3(2) 100226 100226 100226 100226 100226 100226 100226 100226 100226 100226 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1902 1477431 1477431 1477431 1883 1883 1883 1883 1883 1883 1883 1883 1883 1883 1883 1883 1883 1883 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 85011 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 1760 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 201174 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 2 131567 1
ncbi10090 10090 Mus musculus 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 10090 862507 10088 10088 10088 10088 10088 10088 10088 10088 10088 10088 10088 10088 10088 10088 10066 10066 10066 10066 10066 10066 39107 39107 39107 39107 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 1963758 1963758 9989 9989 9989 9989 9989 9989 314147 314147 314147 314146 314146 314146 314146 314146 1437010 1437010 1437010 1437010 9347 9347 9347 9347 9347 9347 9347 9347 9347 9347 32525 32525 32525 32525 32525 32525 32525 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 32524 32524 32524 32524 32524 32523 1338369 8287 117571 117570 7776 7742 89593 89593 89593 89593 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 33511 33511 33511 33511 33511 33511 33511 33213 6072 33208 33208 999999001 999999002 999999003 999999004 33154 2759 131567 1
ncbi10116 10116 Rattus norvegicus 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10116 10114 10114 10114 10114 10114 10114 10114 10114 10114 10114 10114 10114 10114 10114 10066 10066 10066 10066 10066 10066 39107 39107 39107 39107 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 337687 1963758 1963758 9989 9989 9989 9989 9989 9989 314147 314147 314147 314146 314146 314146 314146 314146 1437010 1437010 1437010 1437010 9347 9347 9347 9347 9347 9347 9347 9347 9347 9347 32525 32525 32525 32525 32525 32525 32525 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 40674 32524 32524 32524 32524 32524 32523 1338369 8287 117571 117570 7776 7742 89593 89593 89593 89593 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 7711 33511 33511 33511 33511 33511 33511 33511 33213 6072 33208 33208 999999001 999999002 999999003 999999004 33154 2759 131567 1
Have I answered your question, Morgan? I hope that it can help! :-) Best, Vinh
thanks @trvinh - can you add some of that to the function documentation?
@sckott: yes, I can. But where should I add it? Directly to the Rd block for each subfunction? And how would users see that, since those functions are internal only?
I was thinking of the documentation for the class2tree function, https://github.com/ropensci/taxize/blob/master/R/class2tree.R#L1-L52, perhaps in the `@details` section.
You can also add docs to specific internal methods if you want, and then use `@keywords internal` so that the manual file isn't shown in the listing of docs pages, but the user can still get to it by doing `?fxn-name`.
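As a rough illustration of that suggestion, a roxygen2 block for an internal helper could look like the sketch below. The function name `fill_missing_ranks` and its wording are hypothetical and not taken from the taxize source; only the `@details` and `@keywords internal` tags come from the comment above.

```r
#' Fill missing ranks in an aligned ID vector (hypothetical helper)
#'
#' @details A missing rank (`NA`) takes the taxonomy ID of the previous
#'   rank, so every species ends up with an ID at every aligned rank.
#' @param ids Numeric vector of taxonomy IDs, ordered from the lowest
#'   rank to the highest.
#' @return The same vector with `NA` entries filled from the left.
#' @keywords internal
fill_missing_ranks <- function(ids) {
  for (i in seq_along(ids)[-1]) {
    if (is.na(ids[i])) ids[i] <- ids[i - 1]
  }
  ids
}

# Quick check of the helper on a toy vector.
print(fill_missing_ranks(c(11, 21, NA, 41)))
```

The `@keywords internal` tag keeps the help page out of the package index, while the longer explanation for users would go in the `@details` section of the exported `class2tree` documentation.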
I will try :-)
thank you!
Thanks to you both @sckott @trvinh, this was very helpful and a nice option for those of us trying to make basic distance-based phylogenies that span very broad species lists.
Just a quick question, not really a bug. I used the `classification()` function to download the taxonomic hierarchy for a list of species I have (insects, plants, molluscs, etc.) and then converted it to a tree using `class2tree()`. Works great! Obviously these are not true phylogenetic relationships, but I was curious how distances are determined for the phylo object. I am guessing it has to do with the taxonomic hierarchy, so that a genus is so far from a family, a family so far from an order, and so on, but no distinction is made for the distance between plants and reptiles, or between reptiles and amphibians. I am trying to make a variance-covariance matrix to control for phylogenetic non-independence in a statistical model, and I am wondering how far off the mark I am using this method. Thanks!