ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Distance matrix from classification and class2tree #849

Closed morgan-sparks closed 4 years ago

morgan-sparks commented 4 years ago

Just a quick question, not really a bug. I used the classification() function to download the taxonomic hierarchy for a list of species (insects, plants, molluscs, etc.) and then converted it to a tree using class2tree(). Works great! Obviously these are not true phylogenetic relationships, but I was curious how distances are determined for the phylo object. I am guessing it has to do with the taxonomic hierarchy, in that a genus is so far from a family, a family so far from an order, and so on, but that no distinction is made between the distance from plants to reptiles and the distance from reptiles to amphibians. I am trying to make a variance-covariance matrix to control for phylogenetic non-independence in a statistical model, and I am wondering how far off the mark I am using this method. Thanks!
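For readers who want to reproduce the workflow described in this question, here is a minimal sketch. It assumes the taxize package is installed and that the chosen database is reachable over the network. The wrapper name `taxonomy_tree` is hypothetical, the species names are only examples, and `classification()` may prompt interactively when a name matches more than one record.

```r
# Hypothetical wrapper around the two taxize calls named in the question.
# Nothing is downloaded until the function is actually called.
taxonomy_tree <- function(species, db = "ncbi") {
  # Download the taxonomic hierarchy for each species name.
  cls <- taxize::classification(species, db = db)
  # Convert the list of hierarchies into a clustered taxonomy tree.
  taxize::class2tree(cls)
}

# Example call (needs network access, so it is left commented out):
# tr <- taxonomy_tree(c("Mus musculus", "Rattus norvegicus", "Arabidopsis thaliana"))
# plot(tr)
# str(tr, max.level = 1)  # inspect which elements the result actually carries
```

Checking the result with `str()` is the safest way to find the tree and distance components rather than relying on element names from memory.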

sckott commented 4 years ago

thx for your question @morgan-sparks - @trvinh can you give some details?

It'd be worth adding a more detailed description to the function documentation too as it's pretty thin on what is being done within the function.

trvinh commented 4 years ago

Hi @morgan-sparks, yes, you were right: the classification is basically based on the taxonomy hierarchy string. Given a list of species, the function first collects all possible taxonomy ranks and their corresponding IDs for each species. Then it aligns the rank vectors, much like a sequence alignment (with each position being a taxonomy rank instead of a nucleotide or an amino acid), and replaces the ranks with their taxonomy IDs. Finally, this aligned ID matrix is used to cluster species with similar taxonomy strings together (using the hierarchical clustering function hclust).

For example, these are the rank vectors for 3 species:

SpecA: strain, species, family, phylum, kingdom, superkingdom
SpecB: strain, species, genus, family, kingdom, superkingdom
SpecC: strain, species, phylum, kingdom, superkingdom

Then our aligned rank matrix will look like:

Name   strain  species  genus  family  phylum  kingdom  superkingdom
SpecA  strain  species  NA     family  phylum  kingdom  superkingdom
SpecB  strain  species  genus  family  NA      kingdom  superkingdom
SpecC  strain  species  NA     NA      phylum  kingdom  superkingdom
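The alignment step for this toy example can be sketched in base R. This is only an illustration, not the code inside class2tree: it assumes the full rank order (`all_ranks`) is already known, whereas the real function has to work that order out from the data. The names `rank_vectors`, `all_ranks` and `align_one` are made up for the sketch.

```r
# Rank vectors for the three toy species, lowest rank first.
rank_vectors <- list(
  SpecA = c("strain", "species", "family", "phylum", "kingdom", "superkingdom"),
  SpecB = c("strain", "species", "genus", "family", "kingdom", "superkingdom"),
  SpecC = c("strain", "species", "phylum", "kingdom", "superkingdom")
)

# Assumption: the complete rank order is known up front.
all_ranks <- c("strain", "species", "genus", "family",
               "phylum", "kingdom", "superkingdom")

# Place each rank in its own column; ranks a species lacks become NA.
align_one <- function(x) ifelse(all_ranks %in% x, all_ranks, NA_character_)

aligned <- t(vapply(rank_vectors, align_one, character(length(all_ranks))))
colnames(aligned) <- all_ranks
print(aligned)
```

Each row of `aligned` now has one slot per rank, so the species can be compared column by column.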

It will be converted into this ID matrix (note: any missing rank gets a pseudo ID taken from the previous rank):

Name   strain  species  genus  family  phylum  kingdom  superkingdom
SpecA  11      21       21     41      51      61       71
SpecB  12      22       31     41      41      61       71
SpecC  13      23       23     23      52      61       71
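The "pseudo ID from the previous rank" rule can be sketched as a simple forward fill over each row. The IDs are the made-up ones from this example, and `fill_from_previous` is a hypothetical helper, not a taxize function.

```r
ranks <- c("strain", "species", "genus", "family",
           "phylum", "kingdom", "superkingdom")

# Aligned IDs, with NA wherever a species has no entry for that rank.
ids <- rbind(
  SpecA = c(11, 21, NA, 41, 51, 61, 71),
  SpecB = c(12, 22, 31, 41, NA, 61, 71),
  SpecC = c(13, 23, NA, NA, 52, 61, 71)
)
colnames(ids) <- ranks

# Replace each NA with the ID of the rank just before it (the lower rank).
fill_from_previous <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# apply() returns one column per input row, so transpose back.
filled <- t(apply(ids, 1, fill_from_previous))
print(filled)
```

Under this rule SpecB's missing phylum takes the family ID 41, and SpecC's missing genus and family both take the species ID 23.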

That aligned ID matrix will be used for clustering. As you can see, our tree will be:

((SpecA, SpecB), SpecC),

since SpecA and SpecB both belong to family 41, while all three share the same kingdom 61.
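To see why the clustering gives that tree, here is a small base R sketch. The distance used here, the number of rank columns in which two species carry different IDs, is an assumption chosen for illustration; the distance class2tree computes internally may be weighted differently. The matrix holds the toy IDs with missing ranks already filled from the previous rank.

```r
ranks <- c("strain", "species", "genus", "family",
           "phylum", "kingdom", "superkingdom")
filled <- rbind(
  SpecA = c(11, 21, 21, 41, 51, 61, 71),
  SpecB = c(12, 22, 31, 41, 41, 61, 71),
  SpecC = c(13, 23, 23, 23, 52, 61, 71)
)
colnames(filled) <- ranks

# Assumed toy distance: count the rank columns where the IDs differ.
taxa <- rownames(filled)
d <- matrix(0, length(taxa), length(taxa), dimnames = list(taxa, taxa))
for (i in taxa) {
  for (j in taxa) {
    d[i, j] <- sum(filled[i, ] != filled[j, ])
  }
}

# Hierarchical clustering on the distance matrix.
tree <- hclust(as.dist(d), method = "average")
print(d)
print(cutree(tree, k = 2))
```

SpecA and SpecB differ in fewer columns than either does from SpecC, so they are merged first and SpecC joins afterwards.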

So, the classification clusters the taxa not only based on the ranks (e.g. species, family, or phylum), but also on the actual taxonomy clade (specified by the IDs). This means it can cluster Arabidopsis within the plant clade and snakes within the reptiles. The more information you have in the aligned ID matrix (taxa with detailed taxonomy strings, and enough taxa to cover all possible ranks), the higher the resolution of your taxonomy tree. For example, these are the first few lines of a real aligned ID matrix I am working with (not exactly the same as the one class2tree delivers; I just want to give you an example of how much data this matrix can/should have):

abbrName    ncbiID  fullName    strain  norank_491  norank_57678    norank_56615    norank_1086053  isolate forma   varietas    formaspecialis  subspecies  species norank_600669   norank_44542    speciessubgroup norank_2642122  norank_2619919  norank_2627626  norank_1919231  norank_38063    norank_2622230  norank_2642239  norank_2629106  norank_2618147  norank_2625095  norank_2641301  norank_2621923  norank_256003   norank_2627676  norank_2621901  norank_2644672  norank_1104572  norank_2642439  norank_1234666  norank_2619537  norank_2646925  norank_2643926  norank_2635519  norank_2638681  norank_2624677  norank_2685009  norank_2644212  norank_2631227  norank_2144190  norank_2622732  norank_69483    norank_2643768  norank_2641396  norank_2668073  norank_2648758  norank_1655640  norank_2639520  norank_2634974  norank_2620651  norank_2641160  norank_2634096  norank_2648975  norank_2684913  norank_1700837  norank_132567   norank_2626337  norank_2634179  norank_2629569  norank_2623058  norank_1727652  norank_2684908  norank_2636082  norank_886737   norank_2638413  norank_2638438  norank_2632030  norank_2630403  norank_1679063  norank_2648658  norank_2642017  norank_2631942  norank_2630810  norank_2625419  norank_2565780  norank_2629969  norank_2648942  norank_2648404  norank_2631116  norank_2624956  norank_2610901  norank_2499238  norank_2752537  norank_857194   norank_220671   norank_254878   norank_2641196  norank_2633939  norank_2678336  norank_2640012  speciesgroup    norank_44537    subgenus    genus   norank_1535325  subtribe    norank_1304792  norank_651142   norank_1593277  norank_2116545  norank_114584   norank_2683658  norank_218105   norank_1113537  norank_1142503  norank_588816   norank_1803510  family  norank_469895   norank_715341   tribe   norank_588815   norank_1648033  subfamily   norank_359160   norank_147370   norank_2601530  norank_337687   norank_1912919  superfamily norank_2601529  norank_74971    norank_104431   norank_43741    norank_37567    
norank_43738    parvorder   norank_480117   norank_480118   infraorder  norank_33351    norank_33349    norank_33347    suborder    norank_33343    order   norank_159987   norank_33083    norank_355688   norank_9263 norank_91836    norank_314147   norank_6970 norank_4734 superorder  norank_71275    norank_186626   norank_1437201  norank_1489908  norank_1437010  subcohort   norank_1489872  norank_91827    norank_9347 norank_123369   norank_123368   norank_123367   norank_123366   norank_123365   cohort  norank_186625   norank_1489341  infraclass  norank_32525    norank_71240    norank_4447 subclass    norank_1437183  norank_1520881  norank_85512    class   norank_2283796  norank_2290931  norank_2692248  norank_2283794  norank_2683659  norank_2683660  norank_715962   norank_715989   norank_404260   norank_436492   norank_58024    norank_436491   norank_436489   norank_436486   norank_8492 norank_1329799  norank_32561    norank_8457 norank_32524    norank_78536    norank_58023    norank_716546   norank_3193 norank_32523    norank_1338369  superclass  norank_117571   norank_117570   norank_7776 norank_7742 subphylum   norank_197562   norank_197563   norank_716545   phylum  norank_1783276  norank_112252   norank_2611352  norank_2696291  norank_554915   norank_33630    norank_1206795  norank_33634    norank_2697495  norank_2698737  norank_3208 norank_1935183  norank_1783275  norank_2611341  norank_1798711  norank_68336    norank_1783257  norank_33511    norank_88770    norank_1783272  norank_1783270  norank_1206794  norank_33317    subkingdom  norank_33213    norank_6072 kingdom norank_999999005    norank_999999001    norank_999999002    norank_999999003    norank_999999004    norank_33154    superkingdom    norank_131567   root
ncbi100226  100226  Streptomyces coelicolor A3(2)   100226  100226  100226  100226  100226  100226  100226  100226  100226  100226  1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1902    1477431 1477431 1477431 1883    1883    1883    1883    1883    1883    1883    1883    1883    1883    1883    1883    1883    1883    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    2062    85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   85011   1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    1760    201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  201174  1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 1783272 2   131567  
1
ncbi10090   10090   Mus musculus    10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   10090   862507  10088   10088   10088   10088   10088   10088   10088   10088   10088   10088   10088   10088   10088   10088   10066   10066   10066   10066   10066   10066   39107   39107   39107   39107   337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  1963758 1963758 9989    9989    9989    9989    9989    9989    314147  314147  314147  314146  314146  314146  314146  314146  1437010 1437010 1437010 1437010 9347    9347    9347    9347    9347    9347    9347    9347    9347    9347    32525   32525   32525   32525   32525   32525   32525   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   32524   32524   32524   32524   32524   32523   1338369 8287    117571  117570  7776    7742    89593   89593   89593   89593   7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    33511   33511   33511   33511   33511   33511   33511   33213   6072    33208   33208   999999001   999999002   999999003   999999004   33154   2759    
131567  1
ncbi10116   10116   Rattus norvegicus   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10116   10114   10114   10114   10114   10114   10114   10114   10114   10114   10114   10114   10114   10114   10114   10066   10066   10066   10066   10066   10066   39107   39107   39107   39107   337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  337687  1963758 1963758 9989    9989    9989    9989    9989    9989    314147  314147  314147  314146  314146  314146  314146  314146  1437010 1437010 1437010 1437010 9347    9347    9347    9347    9347    9347    9347    9347    9347    9347    32525   32525   32525   32525   32525   32525   32525   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   40674   32524   32524   32524   32524   32524   32523   1338369 8287    117571  117570  7776    7742    89593   89593   89593   89593   7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    7711    33511   33511   33511   33511   33511   33511   33511   33213   6072    33208   33208   999999001   999999002   999999003   999999004   33154   2759    
131567  1

Have I answered your question, Morgan? I hope that it can help! :-) Best, Vinh

sckott commented 4 years ago

thanks @trvinh - can you add some of that to the function documentation?

trvinh commented 4 years ago

> thanks @trvinh - can you add some of that to the function documentation?

@sckott: yes, I can. But where should I add it? Directly in the Rd block for each subfunction? And how would users see that, since those functions are only internal?

sckott commented 4 years ago

I was thinking in the documentation for the class2tree function https://github.com/ropensci/taxize/blob/master/R/class2tree.R#L1-L52 - perhaps in the @details section

you can also add docs to specific internal methods if you want, and then use @keywords internal so that the help page is left out of the package index (the listing of docs pages), but the user can still get to it by doing ?fxn-name
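A rough sketch of what those two kinds of roxygen blocks could look like. Both function names and bodies are placeholders invented for this example, not the actual taxize source, and the wording of the @details text is only a suggestion.

```r
#' Convert a list of classifications to a tree (illustrative stub)
#'
#' @param input A list of classification data frames.
#' @details The rank vectors of all taxa are aligned, missing ranks are
#' filled with the ID of the previous rank, and the resulting ID matrix
#' is clustered with hclust.
#' @export
class2tree_stub <- function(input) {
  # Placeholder body; the real work would happen here.
  length(input)
}

#' Align rank vectors (hypothetical internal helper)
#'
#' @param rank_list A list of character vectors of rank names.
#' @keywords internal
align_ranks_stub <- function(rank_list) {
  # Placeholder body: the union of all ranks seen across the inputs.
  unique(unlist(rank_list))
}
```

With `@keywords internal` the second help page is still generated and reachable through `?align_ranks_stub`, but it is not shown in the package index.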

trvinh commented 4 years ago

I will try :-)

sckott commented 4 years ago

thank you!

morgan-sparks commented 4 years ago

Thanks to you both @sckott @trvinh, this was very helpful and a nice option for those of us trying to make basic distance-based phylogenies that span very broad species lists.
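For the variance-covariance use case raised at the top of the thread, here is a hedged base R sketch of turning a clustered taxonomy tree into a covariance matrix. It assumes a Brownian-motion style model in which the covariance of two tips is the branch length they share from the root, and it treats half the hclust merge height as branch length, so the numbers reflect taxonomic rank steps rather than evolutionary time. The toy distances are invented for the example; with a real phylo object, `ape::vcv()` is the usual route.

```r
taxa <- c("SpecA", "SpecB", "SpecC")

# Toy taxonomic distances (number of rank columns that differ).
d <- matrix(c(0, 4, 5,
              4, 0, 5,
              5, 5, 0),
            nrow = 3, dimnames = list(taxa, taxa))
tree <- hclust(as.dist(d), method = "average")

# Cophenetic distance: the height at which two tips are first joined.
coph <- as.matrix(cophenetic(tree))
root_height <- max(tree$height)

# Shared path from the root = (root height - merge height) / 2,
# so that the tip-to-tip path length equals the cophenetic distance.
vcv_mat <- (root_height - coph) / 2
print(vcv_mat)
```

Because all ranks count equally in this sketch, distantly related groups that share only the top ranks all end up with a covariance near zero, which is the limitation the original question was pointing at.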