grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation

(Phyloseq Workflow) Inconsistent information about taxon_ids. #220

Open grabear opened 6 years ago

grabear commented 6 years ago

The taxon_ids present in the otu_table/tax_data tables differ from, but are related to, the taxon_ids in the tax_table/diff_table tables created by coercing the phyloseq object into a taxmap object as shown here.

While the different taxon_ids are obviously just part of the taxmap hierarchy, the confusion could be remedied by including the OTU label mentioned in #219 for taxon_ids that are legitimately identified/annotated organisms. Intermediate taxon_ids, which are used to identify/accumulate data for each of the intermediate taxonomic ranks, would be left without this OTU label.

The intermediate taxon_ids could also be given a group OTU label that gives access to the list of OTU labels that are included in that taxonomy.

grabear commented 6 years ago

> length(m$data$diff_table$wilcox_p_value)
[1] 420
> length(m$taxon_ids())
[1] 239
> length(unique(m$data$diff_table$taxon_id))
[1] 239
> m
<Taxmap>
  239 taxa: ac. Bacteria, af. Firmicutes ... ff. Anaeroplasmataceae, ky. Anaeroplasma
  239 edges: NA->ac, ac->af, af->av, av->bu, bu->dc ... ac->at, at->bs, bs->da, da->ff, ff->ky
  6 data sets:
    otu_table:
    # A tibble: 1,866 x 49
      taxon_id  Sample_1  Sample_2 Sample_3  Sample_6 Sample_9 Sample_10 Sample_13 Sample_14  Sample_17
      <chr>        <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
    1 fh       0.000438  0.0000118 0.000435 0.0000257 0.000245 0.0000896 0.0000253 0         0.00000735
    2 fi       0.0000438 0.0000589 0.000814 0.000103  0.000768 0.000381  0.000160  0.0000110 0.00000735
    3 fi       0.000728  0         0.000860 0.00237   0.000309 0.000179  0.0000422 0         0.0000294 
    # ... with 1,863 more rows, and 39 more variables: Sample_21 <dbl>, Sample_22 <dbl>,
    #   Sample_24 <dbl>, Sample_25 <dbl>, Sample_26 <dbl>, Sample_30 <dbl>, Sample_50 <dbl>,
    #   Sample_59 <dbl>, Sample_60 <dbl>, Sample_61 <dbl>, Sample_7 <dbl>, Sample_27 <dbl>,
    #   Sample_45 <dbl>, Sample_5 <dbl>, Sample_57 <dbl>, Sample_20 <dbl>, Sample_29 <dbl>,
    #   Sample_11 <dbl>, Sample_12 <dbl>, Sample_15 <dbl>, Sample_16 <dbl>, Sample_18 <dbl>,
    #   Sample_19 <dbl>, Sample_23 <dbl>, Sample_28 <dbl>, Sample_31 <dbl>, Sample_35 <dbl>,
    #   Sample_40 <dbl>, Sample_41 <dbl>, Sample_46 <dbl>, Sample_47 <dbl>, Sample_51 <dbl>,
    #   Sample_55 <dbl>, Sample_56 <dbl>, Sample_58 <dbl>, Sample_8 <dbl>, Sample_4 <dbl>,
    #   Sample_52 <dbl>, Sample_36 <dbl>
    tax_data:
    # A tibble: 1,866 x 8
      taxon_id Kingdom  Phylum     Class         Order           Family             Genus     Species  
      <chr>    <chr>    <chr>      <chr>         <chr>           <chr>              <chr>     <chr>    
    1 fh       Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae    Megamonas uncultur~
    2 fi       Bacteria Firmicutes Negativicutes Selenomonadales Acidaminococcaceae Phascola~ uncultur~
    3 fi       Bacteria Firmicutes Negativicutes Selenomonadales Acidaminococcaceae Phascola~ uncultur~
    # ... with 1,863 more rows
    sam_data:
    # A tibble: 48 x 9
      sample_ids X.SampleID BarcodeSequence LinkerPrimerSeque~ ForwardFastqFile     ReverseFastqFile   
      <chr>      <chr>      <chr>           <chr>              <chr>                <chr>              
    1 Sample_1   Sample_1   <NA>            <NA>               33749_S1_L001_R1_00~ 33749_S1_L001_R2_0~
    2 Sample_2   Sample_2   <NA>            <NA>               33739_S2_L001_R1_00~ 33739_S2_L001_R2_0~
    3 Sample_3   Sample_3   <NA>            <NA>               33737_S3_L001_R1_00~ 33737_S3_L001_R2_0~
    # ... with 45 more rows, and 3 more variables: TreatmentGroup <chr>, SampleName <chr>,
    #   Description <chr>
    phylo_tree:

      Phylogenetic tree with 1955 tips and 1954 internal nodes.

      Tip labels:
        New.CleanUp.ReferenceOTU177, New.ReferenceOTU1091, New.ReferenceOTU2352, EU774211.1.1284, New.ReferenceOTU1302, New.ReferenceOTU239, ...
      Node labels:
        Root, 0.917, 0.794, 0.853, 0.768, 0.880, ...

      Rooted; includes branch lengths.
    tax_table:
    # A tibble: 420 x 49
      taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17
      <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
    1 ac         0.987     0.994  0.982     0.991   0.983     0.988     0.986     0.991      0.988 
    2 af         0.473     0.370  0.468     0.397   0.392     0.632     0.436     0.363      0.385 
    3 ag         0.0370    0.131  0.00445   0.0327  0.00695   0.00538   0.00938   0.00385    0.0235
    # ... with 417 more rows, and 39 more variables: Sample_21 <dbl>, Sample_22 <dbl>, Sample_24 <dbl>,
    #   Sample_25 <dbl>, Sample_26 <dbl>, Sample_30 <dbl>, Sample_50 <dbl>, Sample_59 <dbl>,
    #   Sample_60 <dbl>, Sample_61 <dbl>, Sample_7 <dbl>, Sample_27 <dbl>, Sample_45 <dbl>,
    #   Sample_5 <dbl>, Sample_57 <dbl>, Sample_20 <dbl>, Sample_29 <dbl>, Sample_11 <dbl>,
    #   Sample_12 <dbl>, Sample_15 <dbl>, Sample_16 <dbl>, Sample_18 <dbl>, Sample_19 <dbl>,
    #   Sample_23 <dbl>, Sample_28 <dbl>, Sample_31 <dbl>, Sample_35 <dbl>, Sample_40 <dbl>,
    #   Sample_41 <dbl>, Sample_46 <dbl>, Sample_47 <dbl>, Sample_51 <dbl>, Sample_55 <dbl>,
    #   Sample_56 <dbl>, Sample_58 <dbl>, Sample_8 <dbl>, Sample_4 <dbl>, Sample_52 <dbl>,
    #   Sample_36 <dbl>
    diff_table:
    # A tibble: 420 x 11
      taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
      <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
    1 ac       Stressed    Control               0.00272     0.00186   0.00106      0.313    
    2 af       Stressed    Control               0.322       0.0850    0.102        0.0000803
    3 ag       Stressed    Control              -0.542      -0.00454  -0.00455      0.219    
    # ... with 417 more rows, and 4 more variables: hartigan_dip_treat1 <dbl>,
    #   hartigan_dip_treat2 <dbl>, bimodality_coeff_treat1 <dbl>, bimodality_coeff_treat2 <dbl>
  0 functions:

grabear commented 6 years ago

> m$data$otu_table$taxon_id
   [1] "fh" "fi" "fi" "fi" "fi" "fi" "fj" "de" "fj" "fj" "fl" "fl" "fl" "fl" "fl" "fm" "fm" "fm" "fm" "fm" "fm"
  [22] "fj" "de" "fn" "df" "fp" "fp" "fp" "fq" "fq" "fr" "fn" "fr" "fn" "fs" "df" "ft" "ft" "ft" "fl" "fl" "fl"
  [43] "fl" "fl" "ft" "ft" "de" "de" "fj" "fu" "fu" "fv" "de" "de" "fx" "fx" "de" "fx" "fy" "fy" "fj" "fj" "fv"
  [64] "fv" "fm" "fs" "fs" "fs" "fz" "de" "ga" "fm" "fm" "fm" "fm" "fs" "gb" "gc" "de" "gd" "gd" "fm" "ge" "gd"
  [85] "gd" "gd" "de" "gf" "gf" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "ge" "gd" "gg" "gg"
 [106] "gg" "gg" "gh" "gg" "gg" "gg" "fu" "fu" "fu" "gi" "de" "fs" "fs" "fs" "fs" "de" "fs" "fs" "fm" "de" "fs"
 [127] "fs" "gj" "fs" "fs" "fs" "gk" "gl" "gl" "fv" "gl" "de" "gm" "de" "gn" "gl" "fm" "de" "go" "gp" "de" "af"
 [148] "dc" "de" "gs" "gt" "gs" "gs" "gs" "gs" "gs" "gu" "gs" "gs" "gs" "gt" "gt" "gt" "gt" "gs" "gs" "gs" "gs"
 [169] "gs" "gb" "gb" "gt" "gt" "gt" "gs" "gs" "gv" "gv" "gw" "gw" "gu" "gu" "de" "gx" "gx" "gy" "gy" "bv" "ha"
 [190] "gx" "bv" "hb" "hc" "hd" "fj" "af" "af" "hb" "he" "he" "he" "he" "hf" "hg" "hg" "hg" "hg" "hh" "hi" "hi"
 [211] "hi" "hj" "dr" "hl" "hm" "oh" "hm" "hm" "oj" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "cf" "cf" "bd" "hs"
 [232] "hs" "hs" "hs" "hs" "hs" "hs" "hs" "ht" "hu" "hv" "hv" "hv" "hw" "ea" "ea" "hy" "hz" "df" "ib" "ib" "ic"
 [253] "ic" "ic" "ic" "ic" "ic" "ic" "ic" "ic" "ic" "ic" "ic" "ic" "df" "id" "ie" "ie" "id" "fn" "fn" "if" "de"
 [274] "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "ig" "de" "ht" "ht" "ih"
 [295] "ii" "gm" "gm" "gm" "gm" "gm" "de" "go" "fj" "gm" "gm" "ik" "ik" "gm" "gj" "gm" "gm" "il" "de" "gm" "de"
 [316] "gm" "ik" "de" "gm" "gm" "gm" "ht" "gl" "de" "de" "de" "im" "de" "im" "im" "fm" "fm" "ga" "ga" "de" "go"
 [337] "de" "de" "fj" "in" "in" "in" "io" "io" "io" "io" "io" "io" "io" "io" "io" "io" "io" "io" "io" "io" "in"
 [358] "io" "in" "io" "io" "io" "df" "in" "in" "in" "in" "in" "in" "in" "in" "io" "io" "in" "in" "in" "in" "in"
 [379] "in" "in" "io" "io" "io" "io" "io" "io" "io" "io" "io" "df" "io" "io" "io" "io" "df" "df" "df" "ip" "ip"
 [400] "iq" "ir" "is" "is" "fj" "gm" "de" "ge" "ed" "ed" "ed" "ee" "iv" "ee" "iv" "iv" "iv" "ee" "ee" "iv" "ix"
 [421] "ix" "ix" "ix" "iy" "ho" "ho" "ho" "iz" "ja" "ja" "de" "fy" "fy" "gm" "de" "de" "gm" "gm" "gm" "gj" "gm"
 [442] "gm" "gm" "gm" "gm" "gm" "gm" "gm" "de" "gm" "gm" "gj" "de" "gm" "de" "gm" "de" "gj" "gj" "gj" "gj" "gj"
 [463] "gj" "gj" "gm" "gj" "gj" "fj" "fj" "fj" "fj" "fj" "fj" "fj" "de" "gp" "gp" "fp" "fp" "fp" "fp" "fp" "fp"
 [484] "fp" "fp" "fp" "fp" "fp" "fp" "fp" "fp" "fp" "fp" "de" "de" "fj" "de" "de" "fn" "fn" "fr" "bv" "fn" "fn"
 [505] "io" "fi" "fi" "af" "fi" "fi" "fi" "fi" "fi" "fi" "fi" "fi" "fi" "fi" "af" "af" "af" "af" "af" "jb" "fi"
 [526] "fr" "gb" "gb" "gb" "ge" "df" "df" "df" "fq" "fq" "df" "fq" "df" "ac" "fn" "fn" "fn" "fn" "fn" "fn" "fn"
 [547] "df" "df" "bv" "bv" "ig" "df" "fs" "jd" "jd" "fr" "fr" "if" "if" "if" "bv" "je" "fr" "fr" "de" "jf" "jf"
 [568] "jf" "df" "qw" "jh" "jh" "ei" "iv" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho"
 [589] "ho" "ho" "ho" "ho" "ho" "ho" "ho" "ho" "jj" "jj" "rd" "jj" "gu" "gu" "gu" "gu" "gu" "gu" "gu" "gu" "jj"
 [610] "jj" "jj" "jj" "jj" "jj" "jj" "jj" "re" "re" "re" "re" "rd" "rd" "jj" "jj" "jk" "hv" "hv" "hv" "hv" "bd"
 [631] "bd" "cq" "cq" "cq" "cq" "cq" "cq" "cq" "cq" "jo" "jo" "ib" "ib" "ib" "ib" "hb" "hb" "jd" "jp" "de" "df"
 [652] "fr" "fr" "fq" "df" "df" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr"
 [673] "fr" "jd" "jh" "jh" "jh" "ho" "de" "fj" "fr" "ge" "fj" "jd" "jp" "fj" "fx" "ee" "jq" "jq" "ee" "hs" "hs"
 [694] "hs" "hs" "iv" "iv" "iv" "iv" "jr" "ho" "rq" "ix" "iv" "iv" "ee" "hs" "hs" "hs" "hs" "hs" "hs" "hs" "hs"
 [715] "hs" "hs" "hs" "hs" "hs" "hs" "hs" "hs" "hs" "hs" "ee" "eq" "eq" "eq" "eq" "eq" "eq" "jj" "iv" "ch" "iv"
 [736] "iv" "iv" "iv" "iv" "iv" "iv" "iv" "iv" "iv" "iv" "iv" "iv" "jv" "iv" "iv" "iv" "iv" "iv" "dy" "dy" "iv"
 [757] "jv" "dy" "iv" "iv" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "dy" "iv" "iv"
 [778] "iv" "dy" "iv" "ee" "hz" "ib" "de" "fl" "es" "jj" "jj" "jj" "jj" "jj" "jj" "rd" "jj" "jj" "jj" "jj" "jj"
 [799] "jj" "jj" "jj" "jj" "jj" "jj" "jj" "jj" "jj" "jj" "jj" "jj" "af" "af" "fx" "hc" "jz" "jz" "jz" "es" "hb"
 [820] "hb" "hb" "es" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "hb" "es" "es" "es" "kb" "kb"
 [841] "eu" "jo" "jo" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je"
 [862] "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je"
 [883] "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "je" "fp" "hz" "hz" "fq" "fr"
 [904] "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr"
 [925] "fr" "fr" "fr" "fr" "df" "ja" "ja" "ja" "ja" "ja" "ja" "ja" "ja" "ib" "ib" "ib" "ib" "ib" "ib" "ib" "ib"
 [946] "ib" "ib" "ib" "kd" "fr" "fr" "fr" "fr" "fr" "df" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "fr" "ke" "fp" "fp"
 [967] "fq" "fq" "fr" "fr" "fr" "fr" "fr" "df" "fp" "df" "kd" "df" "df" "df" "fr" "fp" "fp" "fq" "fq" "fr" "fq"
 [988] "fq" "fq" "fq" "fq" "fr" "fr" "fq" "fq" "fq" "fq" "fq" "fq" "fq"
 [ reached getOption("max.print") -- omitted 866 entries ]
> unique(m$data$otu_table$taxon_id)
  [1] "fh" "fi" "fj" "de" "fl" "fm" "fn" "df" "fp" "fq" "fr" "fs" "ft" "fu" "fv" "fx" "fy" "fz" "ga" "gb" "gc"
 [22] "gd" "ge" "gf" "gg" "gh" "gi" "gj" "gk" "gl" "gm" "gn" "go" "gp" "af" "dc" "gs" "gt" "gu" "gv" "gw" "gx"
 [43] "gy" "bv" "ha" "hb" "hc" "hd" "he" "hf" "hg" "hh" "hi" "hj" "dr" "hl" "hm" "oh" "oj" "ho" "cf" "bd" "hs"
 [64] "ht" "hu" "hv" "hw" "ea" "hy" "hz" "ib" "ic" "id" "ie" "if" "ig" "ih" "ii" "ik" "il" "im" "in" "io" "ip"
 [85] "iq" "ir" "is" "ed" "ee" "iv" "ix" "iy" "iz" "ja" "jb" "ac" "jd" "je" "jf" "qw" "jh" "ei" "jj" "rd" "re"
[106] "jk" "cq" "jo" "jp" "jq" "jr" "rq" "eq" "ch" "jv" "dy" "es" "jz" "kb" "eu" "kd" "ke" "kf" "kg" "kh" "ki"
[127] "ep" "kl" "km" "kn" "kp" "kq" "tc" "kr" "te" "ks" "kt" "ku" "kv" "kx" "ky" "kz" "ec" "lb" "lc" "ld" "le"
[148] "lf" "lg" "lh"
> m$data$tax_table$taxon_id
  [1] "ac" "af" "ag" "ah" "ai" "aj" "ak" "al" "an" "ao" "ap" "aq" "as" "at" "av" "aw" "ay" "az" "ba" "bb" "bc"
 [22] "bd" "be" "bf" "bg" "bh" "bi" "bk" "bl" "bm" "bn" "bo" "bp" "br" "bs" "bu" "bv" "bx" "by" "bz" "ca" "cb"
 [43] "cc" "cd" "ce" "cf" "bd" "ch" "ci" "cj" "ck" "cl" "cn" "co" "bd" "cq" "cr" "cs" "ct" "cu" "cv" "cw" "cx"
 [64] "cz" "da" "dc" "dd" "de" "df" "dh" "di" "dj" "dl" "dm" "dn" "do" "dp" "dq" "dr" "ds" "dt" "cf" "bd" "dx"
 [85] "dy" "dz" "ea" "eb" "ec" "ed" "ee" "ef" "eh" "ei" "ej" "ek" "bd" "cq" "eo" "ep" "eq" "ch" "es" "et" "eu"
[106] "ew" "ex" "ey" "ez" "fa" "fb" "fc" "fe" "ff" "fh" "fi" "fj" "de" "fl" "fm" "fn" "fp" "fq" "fr" "fs" "ft"
[127] "fu" "fv" "fx" "fy" "fz" "ga" "gb" "gc" "gd" "ge" "gf" "gg" "gh" "gi" "gj" "gk" "gl" "gm" "gn" "go" "gp"
[148] "gs" "gt" "gu" "gv" "gw" "gx" "gy" "ha" "hb" "hc" "hd" "he" "hf" "hg" "hh" "hi" "hj" "hl" "hm" "hn" "ho"
[169] "cf" "bd" "hs" "ht" "hu" "hv" "hw" "ea" "hy" "hz" "df" "ib" "ic" "id" "ie" "if" "ig" "ih" "ii" "de" "ik"
[190] "il" "im" "in" "io" "ip" "iq" "ir" "is" "ed" "ee" "iv" "ee" "ix" "iy" "iz" "ja" "jb" "jd" "je" "jf" "jg"
[211] "jh" "ei" "jj" "jk" "bd" "cq" "jo" "jp" "jq" "jr" "js" "eq" "ch" "jv" "dy" "es" "jz" "kb" "eu" "kd" "ke"
[232] "kf" "kg" "kh" "ki" "ep" "kl" "km" "kn" "dc" "kp" "kq" "kr" "ks" "kt" "ku" "kv" "kx" "ky" "kz" "ec" "lb"
[253] "lc" "ld" "le" "lf" "lg" "lh" "fh" "fi" "fj" "de" "fj" "fl" "fm" "de" "fn" "fp" "fq" "fr" "ft" "fu" "fv"
[274] "fx" "fy" "fy" "fs" "fz" "ga" "fs" "gd" "gf" "ge" "gg" "gh" "gi" "gj" "gl" "gl" "gm" "gn" "go" "gp" "gs"
[295] "gt" "gu" "gb" "gv" "gw" "gx" "gy" "ha" "hb" "hc" "hd" "he" "hg" "hh" "hi" "hl" "oh" "hm" "oj" "ho" "ho"
[316] "cf" "bd" "hs" "ht" "hu" "hv" "ea" "hy" "hz" "df" "ib" "ic" "id" "ie" "if" "ig" "ht" "ih" "ii" "de" "ik"
[337] "gm" "im" "in" "io" "io" "ip" "iq" "is" "ed" "ee" "iv" "ee" "ix" "iy" "iz" "gj" "jb" "gb" "jd" "je" "jf"
[358] "qw" "jh" "ei" "ho" "jj" "rd" "re" "jk" "bd" "cq" "jo" "jp" "fr" "fx" "jq" "jr" "rq" "hs" "eq" "ch" "jv"
[379] "dy" "es" "hc" "jz" "kb" "eu" "ja" "ja" "kd" "ke" "fq" "kf" "kg" "kh" "ki" "ep" "kl" "km" "kn" "kp" "kq"
[400] "tc" "te" "kt" "ku" "kv" "kv" "kx" "ky" "ky" "ec" "lb" "lc" "io" "ld" "jv" "le" "lf" "lf" "lg" "lh" "ii"
> unique(m$data$tax_table$taxon_id)
  [1] "ac" "af" "ag" "ah" "ai" "aj" "ak" "al" "an" "ao" "ap" "aq" "as" "at" "av" "aw" "ay" "az" "ba" "bb" "bc"
 [22] "bd" "be" "bf" "bg" "bh" "bi" "bk" "bl" "bm" "bn" "bo" "bp" "br" "bs" "bu" "bv" "bx" "by" "bz" "ca" "cb"
 [43] "cc" "cd" "ce" "cf" "ch" "ci" "cj" "ck" "cl" "cn" "co" "cq" "cr" "cs" "ct" "cu" "cv" "cw" "cx" "cz" "da"
 [64] "dc" "dd" "de" "df" "dh" "di" "dj" "dl" "dm" "dn" "do" "dp" "dq" "dr" "ds" "dt" "dx" "dy" "dz" "ea" "eb"
 [85] "ec" "ed" "ee" "ef" "eh" "ei" "ej" "ek" "eo" "ep" "eq" "es" "et" "eu" "ew" "ex" "ey" "ez" "fa" "fb" "fc"
[106] "fe" "ff" "fh" "fi" "fj" "fl" "fm" "fn" "fp" "fq" "fr" "fs" "ft" "fu" "fv" "fx" "fy" "fz" "ga" "gb" "gc"
[127] "gd" "ge" "gf" "gg" "gh" "gi" "gj" "gk" "gl" "gm" "gn" "go" "gp" "gs" "gt" "gu" "gv" "gw" "gx" "gy" "ha"
[148] "hb" "hc" "hd" "he" "hf" "hg" "hh" "hi" "hj" "hl" "hm" "hn" "ho" "hs" "ht" "hu" "hv" "hw" "hy" "hz" "ib"
[169] "ic" "id" "ie" "if" "ig" "ih" "ii" "ik" "il" "im" "in" "io" "ip" "iq" "ir" "is" "iv" "ix" "iy" "iz" "ja"
[190] "jb" "jd" "je" "jf" "jg" "jh" "jj" "jk" "jo" "jp" "jq" "jr" "js" "jv" "jz" "kb" "kd" "ke" "kf" "kg" "kh"
[211] "ki" "kl" "km" "kn" "kp" "kq" "kr" "ks" "kt" "ku" "kv" "kx" "ky" "kz" "lb" "lc" "ld" "le" "lf" "lg" "lh"
[232] "oh" "oj" "qw" "rd" "re" "rq" "tc" "te"

zachary-foster commented 6 years ago

The taxon_ids present in the otu_table/tax_data tables differ from, but are related to, the taxon_ids in the tax_table/diff_table tables created by coercing the phyloseq object into a taxmap object as shown here.

I am not sure I understand. All of the taxon IDs should come from the same set, the result of m$taxon_ids(). If there are taxon IDs (besides NA) that are not in this set, then that is either a bug or the result of filtering using something besides filter_*. The otu_table/tax_data tables will usually only contain "leaf" taxa (e.g. species or genus) since that is what OTUs are usually assigned to, whereas the tax_table/diff_tables can have all taxa, including intermediates, so that could be why they are different subsets of a common set of IDs.
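
The subset relationship described above can be checked directly. The following is a minimal base R sketch, not output from the real object: the three vectors are small illustrative stand-ins (a few IDs copied from the printouts earlier in this thread) for `m$taxon_ids()`, `m$data$otu_table$taxon_id`, and `m$data$tax_table$taxon_id`.

```r
# Illustrative stand-ins only; with the real object these would be
# m$taxon_ids(), m$data$otu_table$taxon_id and m$data$tax_table$taxon_id.
all_taxon_ids <- c("ac", "af", "ag", "fh", "fi")
otu_table_ids <- c("fh", "fi", "fi")
tax_table_ids <- c("ac", "af", "ag", "fh", "fi")

# IDs used by a table but absent from the taxonomy. Anything listed here
# would point to a bug or to filtering done outside the filter_* functions.
unknown_otu_ids <- setdiff(otu_table_ids, all_taxon_ids)
unknown_tax_ids <- setdiff(tax_table_ids, all_taxon_ids)

# IDs that appear only in the per-taxon table: these are the intermediate
# taxa that no OTU is assigned to directly.
only_in_tax_table <- setdiff(tax_table_ids, otu_table_ids)

print(unknown_otu_ids)
print(unknown_tax_ids)
print(only_in_tax_table)
```

If both `unknown_*` vectors are empty, the two tables simply use different subsets of one common ID set.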

While the different taxon_ids are obviously just part of the taxmap hierarchy, the confusion could be remedied by including the OTU label mentioned in #219 for taxon_ids that are legitimately identified/annotated organisms. Intermediate taxon_ids, which are used to identify/accumulate data for each of the intermediate taxonomic ranks, would be left without this OTU label.

This is a good idea for a specific dataset where the "legitimately identified/annotated organisms" are known, but I am not sure how it could be generalized to data of unknown characteristics. What about when multiple OTUs match a single taxon? If you know that there is one OTU per taxon, you can do something like the following:

> library(metacoder)
> print(ex_taxmap)
<Taxmap>
  17 taxa: b. Mammalia, c. Plantae, d. Felidae, e. Notoryctidae, f. Hominidae ... o. typhlops, p. sapiens, q. lycopersicum, r. tuberosum
  17 edges: NA->b, NA->c, b->d, b->e, b->f, c->g, d->h, d->i, e->j, f->k, g->l, h->m, i->n, j->o, k->p, l->q, l->r
  4 data sets:
    info:
      # A tibble: 6 x 4
        taxon_id name  n_legs dangerous
        <chr>    <chr>  <dbl> <lgl>    
      1 m        tiger   4.00 T        
      2 n        cat     4.00 F        
      3 o        mole    4.00 F        
      # ... with 3 more rows
    phylopic_ids: a named vector of 'character' with 6 items
       m. e148eabb-f138-43c6-b1e4-5cda2180485a, n. 12899ba0-9923-4feb-a7f9-758c3c7d5e13 ... r. 63604565-0406-460b-8cb8-1abe954b3f3a
    foods: a list of 6 items named by taxa:
       m, n, o, p, q, r
    abund:
      # A tibble: 8 x 5
        taxon_id code  sample_id count taxon_index
        <chr>    <fct> <fct>     <dbl>       <int>
      1 m        T     A          1.00           1
      2 n        C     A          2.00           2
      3 o        M     B          5.00           3
      # ... with 5 more rows
  1 functions:
 reaction
> (new_id_key <- ex_taxmap$map_data(taxon_ids, name))
       b        c        d        e        f        g        h        i        j        k        l        m        n        o        p        q 
      NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA  "tiger"    "cat"   "mole"  "human" "tomato" 
       r 
"potato" 
> (new_ids <- ifelse(is.na(new_id_key), names(new_id_key), new_id_key))
       b        c        d        e        f        g        h        i        j        k        l        m        n        o        p        q 
     "b"      "c"      "d"      "e"      "f"      "g"      "h"      "i"      "j"      "k"      "l"  "tiger"    "cat"   "mole"  "human" "tomato" 
       r 
"potato" 
> ex_taxmap$replace_taxon_ids(new_ids = new_ids)
<Taxmap>
  17 taxa: b. Mammalia, c. Plantae, d. Felidae, e. Notoryctidae ... mole. typhlops, human. sapiens, tomato. lycopersicum, potato. tuberosum
  17 edges: NA->b, NA->c, b->d, b->e, b->f, c->g, d->h, d->i, e->j, f->k, g->l, h->tiger, i->cat, j->mole, k->human, l->tomato, l->potato
  4 data sets:
    info:
      # A tibble: 6 x 4
        taxon_id name  n_legs dangerous
        <chr>    <chr>  <dbl> <lgl>    
      1 tiger    tiger   4.00 T        
      2 cat      cat     4.00 F        
      3 mole     mole    4.00 F        
      # ... with 3 more rows
    phylopic_ids: a named vector of 'character' with 6 items
       tiger. e148eabb-f138-43c6-b1e4-5cda2180485a ... potato. 63604565-0406-460b-8cb8-1abe954b3f3a
    foods: a list of 6 items named by taxa:
       tiger, cat, mole, human, tomato, potato
    abund:
      # A tibble: 8 x 5
        taxon_id code  sample_id count taxon_index
        <chr>    <fct> <fct>     <dbl>       <int>
      1 tiger    T     A          1.00           1
      2 cat      C     A          2.00           2
      3 mole     M     B          5.00           3
      # ... with 5 more rows
  1 functions:
 reaction

Another thing I often do is add OTU IDs as a "rank" in the taxonomy, so there is an OTU "taxon" below species, or below whatever rank the OTU is assigned to. This way OTUs show up as nodes in the heat trees. This is pretty easy to do when parsing your data from tables: just add the OTU ID column name to the class_cols option of parse_tax_data. Would this do what you want? I could easily add that as a T/F option to the phyloseq parser.
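
A minimal sketch of the OTU-as-rank idea follows, under stated assumptions: the table, its column names, and the OTU labels (`OTU_1` and so on) are hypothetical, and the `parse_tax_data` call only runs if metacoder is installed and exports that function. The base R part shows the classification each row would end up with once the OTU ID column is appended to the rank columns.

```r
# Hypothetical OTU assignments; names and values are for illustration only.
otu_assignments <- data.frame(
  otu_id  = c("OTU_1", "OTU_2", "OTU_3"),
  Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Firmicutes", "Tenericutes"),
  Genus   = c("Megamonas", "Megamonas", "Anaeroplasma"),
  stringsAsFactors = FALSE
)

# Rank columns in order, with the OTU ID column appended as the last "rank".
class_cols <- c("Kingdom", "Phylum", "Genus", "otu_id")

# The classification each row would get: every OTU becomes its own leaf.
lineage <- apply(otu_assignments[, class_cols], 1, paste, collapse = ";")
print(unname(lineage))

# Optional: parse with metacoder if it is available (assumes a version that
# exports parse_tax_data).
if (requireNamespace("metacoder", quietly = TRUE)) {
  obj <- metacoder::parse_tax_data(otu_assignments, class_cols = class_cols)
  print(obj)
}
```

With this layout two OTUs assigned to the same genus stay distinct, because each one is a separate leaf under that genus.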

The intermediate taxon_ids could also be given a group OTU label that gives access to the list of OTU labels that are included in that taxonomy.

That information can be obtained for any table using the obs function with the value option, e.g.:

> obs(ex_taxmap, "info", value = "name")
$b
      m       n       o       p 
"tiger"   "cat"  "mole" "human" 

$c
       q        r 
"tomato" "potato" 

$d
      m       n 
"tiger"   "cat" 

$e
     o 
"mole" 

$f
      p 
"human" 

$g
       q        r 
"tomato" "potato" 

$h
      m 
"tiger" 

$i
    n 
"cat" 

$j
     o 
"mole" 

$k
      p 
"human" 

$l
       q        r 
"tomato" "potato" 

$m
      m 
"tiger" 

$n
    n 
"cat" 

$o
     o 
"mole" 

$p
      p 
"human" 

$q
       q 
"tomato" 

$r
       r 
"potato" 

This would also be another way to find new taxon IDs if you wanted to: just replace the ID of any taxon that has exactly one OTU with that OTU's ID.
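
One possible way to sketch that in base R is shown below. The list is a hand-copied subset of the `obs(ex_taxmap, "info", value = "name")` output above, and `new_ids` is a hypothetical name. The sketch adds one restriction that is an assumption on my part rather than part of the suggestion: a taxon is renamed only when its single observation is assigned directly to it, because a parent with one child (such as `h` and `m`) would otherwise receive the same new ID as the child.

```r
# Subset of the obs() output above. Each element lists the observations
# under a taxon, named by the taxon ID each observation is assigned to.
obs_by_taxon <- list(
  b = c(m = "tiger", n = "cat", o = "mole", p = "human"),
  d = c(m = "tiger", n = "cat"),
  h = c(m = "tiger"),
  m = c(m = "tiger"),
  n = c(n = "cat")
)

# TRUE for taxa with exactly one observation that is assigned to the taxon
# itself rather than inherited from a descendant.
is_direct_single <- vapply(
  names(obs_by_taxon),
  function(id) {
    x <- obs_by_taxon[[id]]
    length(x) == 1 && identical(names(x), id)
  },
  logical(1)
)

# Start from the current IDs and swap in the observation label where allowed.
new_ids <- names(obs_by_taxon)
names(new_ids) <- names(obs_by_taxon)
new_ids[is_direct_single] <- vapply(
  obs_by_taxon[is_direct_single],
  function(x) unname(x),
  character(1)
)

# Taxon IDs must stay unique for the replacement to make sense.
stopifnot(anyDuplicated(new_ids) == 0)
print(new_ids)
```

The resulting named vector has the same shape as the `new_ids` passed to `replace_taxon_ids` in the example earlier in this comment.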

Thanks for the thoughts!