joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
586 stars 186 forks source link

Possible Feature? estimate_presence? #1639

Open mawa86 opened 2 years ago

mawa86 commented 2 years ago

Hey Joey, thanks for this incredible tool! I have to admit it might sound like a stupid question as I am not that experienced with R, but ill give it a go:

Im looking for an easy way to automatically create a simple yes/no (or positive/negative) variable in sample_data for a given taxon, which would assign a value according to a user-defined threshold value in the otu_table. for example:

estimate_presence(PhyloseqObject, variable= "Strep", rank= genus, tax= "Streptococcus", threshold= 0.02)

...returning a variable ("Strep") in sample_data where for each sample a positive/negative entry is assigned according to whether more or less than 2% of reads are assigned to the OTU corresponding to the genus Streptococcus.

Could it be of interest? and does anyone want to point me in the right direction on how to achieve something like that in R at the moment? maybe @ycl6 can help me?

Any help very appreciated! thanks Martin

ycl6 commented 2 years ago

Hi @mawa86

What you suggests is doable, but I don't think this will get added officially into the phyloseq package (even if people find this useful) as phyloseq is not actively maintained by the developer(s) anymore.

mawa86 commented 2 years ago

Thank you for your reply @ycl6 !

with your admirable R skills, would you be able to show me how to achieve something like that?

martin

david-barnett commented 2 years ago

You might be interested in the tax_transform functionality from microViz, which works with phyloseq objects and can solve your problem (disclaimer, I'm the maintainer)

library(microViz)
#> microViz version 0.10.3 - Copyright (C) 2022 David Barnett
#> ! Website: https://david-barnett.github.io/microViz
#> ✔ Useful?  For citation details, run: `citation("microViz")`
#> ✖ Silence? `suppressPackageStartupMessages(library(microViz))`

# get some example phyloseq data 
data("shao19")

# subset samples for a smaller dataset
psq <- shao19 %>% ps_filter(age == 38) 

# inside we see the count data
otu_get(psq, taxa = 1:5, samples = 1:5)
#> OTU Table:          [5 taxa and 5 samples]
#>                      taxa are columns
#>           Escherichia coli Bacteroides caccae Bacteroides stercoris
#> B01089_mo          7371972             595532                746435
#> B01471_mo            59606                  0               1986747
#> B01716_mo                0                  0                     0
#> B01771_mo                0             576325               3137262
#> B02098_mo           723595              21251                 20985
#>           Ruminococcus bromii [Eubacterium] rectale
#> B01089_mo                   0                133105
#> B01471_mo             1201752                281075
#> B01716_mo                   0                 78437
#> B01771_mo              988615                220861
#> B02098_mo              308852                222002

# we can transform the counts in various ways, e.g. to proportions
compositional <- psq %>% tax_transform("compositional") 
otu_get(compositional, taxa = 1:5, samples = 1:5)
#> OTU Table:          [5 taxa and 5 samples]
#>                      taxa are columns
#>           Escherichia coli Bacteroides caccae Bacteroides stercoris
#> B01089_mo      0.416323960        0.033632011           0.042154090
#> B01471_mo      0.002955227        0.000000000           0.098501647
#> B01716_mo      0.000000000        0.000000000           0.000000000
#> B01771_mo      0.000000000        0.029512358           0.160652408
#> B02098_mo      0.034895678        0.001024839           0.001012011
#>           Ruminococcus bromii [Eubacterium] rectale
#> B01089_mo          0.00000000           0.007516958
#> B01471_mo          0.05958210           0.013935519
#> B01716_mo          0.00000000           0.003611266
#> B01771_mo          0.05062484           0.011309815
#> B02098_mo          0.01489452           0.010706141

# or to presence/absence binary indicators
presenceAbsence <- psq %>% tax_transform("binary")
otu_get(presenceAbsence, taxa = 1:5, samples = 1:5)
#> OTU Table:          [5 taxa and 5 samples]
#>                      taxa are columns
#>           Escherichia coli Bacteroides caccae Bacteroides stercoris
#> B01089_mo                1                  1                     1
#> B01471_mo                1                  0                     1
#> B01716_mo                0                  0                     0
#> B01771_mo                0                  1                     1
#> B02098_mo                1                  1                     1
#>           Ruminococcus bromii [Eubacterium] rectale
#> B01089_mo                   0                     1
#> B01471_mo                   1                     1
#> B01716_mo                   0                     1
#> B01771_mo                   1                     1
#> B02098_mo                   1                     1

# for your example request, slightly more complex but possible:
presenceAbsenceGenus_0.02 <- psq %>% 
  tax_transform("compositional", rank = "genus") %>% 
  tax_transform("binary", undetected = 0.02, chain = TRUE)

otu_get(presenceAbsenceGenus_0.02, taxa = 1:5, samples = 1:5)
#> OTU Table:          [5 taxa and 5 samples]
#>                      taxa are columns
#>           Escherichia Bacteroides Ruminococcus Clostridia class Bifidobacterium
#> B01089_mo           1           1            0                0               1
#> B01471_mo           0           1            1                0               1
#> B01716_mo           0           1            0                0               1
#> B01771_mo           0           1            1                1               1
#> B02098_mo           1           1            0                0               1

# phyloseq then provides a (confusingly named) function to get the sample
# abundance vector for a named taxon, e.g. "Streptococcus")
strepPresent <- phyloseq::get_sample(presenceAbsenceGenus_0.02, "Streptococcus") 

# you could add this back on to your original sample data as a T/F vector like so
psq <- psq %>% ps_mutate(Strep0.02 = as.logical(strepPresent))
phyloseq::sample_data(psq)[1:5, "Strep0.02"]
#>           Strep0.02
#> B01089_mo     FALSE
#> B01471_mo      TRUE
#> B01716_mo     FALSE
#> B01771_mo     FALSE
#> B02098_mo     FALSE

# for fun let's see some composition heatmaps to illustrate the transformations
psq %>% comp_heatmap(taxa = 1:25)
#> Registered S3 method overwritten by 'seriation':
#>   method         from 
#>   reorder.hclust vegan

compositional %>% comp_heatmap(taxa = 1:25)

presenceAbsence %>% comp_heatmap(taxa = 1:25, show_heatmap_legend = FALSE)

presenceAbsenceGenus_0.02 %>% comp_heatmap(taxa = 1:25, show_heatmap_legend = FALSE)

Created on 2022-11-18 with reprex v2.0.2

ycl6 commented 2 years ago

Hi @mawa86

Hope the solution offered by @david-barnett answers your question.

mawa86 commented 2 years ago

hi @david-barnett and @ycl6

Thanks very much for this tool David, looks neat!

thanks to both for the time

/Martin