waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0
126 stars 28 forks source link

add method or advice for conversion to phyloseq objects #225

Closed schifferl closed 3 years ago

schifferl commented 3 years ago

Is your feature request related to a problem? Please describe. The phyloseq package is the de facto standard for analysis of metagenomics data in R/Bioconductor and conversion to their phyloseq class should be provided in curatedMetagenomicData.

Describe the solution you'd like If there is another well-maintained package that already has this functionality, we could use their method. I think that's unlikely, but we should at least look first before writing the method from scratch. Otherwise this method would be implemented by: 1) spliting the rownames of the TreeSummarizedExperiment into rowData to be used as the tax_table of the phyloseq object, 2) extracting the colData as the sample_data of the phyloseq object, 3) extracting the relative_abundance (already converted to counts) matrix as the otu_table of the phyloseq object, and 4) inserting the phylogenetic tree from the TreeSummarizedExperiment in the right place.

Describe alternatives you've considered See above, the method would be borrowed from another package if possible and we would simply advise users of availability. Ideally, the method should be in the phyloseq package itself, but I don't think it is.

Additional context The method should be named asPhyloseq (no dot because methods dispatch is not really being called). This issue is closely related to issues #214 (my recent comments) and #224. It replaces and closes #190.

schifferl commented 3 years ago

If you are up for it, can you investigate @sdgamboa?

sdgamboa commented 3 years ago

@schifferl, sure, I'll investigate about it.

sdgamboa commented 3 years ago

@schifferl, I found a Bioconductor package named mia that has a function to convert from TreeSummarizedExperiment to phyloseq, although I think the package is still in devel version. I leave an example below.

# BiocManager::install("microbiome/mia")
packages <- c("mia", "TreeSummarizedExperiment", "phyloseq", "HMP16SData")
for (pkg in packages) suppressMessages(library(pkg, character.only = TRUE))

# A SummarziedExperiment-class object
se <- suppressMessages(V35())
se
#> class: SummarizedExperiment 
#> dim: 45383 4743 
#> metadata(2): experimentData phylogeneticTree
#> assays(1): 16SrRNA
#> rownames(45383): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
#> rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
#> colnames(4743): 700013549 700014386 ... 700114717 700114750
#> colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID

# A TreeSummarizedExperiment-class object
tse <- TreeSummarizedExperiment(
    assays = list(count = assays(se)$`16SrRNA`),
    rowData = rowData(se),
    rowTree = metadata(se)$phylogeneticTree,
    colData = colData(se),
)
#> Warning: 47 row(s) couldn't be matched to the tree and are/is removed.
tse
#> class: TreeSummarizedExperiment 
#> dim: 45336 4743 
#> metadata(0):
#> assays(1): count
#> rownames(45336): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
#> rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
#> colnames(4743): 700013549 700014386 ... 700114717 700114750
#> colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (45336 rows)
#> rowTree: 1 phylo tree(s) (45364 leaves)
#> colLinks: NULL
#> colTree: NULL

# A phyloseq-class object
ps <- mia::makePhyloseqFromTreeSummarizedExperiment(tse, abund_values = "count")
ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 45336 taxa and 4743 samples ]
#> sample_data() Sample Data:       [ 4743 samples by 7 sample variables ]
#> tax_table()   Taxonomy Table:    [ 45336 taxa by 5 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]

# Note: Columns that do not match "kingdom", "class", "order", "family", "genus",
# or "species" are removed from tax_table
colnames(rowData(tse))
#> [1] "CONSENSUS_LINEAGE" "SUPERKINGDOM"      "PHYLUM"           
#> [4] "CLASS"             "ORDER"             "FAMILY"           
#> [7] "GENUS"
colnames(tax_table(ps))
#> [1] "PHYLUM" "CLASS"  "ORDER"  "FAMILY" "GENUS"

Created on 2021-05-03 by the reprex package (v2.0.0)

lwaldron commented 3 years ago

Nice find! I’d say if mia is being released too, you could just document this in the vignette, and mention it in the ExpressionSet2phyloseq defunct error.--

Levi Waldron

Associate Professor

Department of Epidemiology and Biostatistics

CUNY Graduate School of Public Health and Health Policy

Institute for Implementation Science in Population Health

55 W 125th St, New York NY 10035

https://waldronlab.io

lwaldron commented 3 years ago

Check that this works e.g. on the Asnicar TreeSummarizedExperiment.

sdgamboa commented 3 years ago

@lwaldron , @schifferl , the mia::makePhyloseqFromTreeSummarizedExperiment function seems to work for cMD even without rowData. I leave an example and the definition of the method below (the output is a bit long).

packages <-  c("curatedMetagenomicData", "mia", "TreeSummarizedExperiment",
               "phyloseq")
for (pkg in packages) suppressMessages(library(pkg, character.only = TRUE))

# Import data from cMD
tse <- curatedMetagenomicData("AsnicarF_2021.relative_abundance", dryrun = FALSE)
#> snapshotDate(): 2021-05-05
#> Warning: 380 row(s) couldn't be matched to the tree and are/is removed.
tse
#> $`2021-03-31.AsnicarF_2021.relative_abundance`
#> class: TreeSummarizedExperiment 
#> dim: 639 1098 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(639):
#>   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_vulgatus
#>   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_stercoris
#>   ...
#>   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
#>   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
#> rowData names(0):
#> colnames(1098): SAMEA7041133 SAMEA7041134 ... SAMEA7045952 SAMEA7045953
#> colData names(20): studyName subjectID ... curator NCBI_accession
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (639 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

# Convert to phyloseq-class object
ps <- makePhyloseqFromTreeSummarizedExperiment(tse[[1]], abund_value = "relative_abundance")
ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 639 taxa and 1098 samples ]
#> sample_data() Sample Data:       [ 1098 samples by 20 sample variables ]
#> phy_tree()    Phylogenetic Tree: [ 639 tips and 638 internal nodes ]

# How data looks like
otu_table(ps) |> 
    rownames() |>
    head()
#> [1] "k__Archaea|p__Euryarchaeota|c__Thermoplasmata|o__Methanomassiliicoccales|f__Methanomassiliicoccaceae|g__Methanomassiliicoccus|s__Candidatus_Methanomassiliicoccus_intestinalis"
#> [2] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_cuniculi"                                      
#> [3] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae"                                    
#> [4] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_arboriphilus"                          
#> [5] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_oralis"                                
#> [6] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii"

otu_table(ps) |> 
    colnames() |>
    head()
#> [1] "SAMEA7041133" "SAMEA7041134" "SAMEA7041135" "SAMEA7041136" "SAMEA7041137"
#> [6] "SAMEA7041138"

sample_data(ps) |>
    lapply(\(y) if (is.character(y)) {as.factor(y)} else {y}) |>
    as.data.frame() |>
    summary()
#>          studyName      subjectID    body_site    antibiotics_current_use
#>  AsnicarF_2021:1098   s10091 :   1   stool:1098   no:1098                
#>                       s10092 :   1                                       
#>                       s10181 :   1                                       
#>                       s10182 :   1                                       
#>                       s10332 :   1                                       
#>                       s10691 :   1                                       
#>                       (Other):1092                                       
#>  study_condition    disease          age           age_category     gender   
#>  control:1098    healthy:1098   Min.   :18.00   adult    :1096   female:792  
#>                                 1st Qu.:36.00   schoolage:   2   male  :306  
#>                                 Median :46.00                                
#>                                 Mean   :44.81                                
#>                                 3rd Qu.:54.00                                
#>                                 Max.   :65.00                                
#>                                                                              
#>  country    non_westernized      sequencing_platform    DNA_extraction_kit
#>  GBR:1001   no:1098         IlluminaNovaSeq:1098     PowerSoilPro:1098    
#>  USA:  97                                                                 
#>                                                                           
#>                                                                           
#>                                                                           
#>                                                                           
#>                                                                           
#>        PMID       number_reads       number_bases       minimum_read_length
#>  33432175:1098   Min.   : 5992767   Min.   :9.049e+08   Min.   :150        
#>                  1st Qu.:25336668   1st Qu.:3.826e+09   1st Qu.:151        
#>                  Median :27425643   Median :4.141e+09   Median :151        
#>                  Mean   :29807717   Mean   :4.501e+09   Mean   :151        
#>                  3rd Qu.:31851070   3rd Qu.:4.808e+09   3rd Qu.:151        
#>                  Max.   :93097010   Max.   :1.406e+10   Max.   :151        
#>                                                                            
#>  median_read_length                           curator        NCBI_accession
#>  Min.   :151        Francesco_Asnicar;Paolo_Manghi:1098   ERR4330026:   1  
#>  1st Qu.:151                                              ERR4330027:   1  
#>  Median :151                                              ERR4330028:   1  
#>  Mean   :151                                              ERR4330029:   1  
#>  3rd Qu.:151                                              ERR4330030:   1  
#>  Max.   :151                                              ERR4330031:   1  
#>                                                           (Other)   :1092

phy_tree(ps)
#> 
#> Phylogenetic tree with 639 tips and 638 internal nodes.
#> 
#> Tip labels:
#>   k__Archaea|p__Euryarchaeota|c__Thermoplasmata|o__Methanomassiliicoccales|f__Methanomassiliicoccaceae|g__Methanomassiliicoccus|s__Candidatus_Methanomassiliicoccus_intestinalis, k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_cuniculi, k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae, k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_arboriphilus, k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_oralis, k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii, ...
#> 
#> Rooted; includes branch lengths.

# Method definition
findMethods(makePhyloseqFromTreeSummarizedExperiment)
#> An object of class  "listOfMethods" 
#> $SummarizedExperiment
#> Method Definition:
#> 
#> function (x, ...) 
#> {
#>     .local <- function (x, abund_values = "counts") 
#>     {
#>         .require_package("phyloseq")
#>         if (!all(dim(x) > 0)) {
#>             stop("'x' contains zero rows. 'x' can not be converted\n                 to a phyloseq object.", 
#>                 call. = FALSE)
#>         }
#>         .check_assay_present(abund_values, x)
#>         args = list()
#>         otu_table <- as.matrix(assay(x, abund_values))
#>         otu_table <- phyloseq::otu_table(otu_table, taxa_are_rows = TRUE)
#>         args[["otu_table"]] <- otu_table
#>         if (!(length(rowData(x)[, taxonomyRanks(x)]) == 0 || 
#>             is.null((rowData(x)[, taxonomyRanks(x)])))) {
#>             tax_table <- as.matrix(rowData(x)[, taxonomyRanks(x)])
#>             tax_table <- phyloseq::tax_table(tax_table)
#>             args[["tax_table"]] <- tax_table
#>         }
#>         if (!(length(colData(x)) == 0 || is.null(ncol(colData(x))))) {
#>             sample_data <- as.data.frame(colData(x))
#>             sample_data <- phyloseq::sample_data(sample_data)
#>             args[["sample_data"]] <- sample_data
#>         }
#>         phyloseq <- do.call(phyloseq::phyloseq, args)
#>         return(phyloseq)
#>     }
#>     .local(x, ...)
#> }
#> <bytecode: 0x55bcdbab5c50>
#> <environment: namespace:mia>
#> 
#> Signatures:
#>         x                     
#> target  "SummarizedExperiment"
#> defined "SummarizedExperiment"
#> 
#> $TreeSummarizedExperiment
#> Method Definition:
#> 
#> function (x, ...) 
#> {
#>     obj <- callNextMethod()
#>     args = list()
#>     if (!is(obj, "phyloseq")) {
#>         args[["otu_table"]] <- obj
#>     }
#>     if (!(length(rowTree(x)) == 0 || is.null(rowTree(x)))) {
#>         phy_tree <- rowTree(x)
#>         phy_tree <- phyloseq::phy_tree(phy_tree)
#>         if (is(obj, "phyloseq")) {
#>             phyloseq::phy_tree(obj) <- phy_tree
#>         }
#>         else {
#>             args[["phy_tree"]] <- phy_tree
#>         }
#>     }
#>     if (!(length(referenceSeq(x)) == 0 || is.null(referenceSeq(x)))) {
#>         refseq <- referenceSeq(x)
#>         refseq <- phyloseq::refseq(refseq)
#>         if (is(obj, "phyloseq")) {
#>             obj <- phyloseq::merge_phyloseq(obj, refseq)
#>         }
#>         else {
#>             args[["refseq"]] <- refseq
#>         }
#>     }
#>     if (!is(obj, "phyloseq")) {
#>         phyloseq <- do.call(phyloseq::phyloseq, args)
#>     }
#>     else {
#>         phyloseq <- obj
#>     }
#>     phyloseq
#> }
#> <bytecode: 0x55bcdbac5ef0>
#> <environment: namespace:mia>
#> 
#> Signatures:
#>         x                         
#> target  "TreeSummarizedExperiment"
#> defined "TreeSummarizedExperiment"
#> 
#> Slot arguments:
#> [1] "x"
#> Slot signatures:
#> [[1]]
#> [1] "SummarizedExperiment"
#> 
#> [[2]]
#> [1] "TreeSummarizedExperiment"
#> 
#> Slot generic:
#> standardGeneric for "makePhyloseqFromTreeSummarizedExperiment" defined from package "mia"
#> 
#> function (x, ...) 
#> standardGeneric("makePhyloseqFromTreeSummarizedExperiment")
#> <bytecode: 0x55bcdbab1bf8>
#> <environment: 0x55bcdba7c5d8>
#> Methods may be defined for arguments: x
#> Use  showMethods(makePhyloseqFromTreeSummarizedExperiment)  for currently available ones.

Created on 2021-05-05 by the reprex package (v2.0.0)

sdgamboa commented 3 years ago

@lwaldron, @schifferl, I just added the phyloseq information to the curatedMetagenomicData.Rmd vignette. I tried to reproduce the figures of the phyloseq section of the currently published vignette. I added mia, phyloseq, and ggplot2 (I added this to improve the visualization of the heatmap figure) to the DESCRIPTION file (Suggests).