waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0
126 stars 28 forks source link

Reversing ExpressionSet2MRexperiment and ExpressionSet2phyloseq #96

Closed lgeistlinger closed 7 years ago

lgeistlinger commented 7 years ago

For downstream analysis, such as e.g. enrichment analysis with the EnrichmentBrowser, it would be great if these operation could be reversed. This means two functions MRexperiment2ExpressionSet and phyloseq2ExpressionSet would be nice to have in order to continue analysis with a data structure that is well established.

I've already tried

as(mrexp, "ExpressionSet")

using an object mrexp of class MRexperiment but that throws an error.

schifferl commented 7 years ago

thank you @lgeistlinger I will see if it can be done easily

lwaldron commented 7 years ago

Such a coercion function would be best situated in the phyloseq package, @joey711 do you agree? The reason for a special ExpressionSet2phyloseq() package in curatedMetagenomicData is because it makes specific assumptions about the rownames and the contents of pData in order to construct the phyloseq object. I can't think of any such assumptions needed for going in reverse.

joey711 commented 7 years ago

I agree. phyloseq2ExpressionSet makes sense, and should go in phyloseq. Things that depend on ExpressionSet-class (and there are many) should not need to know about phyloseq to operate properly, and a maintenance nightmare/anti-pattern if every new special case package required that the general abstraction class (which we'll say is ExpressionSet-class) needed to add more transformation methods.

joey711 commented 7 years ago

fyi, I was just looking at an equivalent biom2ExpressionSet method to add to biomformat package. I can borrow that pattern for the phyloseq case.

Happy to take a peek at any draft code anyone might have already lying around, or tips. Otherwise I'm just going to go off of the constructors in the vignette.

lwaldron commented 7 years ago

This seems to work as a coercion definition:

setAs(from="phyloseq", to="ExpressionSet", function(object) {
  library(Biobase)
  ExpressionSet(assayData=otu_table(object),
                phenoData=AnnotatedDataFrame(data.frame(sample_data(object))),
                featureData=AnnotatedDataFrame(data.frame(otu_table(object))))
  })

Not sure, but putting library(Biobase) seems appropriate instead of importFrom( Biobase, ExpressionSet, AnnotatedDataFrame) in the NAMESPACE, so the user has Biobase loaded after doing this coercion and can actually use the object.

joey711 commented 7 years ago

nice. Thanks. I think featureData= is probably tax_table(object).

That last bit about loading Biobase makes sesnse, too. Will try to keep Biobase a Suggests dependency.

lwaldron commented 7 years ago

Oops, yes good catch tax_table(object), and Biobase in Suggests should be fine. @h.corradabravo I posted an Issue also at https://github.com/HCBravoLab/metagenomeSeq/issues/53.

@lgeistlinger have you tried just using the MRexperiment objects as is? Since it inherits from eSet, it should just work if the functions you're using only need is(mrexp, "eSet").

lgeistlinger commented 7 years ago

This is what I did

library(curatedMetagenomicData)
library(metagenomeSeq)

zeller.eset <- ZellerG_2014.metaphlan_bugs_list.stool()
zeller.mexp <- ExpressionSet2MRexperiment(zeller.eset)

That throws an error:

as(zeller.mexp, "ExpressionSet")

That indeed works:

as(zeller.mexp, "eSet")
joey711 commented 7 years ago

That's strange. From "?eSet-class" doc:

eSet is a virtual class, so instances cannot be created.

Objects created under previous definitions of eSet-class can be coerced to the current classes derived from eSet using updateOldESet.

joey711 commented 7 years ago

From metagenomeSeq allClasses.R:

setClass("MRexperiment", contains=c("eSet"), representation=representation(expSummary = "environment"),prototype = prototype( new( "VersionedBiobase",versions = c(classVersion("eSet"),MRexperiment = "1.0.0" ))))

Should it inherit from ExpressionSet instead of directly from virtual class eSet? What's the convention recommended by BioC?

lgeistlinger commented 7 years ago

@lwaldron @joey711 :

In the setAs from phyloseq to ExpressionSet, I guess you would typically use require instead of library.

You could also use the .isAvailable hook as eg. available here:

https://github.com/lgeistlinger/EnrichmentBrowser/blob/master/R/utils.R

lwaldron commented 7 years ago

Since Biobase will never be unavailable if phyloseq is installed, I don't think you have to bother checking, and library/require will have the same side-effect of loading the package (if different return values). @lgeistlinger your hook is nice, especially checking all the available annotation packages and installing. In simpler cases, someone less helpful could make the user install packages themselves and just do:

if (!requireNamespace("veryimportantpackage")) {
  stop("Please install the veryimportantpackage package to use this functionality")
}

@lgeistlinger you might check whether it's release/devel by even/oddness of version number at https://github.com/lgeistlinger/EnrichmentBrowser/blob/b3cac3d2a285aaec69dbf23f63371661e13f60f3/R/utils.R#L53 ?

lwaldron commented 7 years ago

@lgeistlinger as(zeller.mexp, "eSet") doesn't do anything, it'll return back the MRExperiment class object. I meant that, anything requiring an eSet-derived class will just work with MRExperiment class object. But not if it requires an ExpressionSet-derived class.

@joey711 new classes today should really derive from SummarizedExperiment instead of either eSet or ExpressionSet. But on the difference between eSet and ExpressionSet, the ExpressionSet definition only puts an added requirement on the experimentData slot (is MIAME class). Deriving from ExpressionSet would seem more beneficial so that the object is both eSet and ExpressionSet, and probably compatible with more other Bioconductor packages.

jnpaulson commented 7 years ago

I'll add some coercion functions for an expression set and summarizedexperiment

lwaldron commented 7 years ago

@jnpaulson you may not have to do anything since MRexperiment is derived from eSet, but 1) I have some questions for bioc-devel about why not all of these coercions work, and 2) deriving from ExpressionSet instead would give you direct compatibility with things expecting ExpressionSet and eliminate the need to coerce to ExpressionSet:

> library(metagenomeSeq)
> data(mouseData)
> SummarizedExperiment(mouseData)
class: SummarizedExperiment
dim: 10172 139
metadata(0):
assays(1): ''
rownames(10172): Prevotellaceae:1 Lachnospiraceae:1 ... Bryantella:103
  Parabacteroides:956
rowData names(0):
colnames(139): PM1:20080107 PM1:20080108 ... PM9:20080225 PM9:20080303
colData names(0):
> as(mouseData, "SummarizedExperiment")
Error in as(mouseData, "SummarizedExperiment") :
  no method or default for coercing “MRexperiment” to “SummarizedExperiment”
> ExpressionSet(mouseData)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘ExpressionSet’ for signature ‘"MRexperiment"’
> as(mouseData, "ExpressionSet")
Error in updateOldESet(from, "ExpressionSet") :
  no slot of name "pData" for this object of class "AnnotatedDataFrame"
>
lgeistlinger commented 7 years ago

@lwaldron:

new classes today should really derive from SummarizedExperiment

Totally agreed. Accordingly having a function phyloseq2SummarizedExperiment would be fine (as soon as coercion from ExpressionSet to SummarizedExperiment, and vice versa, is possible; which currently for some reason doesn't seem to be the case).

My issue with dealing with MRexperiment was that it does not support the usual accessing via exprs(mrexp), but rather assayData(mrexp)[["counts"]] or the shortcut MRcounts(mrexp). But this is fine, I guess, just makes downstream handling a bit more difficult.

jnpaulson commented 7 years ago

That was intentional in our design.

On 2017-09-11 09:11, lgeistlinger wrote:

@lwaldron [1]:

new classes today should really derive from SummarizedExperiment

Totally agreed. Accordingly having a function phyloseq2SummarizedExperiment would be fine (as soon as coercion from ExpressionSet to SummarizedExperiment, and vice versa, is possible; which currently for some reason doesn't seem to be the case).

My issue with dealing with MRexperiment was that it does not support the usual accessing via exprs(mrexp), but rather assayData(mrexp)[["counts"]] or the shortcut MRcounts(mrexp). But this is fine, I guess, just makes downstream handling a bit more difficult.

-- You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub [2], or mute the thread [3].

Links:

[1] https://github.com/lwaldron [2] https://github.com/waldronlab/curatedMetagenomicData/issues/96#issuecomment-328579042 [3] https://github.com/notifications/unsubscribe-auth/AC5hgx-TK36xpeyWomw0SGzjYXvlH37kks5shVu3gaJpZM4PRub8

schifferl commented 7 years ago

Hi all, I wonder if there has been any progress on this issue since September. Specifically, does the SummarizedExperiment2phyloseq method exist at present? I need it for another project and will write it if someone else hasn't yet – just want to know before I dive in. Also, I wonder if we can close this issue and / or refine it a bit into a more discrete TODO. Let me know! Thanks

lwaldron commented 7 years ago

It does not exist, but could be implemented easily, the essence being:

SummarizedExperiment2phyloseq <- function(object){
    eset <- as(object, "ExpressionSet")
    ExpressionSet2phyloseq(eset)
}

Since there are no SummarizedExperiments in curatedMetagenomicData, I think this would currently belong in HMP16SData rather than here.

If there is motivation on the part of microbiome package developers @joey711 @jnpaulson, I think we would all benefit from agreeing on a new Bioconductor core class for microbiome data that extends SummarizedExperiment. For a little pain this would have a number of long-term benefits. Perhaps we could schedule a call to discuss, but we don't need to determine it on this Issue thread. For now, I would close this issue and say @lgeistlinger should just use hacks to reverse these transformations while we try to agree on a unified approach to representing microbiome data in Bioconductor.

lgeistlinger commented 7 years ago

I'm fine with that for the moment.

schifferl commented 7 years ago

Great, I had in mind to take the strategy you outlined above. As for the new class, it does seem that an extension of SummarizedExperiment would be the right approach and I am happy to do the "pain" part of the job. A call and further discussion would be better forum though. I am going to close this for now and will put the SummarizedExperiment2phyloseq method in the HMP16SData package. 👍

jnpaulson commented 7 years ago

@lwaldron @joey711 Agreed. We should meet / call and make some of these calls. We've been floating it around for a while now and I think it's time to migrate them all.

lwaldron commented 7 years ago

@jnpaulson that's great. @joey711 what do you think? I'm willing to work on design and even on helping integrate into phyloseq and metagenomeSeq, as long as you're on board too. I think it can be done in a way as to maintain the current phyloseq API, but replacing the class with one containing SummarizedExperiment. This will have a number of long-term benefits, including compatibility with common Bioconductor operations, compatibility with MultiAssayExperiment and with Herve's current work providing on-disk and remote storage of SummarizedExperiment assays.

schifferl commented 6 years ago

Just having this thought, what if we call the class MicrobiomeExperiment and write it as an extension of SummarizedExperiment with the ability to handle tree structures?

lgeistlinger commented 6 years ago

I guess that's pretty much the way to go here. The main question is how to implement compatibility with tree structures and what can be recycled / adapted from the corresponding implementations in MRexperiment and phyloseq. Furthermore, there have been many thoughts on the bioc devel mailing list on naming conventions - with the conclusion that you don't want to limit applicability by naming. Thus, I guess it is worth considering a name that not automatically limit the scope of the class to microbiome analysis, but rather applies naturally to analysis of species that can be related to each other via a (phylogenetic) tree. An example would be an analysis of various yeast strains, where such a class might also be applicable. It is thus, in this regard, beneficial to choose general names such as SummarizedExperiment, MultiAssayExperiment and RaggedExperiment.

However, if it can be ruled out that such functionality might be relevant also outside of microbiome analysis, a specific name such as MicrobiomeExperiment should be chosen.