waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/

Simpler Extraction for Matching Clinical Data #216

Closed DarioS closed 6 years ago

DarioS commented 7 years ago

The use case of extracting a subset of the clinical data based on one of the experiments seems to require a somewhat lengthy command.

# measurements is a MultiAssayExperiment object.
clinicalOrdered <- colData(measurements[, , "Protein"])[colnames(assay(measurements[, , "Protein"])), ]

colData has a ... argument which is not currently used. Could there be a more concise way to do this, such as colData(measurements, orderBy = "Protein")?

lwaldron commented 7 years ago

Dario's example would get even more complicated if the Protein experiment used a different naming convention than the colData, or if it had missing or replicate columns. I think a more general implementation of this request would be an endomorphic sorting function, something like:

> sortBy <- function(mae, experiment){
+   df <- sampleMap(mae)
+   df <- df[df$assay == experiment, ]
+   df <- df[match(colnames(mae)[[experiment]], df$colname), ] #will drop any colData without columns in this experiment
+   neworder <- unique(df$primary)
+   mae[, neworder, ]
+ }
> example("MultiAssayExperiment")
> mymae <- myMultiAssayExperiment
## Change up the order of the Affy experiment
> experiments(mymae)[[1]] <- experiments(mymae)[[1]][, 4:1]
> res <- sortBy(mymae, "Affy")
> colnames(mymae)
CharacterList of length 3
[["Affy"]] array4 array3 array2 array1
[["Methyl450k"]] methyl1 methyl2 methyl3 methyl4 methyl5
[["RNASeqGene"]] samparray1 samparray2 samparray3 samparray4
> colnames(res)
CharacterList of length 3
[["Affy"]] array4 array3 array2 array1
[["Methyl450k"]] methyl5 methyl4 methyl3 methyl1 methyl2
[["RNASeqGene"]] samparray3 samparray4 samparray2 samparray1
> colData(mymae)
DataFrame with 4 rows and 2 columns
             sex       age
        <factor> <integer>
Jack           M        38
Jill           F        39
Bob            M        40
Barbara        F        41
> colData(res)
DataFrame with 4 rows and 2 columns
             sex       age
        <factor> <integer>
Bob            M        40
Barbara        F        41
Jill           F        39
Jack           M        38
> 
lwaldron commented 7 years ago

Dario, what is a use case for wanting this, to help us prioritize? It seems pretty obscure to me, given that intersectColumns() already aligns all experiment columns with the colData and each other, albeit based on the order of rows in the colData. Why would you want to sort based on the order in an experiment?
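
For readers unfamiliar with intersectColumns(), here is a minimal sketch of what it does. It assumes the MultiAssayExperiment package is installed and uses its bundled miniACC example dataset, which is not part of this thread. The variable names are illustrative only.

```r
library(MultiAssayExperiment)

# miniACC is an example dataset shipped with the package (an assumption of
# this sketch; the thread itself uses a different example object).
data("miniACC")

# Keep only the patients that have data in every experiment.
complete <- intersectColumns(miniACC)

# Number of columns per experiment before and after subsetting.
before <- lengths(colnames(miniACC))
after <- lengths(colnames(complete))
print(rbind(before, after))

# Check whether the first experiment's columns follow the colData row order,
# going through the sampleMap so that differing naming conventions are handled.
first <- names(complete)[1]
smap <- sampleMap(complete)
smap <- smap[smap$assay == first, ]
primary_in_assay_order <-
  as.character(smap$primary)[match(colnames(complete)[[first]], smap$colname)]
print(identical(primary_in_assay_order, rownames(colData(complete))))
```

The final check maps column names to patient identifiers through the sampleMap instead of comparing the names directly, since experiments may not share the colData naming convention.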

DarioS commented 7 years ago

Say you had a couple of different assays, mRNA and protein. You might like to consider each one separately and build a patient survival model on them, to see which one explains the survival times better. Then, it'd be nice to extract one of the assays out of the set and have the clinical data automatically sorted in the same order to build the survival model with.

lwaldron commented 7 years ago

I agree, though this is how I would do a task like the one you described:

mae <- intersectColumns(mae)  #complete cases only since we'll be comparing data types
mae1 <- mae[, , 1] #let's start with the first assay
df <- wideFormat(mae1, colDataCols = c("time", "cens") )
## pseudocode
model1 <- glmnet(Surv(time, cens) ~ . - primary)
preds1 <- predict(model1)
## real code again
## I think this next line can be made obsolete with a feature request:
preds1 <- preds1[match(rownames(colData(mae)), names(preds1))]
mae$preds1 <- preds1 #add a new column to colData(mae)
## loop/repeat etc.

@LiNk-NY do you see any problem with sorting the rows of wideFormat() output so that the "primary" column is in the same order as the rownames of colData? This would be my intuitive expectation. Then the match() line above would be unnecessary (as long as the full MAE has the same , and this would align with my expectation that wideFormat() doesn't re-arrange the order of the patients.

As for why I would recommend doing the analysis this way: we designed the package around the idea that organism- or tissue-level information that applies to all the assays belongs in colData, not with the individual assays. wideFormat() and longFormat() already integrate colData with one or more assays, with correct alignment, in all kinds of complicated situations. The complication above lies in putting results back in the MAE, and I think we can eliminate it with a patch to the row sorting of the wideFormat() output. wideFormat() ensures the experimental data are aligned with the survival data in a variables-as-columns data frame, as used by glmnet, penalized, and, I believe, most ML packages, and it doesn't depend on the "physical" layout of each assay.
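
The realignment step that the match() line performs can be illustrated in base R alone. This is a minimal sketch with made-up toy values: the clinical table, the prediction values, and the variable names are hypothetical stand-ins for colData(mae) and the model output, not data from this thread.

```r
# Hypothetical stand-in for colData(mae): one row per patient.
clinical <- data.frame(
  time = c(12, 30, 7, 22),
  cens = c(1, 0, 1, 0),
  row.names = c("Jack", "Jill", "Bob", "Barbara")
)

# Hypothetical predictions, named by patient but in a different order,
# as could happen if the model input was not sorted like the clinical table.
preds1 <- c(Bob = 0.8, Barbara = 0.1, Jill = 0.4, Jack = 0.6)

# match() returns, for each clinical row name, its position in names(preds1),
# so indexing by it reorders the predictions to the clinical row order.
aligned <- preds1[match(rownames(clinical), names(preds1))]

# Guard against silent misalignment before attaching the new column.
stopifnot(identical(names(aligned), rownames(clinical)))
clinical$preds1 <- unname(aligned)
```

If the model input were already sorted like the clinical table, as proposed for the wideFormat() output, the match() call would be a no-op and could be dropped.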

lwaldron commented 7 years ago

So just to clarify, I agree that this is a reasonable use case, and I think there is something we can do to simplify it, it's just not in the way you proposed...