waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
70 stars 32 forks source link

getWithColData doesn't handle 1:many mappings #272

Closed LTLA closed 4 years ago

LTLA commented 4 years ago
library(SummarizedExperiment)

### EXPERIMENT 1 ###
Y1 <- matrix(rnorm(200), ncol=10)
colnames(Y1) <- c("A.1", "A.2", 
    "B.1", "B.2", "B.3", 
    "C.1", "C.2",
    "D.1",
    "E.1", "E.2")

# Making up some sample-level metadata, for some variety.
df <- DataFrame(RIN=runif(10), sizeFactor=runif(10))
rownames(df) <- colnames(Y1)
se1 <- SummarizedExperiment(list(counts=Y1), colData=df)

### EXPERIMENT 2 ###
Y2 <- matrix(rnorm(200), ncol=5)
colnames(Y2) <- c("B", "C", "D", "E", "F")

# Making up some more sample-level metadata.
df <- DataFrame(FrIP=runif(5), NumPeaks=sample(1000, 5))
rownames(df) <- colnames(Y2)
se2 <- SummarizedExperiment(list(counts=Y2), colData=df)

### MAE STUFF ###
sampleMap <- rbind(
    DataFrame(assay="rnaseq", 
        primary=sub("\\..*", "", colnames(se1)), 
        colname=colnames(se1)
    ),
    DataFrame(assay="chipseq", 
        primary=sub("\\..*", "", colnames(se2)), 
        colname=colnames(se2)
    )
)

colData <- DataFrame(
    row.names=LETTERS[1:6], 
    Sex=sample(c("M", "F"), 6, replace=TRUE),
    Age=sample(10:50, 6)
)

library(MultiAssayExperiment)
mae <- MultiAssayExperiment(list(rnaseq=se1, chipseq=se2),
    sampleMap=sampleMap, colData=colData)

Now I just want to be able to extract SEs while propagating colData from the MAEs. But:

getWithColData(mae, 1)
## harmonizing input:
##   removing 5 sampleMap rows not in names(experiments)
##   removing 1 colData rownames not in sampleMap 'primary'
## Error: Extracted class does not support 'colData<-':
##     nrow of supplied 'colData' must equal ncol of object

Looking inside getWithColData, it seems that the function doesn't even use the sampleMap information, which kind of defeats the purpose of using the MAE to store complex things.

I'd also like the option to append the MAE-level colData to the SE's existing colData rather than overwriting it. In fact, I daresay that ought to be the default setting, why discard information?

Session information ``` R Under development (unstable) (2020-03-11 r77927) Platform: x86_64-apple-darwin17.7.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] MultiAssayExperiment_1.13.21 SummarizedExperiment_1.17.5 [3] DelayedArray_0.13.12 matrixStats_0.56.0 [5] Biobase_2.47.3 GenomicRanges_1.39.3 [7] GenomeInfoDb_1.23.17 IRanges_2.21.8 [9] S4Vectors_0.25.15 BiocGenerics_0.33.3 loaded via a namespace (and not attached): [1] lattice_0.20-41 bitops_1.0-6 grid_4.0.0 [4] zlibbioc_1.33.1 XVector_0.27.2 Matrix_1.2-18 [7] tools_4.0.0 RCurl_1.98-1.1 compiler_4.0.0 [10] GenomeInfoDbData_1.2.3 ```
LTLA commented 4 years ago

FYI, this is what I did. Plug and play into getWithColData:

            # Avoid dealing with int/char differences.
            mae0 <- mae[,,experiment]

            sm <- sampleMap(mae0)
            mae.cn <- sm$primary[match(colnames(se), sm$colname)]
            expanded <- colData(mae0)[mae.cn,,drop=FALSE]

            # Prune out duplicate entries.
            existing <- colData(se)
            common <- intersect(colnames(existing), colnames(expanded))
            for (j in common) {
                if (!identical(existing[[j]], expanded[[j]])) {
                    warning(sprintf("non-identical '%s' field in both SE and MAE-level colData", j))
                }
            }

            leftovers <- expanded[,setdiff(colnames(expanded), common),drop=FALSE]
            colData(se) <- cbind(existing, leftovers)
LiNk-NY commented 4 years ago

Hi Aaron, @LTLA

It does use sampleMap information during subsetting.

I'm not sure what the rationale is for duplicating the colData information in the expanded DataFrame.

sm <- sampleMap(mae0)
- mae.cn <- sm$primary[match(colnames(se), sm$colname)]
- expanded <- colData(mae0)[mae.cn,,drop=FALSE]
+ expanded <- colData(mae0)[sm$primary, , drop = FALSE]
> colData(mae0)[sm$primary, , drop = FALSE]
DataFrame with 10 rows and 2 columns
          Sex       Age
  <character> <integer>
A           M        42
A           M        42
B           F        16
B           F        16
B           F        16
C           M        12
C           M        12
D           F        23
E           M        28
E           M        28

instead of

> colData(mae[, , 1L])
harmonizing input:
  removing 5 sampleMap rows not in names(experiments)
  removing 1 colData rownames not in sampleMap 'primary'
DataFrame with 5 rows and 2 columns
          Sex       Age
  <character> <integer>
A           M        42
B           F        16
C           M        12
D           F        23
E           M        28

Perhaps because the input contains technical replicates?

> replicated(mae)
$rnaseq
LogicalList of length 5
[["A"]] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[["B"]] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[["C"]] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[["D"]] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[["E"]] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE

$chipseq
LogicalList of length 5
[["B"]] FALSE FALSE FALSE FALSE FALSE
[["C"]] FALSE FALSE FALSE FALSE FALSE
[["D"]] FALSE FALSE FALSE FALSE FALSE
[["E"]] FALSE FALSE FALSE FALSE FALSE
[["F"]] FALSE FALSE FALSE FALSE FALSE

The function would work if those were merged or separated into different assays before extracting the colData. I suppose we could add a function like separateReplicates. I can also add a check to the function when the input contains replicates.

Best, Marcel cc: @lwaldron

LTLA commented 4 years ago

I'm not arguing about the subsetting, that works fine. I'm talking about the behavior of getWithColData() and the propagation of MAE-level colData down to the individual SE's. As the MWE in my original post indicates, that function fails in the case of a non-1:1: mapping.