waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
70 stars 32 forks source link

Default sampleMap in MultiAssayExperiment() fails with unnamed samples #262

Closed LTLA closed 5 years ago

LTLA commented 5 years ago

Consider the following sequence of reasonable events.

library(MultiAssayExperiment)
se <- SummarizedExperiment(list(counts=matrix(rpois(1000, lambda=10), ncol=10)))
ncol(se)
## [1] 10
mae <- MultiAssayExperiment(list(genes=se))
ncol(mae[[1]])
## [1] 0

Where doth my columns go? It looks like I have to actually name them:

colnames(se) <- LETTERS[1:10]
mae3 <- MultiAssayExperiment(list(genes=se))
ncol(mae3[[1]])
## [1] 10

Well, okay, but it would be nice to have that either (i) handled internally somehow or (ii) have MultiAssayExperiment() spit out an error if there are no names.

Session info ``` R Under development (unstable) (2019-11-04 r77366) Platform: x86_64-apple-darwin17.7.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] MultiAssayExperiment_1.13.0 SummarizedExperiment_1.17.0 [3] DelayedArray_0.13.0 BiocParallel_1.21.0 [5] matrixStats_0.55.0 Biobase_2.47.0 [7] GenomicRanges_1.39.1 GenomeInfoDb_1.23.0 [9] IRanges_2.21.1 S4Vectors_0.25.0 [11] BiocGenerics_0.33.0 loaded via a namespace (and not attached): [1] lattice_0.20-38 bitops_1.0-6 grid_4.0.0 [4] zlibbioc_1.33.0 XVector_0.27.0 Matrix_1.2-17 [7] tools_4.0.0 RCurl_1.95-4.12 compiler_4.0.0 [10] GenomeInfoDbData_1.2.2 ```
lwaldron commented 5 years ago

With unnamed columns, the MAE class has no way to map ExperimentList columns back to colData rows. In most situations, the constructor throws a warning when it drops columns that can't be mapped to colData, e.g.:

suppressPackageStartupMessages(library(MultiAssayExperiment))
se2 <- SummarizedExperiment(list(counts=matrix(rpois(1000, lambda=10), ncol=10)))
colnames(se2) <- LETTERS[1:10]
cd2 <- DataFrame(pheno=1:9, row.names = LETTERS[1:9])
mae2 <- MultiAssayExperiment(experiments=list(genes=se2), colData = cd2)
#> Warning in .sampleMapFromData(colData, experiments): Data from rows:
#>  NA - J
#> dropped due to missing phenotype data
ncol(se2)
#> [1] 10
ncol(mae2[[1]])
#> [1] 9
Session info ``` r sessionInfo() #> R version 3.6.1 (2019-07-05) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Debian GNU/Linux 9 (stretch) #> #> Matrix products: default #> BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] parallel stats4 stats graphics grDevices utils datasets #> [8] methods base #> #> other attached packages: #> [1] MultiAssayExperiment_1.12.0 SummarizedExperiment_1.16.0 #> [3] DelayedArray_0.12.0 BiocParallel_1.20.0 #> [5] matrixStats_0.55.0 Biobase_2.46.0 #> [7] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 #> [9] IRanges_2.20.0 S4Vectors_0.24.0 #> [11] BiocGenerics_0.32.0 #> #> loaded via a namespace (and not attached): #> [1] Rcpp_1.0.3 knitr_1.26 XVector_0.26.0 #> [4] magrittr_1.5 zlibbioc_1.32.0 lattice_0.20-38 #> [7] rlang_0.4.1 stringr_1.4.0 highr_0.8 #> [10] tools_3.6.1 grid_3.6.1 xfun_0.11 #> [13] htmltools_0.4.0 yaml_2.2.0 digest_0.6.22 #> [16] Matrix_1.2-17 GenomeInfoDbData_1.2.2 bitops_1.0-6 #> [19] RCurl_1.95-4.12 evaluate_0.14 rmarkdown_1.17 #> [22] stringi_1.4.3 compiler_3.6.1 ```

But you seem to have found a corner case where it does not...

I think the current behavior is correct, given that the warning message is added. For the constructor to handle it internally would require making up column names for this SummarizedExperiment, which might be too hands-on and potentially create gotchas - e.g. what if you had specified colData, and it made up colnames that happened to match rownames of the specified colData? For that reason, unless I hear a better counter-argument, I would stick with the remove-and-warn approach.

LTLA commented 5 years ago

I'd be happy with a noisy warning.