waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

high-level binding of colData? #42

Closed vjcitn closed 3 years ago

vjcitn commented 3 years ago

I was surprised that when I subset an MAE to the RNASeq2GeneNorm assay, the colData is empty. The package vignette should cover how to properly bind the colData and filter to primary tumor samples. I could attempt a PR to address this if it sounds appropriate.

vjcitn commented 3 years ago

It looks like the colData field for BRCA that matches the RNASeq2GeneNorm colnames is

"patient.samples.sample.portions.portion.analytes.analyte.2.aliquots.aliquot.2.bcr_aliquot_barcode"

LiNk-NY commented 3 years ago

Thanks Vince @vjcitn, I've added this in 8e474ad. I am not sure what you mean when you talk about the colnames. Did you want them to be matched in the colData? This can be taken care of by the user. The current operation takes the entirety of the colData in the MAE and appends it to the colData of the extracted object.

vjcitn commented 3 years ago

I guess what is surprising to me may be shown in the following. I add some comments on the right -- maybe there are methods I don't know about?

> suppressMessages({x = curatedTCGAData("BRCA", "RNASeq2GeneNorm", dry=FALSE)})
> rnaseq = experiments(x)[[1]]
> dim(colData(rnaseq))   ### so the colData need to be assigned somehow
[1] 1212    0
> dim(colData(x))   ### the MAE only has 1093 participants ... OK, some RNA-seq samples are normal 
[1] 1093 2684
> colnames(rnaseq)[1:3]
[1] "TCGA-3C-AAAU-01A-11R-A41B-07" "TCGA-3C-AALI-01A-11R-A41B-07"
[3] "TCGA-3C-AALJ-01A-31R-A41B-07"
> rownames(colData(x))[1:3]                                 ### the user has to substring and check sample type?
[1] "TCGA-A1-A0SB" "TCGA-A1-A0SD" "TCGA-A1-A0SE"

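The "substring and check sample type" step asked about in the last comment can be sketched in base R. This is only an illustration of the TCGA barcode layout (characters 1-12 identify the participant, characters 14-15 hold the sample type code, with "01" meaning primary solid tumor and "11" solid tissue normal); the fourth barcode is a hypothetical placeholder added to show the filter, not a real sample.

```r
# First three barcodes are the RNA-seq colnames printed above;
# the last one is a HYPOTHETICAL normal-tissue barcode for illustration.
barcodes <- c(
    "TCGA-3C-AAAU-01A-11R-A41B-07",
    "TCGA-3C-AALI-01A-11R-A41B-07",
    "TCGA-3C-AALJ-01A-31R-A41B-07",
    "TCGA-XX-0000-11A-11R-0000-07"
)

# Characters 1-12: participant ID, comparable to rownames(colData(x))
patient_id <- substr(barcodes, 1, 12)

# Characters 14-15: sample type code ("01" = primary solid tumor)
sample_code <- substr(barcodes, 14, 15)

# Keep only primary tumor samples
is_primary <- sample_code == "01"
patient_id[is_primary]
```

TCGAutils (already attached in this session) provides helpers such as TCGAbarcode() and TCGAsampleSelect() that should cover the same parsing without hand-written substr() calls; check their help pages for the exact arguments.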
> sessionInfo()
R version 4.0.2 Patched (2020-07-19 r78892)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS (fossa-melisa X20)

Matrix products: default
BLAS:   /home/stvjc/R-4-0-dist/lib/R/lib/libRblas.so
LAPACK: /home/stvjc/R-4-0-dist/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TCGAutils_1.10.0            curatedTCGAData_1.12.0     
 [3] MultiAssayExperiment_1.16.0 SummarizedExperiment_1.20.0

LiNk-NY commented 3 years ago

Hi Vince, @vjcitn

Please use MultiAssayExperiment::getWithColData. There may be some repeated columns in both the MAE-level and assay-level colData objects. Conflicts will produce a warning (as seen below).

suppressPackageStartupMessages({
    library(curatedTCGAData)
})
brca <- curatedTCGAData(
    "BRCA", "RNASeq2GeneNorm", dry=FALSE, version = "2.0.0"
)
#> snapshotDate(): 2020-11-25
#> Working on: BRCA_RNASeq2GeneNorm-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_colData-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_metadata-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_sampleMap-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> harmonizing input:
#>   removing 14373 sampleMap rows not in names(experiments)
#>   removing 5 colData rownames not in sampleMap 'primary'
getWithColData(brca, "BRCA_RNASeq2GeneNorm-20160128")
#> Warning: Duplicating colData rows due to replicates in 'replicated(x)'
#> class: SummarizedExperiment 
#> dim: 20501 1212 
#> metadata(3): filename build platform
#> assays(1): ''
#> rownames(20501): A1BG A1CF ... psiTPTE22 tAKR
#> rowData names(0):
#> colnames(1212): TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
#> colData names(2684): patientID years_to_birth ...
#>   Integrated.Clusters..unsup.exp. X60.Gene.classifier.Class.Assignment

Created on 2020-11-30 by the reprex package (v0.3.0)

I also want to note that version 2.0.0 includes various improvements to the data provided. See the NEWS.md file for details.
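
For the other half of the original request (filtering to primary tumor samples), one possible sketch combines the getWithColData call above with TCGAutils::TCGAprimaryTumors. It assumes a recent TCGAutils is installed and that the assay name is the one printed in the reprex above; it downloads data on first use and has not been run as part of this thread.

```r
library(curatedTCGAData)
library(TCGAutils)

# Same download as in the reprex above
brca <- curatedTCGAData(
    "BRCA", "RNASeq2GeneNorm", dry.run = FALSE, version = "2.0.0"
)

# Restrict the whole MAE to primary tumor samples first
# (assumes TCGAprimaryTumors() is available in the installed TCGAutils)
primary <- TCGAprimaryTumors(brca)

# Then extract the assay with the MAE-level colData attached
rnaseq <- getWithColData(primary, "BRCA_RNASeq2GeneNorm-20160128")

# The extracted object should now carry the clinical columns
dim(colData(rnaseq))
```

Filtering before extraction keeps the sample selection and the colData binding in one place, so the user does not need to parse barcodes by hand.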

Thanks

vjcitn commented 3 years ago

Super, thank you!