waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

miRNA is count data and not log2 expression values? #53

Closed bblodfon closed 1 year ago

bblodfon commented 1 year ago

Hi,

In the paper and in the documentation, the miRNA data format is described as log2 RPM miRNA expression values. I looked at some of the data (see the code below), and the values seem to be some form of counts, so not log2?

library(curatedTCGAData)

d = curatedTCGAData(diseaseCode = 'BRCA', assays = '*miRNASeq*', 
  version = '2.0.1', dry.run = FALSE)
#> snapshotDate(): 2022-10-31
#> Working on: BRCA_miRNASeqGene-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_colData-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_metadata-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> Working on: BRCA_sampleMap-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> loading from cache
#> harmonizing input:
#>   removing 14736 sampleMap rows not in names(experiments)
#>   removing 342 colData rownames not in sampleMap 'primary'
d
#> A MultiAssayExperiment object of 1 listed
#>  experiment with a user-defined name and respective class.
#>  Containing an ExperimentList class object of length 1:
#>  [1] BRCA_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 849 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

# looks like count data not log2 values?
assay(d)[1:4, 1:4]
#>              TCGA-3C-AAAU-01A-11R-A41G-13 TCGA-3C-AALI-01A-11R-A41G-13
#> hsa-let-7a-1                        95618                        49201
#> hsa-let-7a-2                       189674                        98691
#> hsa-let-7a-3                        96815                        49035
#> hsa-let-7b                         264034                       148591
#>              TCGA-3C-AALJ-01A-31R-A41G-13 TCGA-3C-AALK-01A-11R-A41G-13
#> hsa-let-7a-1                        75342                        57278
#> hsa-let-7a-2                       150472                       114320
#> hsa-let-7a-3                        76206                        57540
#> hsa-let-7b                          99938                       164553

hist(assay(d)[1,])

hist(assay(d)[10,])

Created on 2023-03-24 with reprex v2.0.2
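For readers who want to repeat this check without downloading anything, here is a minimal base R sketch. It copies the sixteen values printed above into a toy matrix, tests whether they are non-negative whole numbers, and shows one possible log2 RPM conversion. The helper names `looks_like_counts` and `log2_rpm` and the pseudocount of 1 are assumptions made for illustration; they are not part of curatedTCGAData and may not match what the package pipeline does. The library size here is the sum of only four miRNAs, so the RPM values are not meaningful on their own.

```r
# Toy matrix copied from the first four rows and columns printed in the reprex.
m <- matrix(
  c(95618, 189674, 96815, 264034,
    49201,  98691, 49035, 148591,
    75342, 150472, 76206,  99938,
    57278, 114320, 57540, 164553),
  nrow = 4,
  dimnames = list(
    c("hsa-let-7a-1", "hsa-let-7a-2", "hsa-let-7a-3", "hsa-let-7b"),
    c("TCGA-3C-AAAU-01A-11R-A41G-13", "TCGA-3C-AALI-01A-11R-A41G-13",
      "TCGA-3C-AALJ-01A-31R-A41G-13", "TCGA-3C-AALK-01A-11R-A41G-13")
  )
)

# Hypothetical helper: TRUE when every finite value is a non-negative whole number.
looks_like_counts <- function(x) {
  v <- x[is.finite(x)]
  length(v) > 0 && all(v >= 0) && all(v == round(v))
}

# Hypothetical helper: per-sample reads per million, then log2 with a pseudocount.
# The pseudocount is an assumption; the actual pipeline may use a different one.
log2_rpm <- function(x, pseudocount = 1) {
  lib_size <- colSums(x, na.rm = TRUE)
  rpm <- sweep(x, 2, lib_size, "/") * 1e6
  log2(rpm + pseudocount)
}

looks_like_counts(m)            # whole numbers, as in the output above
looks_like_counts(log2_rpm(m))  # the transformed values are no longer whole numbers
range(log2_rpm(m))              # log2 RPM cannot exceed log2(1e6 + 1)
```

On a full assay the same helpers could be applied to `assay(d)`. This is only a heuristic: integer-valued data suggests counts but does not prove how the values were produced.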

bblodfon commented 1 year ago

I think better documentation would help users know exactly what each dataset contains. Another example is mRNAArray, which is described as "Unified gene-level mRNA expression values". Checking that data type for BRCA, I see expression data normalized to 1 standard deviation. Is that what "unified" refers to? Are these data log2-transformed?
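A rough way to check the kind of scaling described here is to look at the spread of the values along each dimension. The sketch below uses simulated data, not the real mRNAArray assay, and the helper `describe_scale` is a hypothetical name introduced for illustration. It assumes genes are in rows and samples in columns, and it reports both row-wise and column-wise standard deviations because it is not known from this thread along which dimension the data were scaled.

```r
set.seed(1)

# Simulated stand-in for an expression matrix (20 genes in rows, 10 samples in columns).
x <- matrix(rnorm(200, mean = 8, sd = 2), nrow = 20)

# Scale each gene (row) to mean 0 and standard deviation 1, for illustration only.
z <- t(scale(t(x)))

# Hypothetical helper: summarise the spread and sign of the values.
describe_scale <- function(m) {
  list(
    row_sd_range = range(apply(m, 1, sd, na.rm = TRUE)),
    col_sd_range = range(apply(m, 2, sd, na.rm = TRUE)),
    has_negative = any(m < 0, na.rm = TRUE)
  )
}

describe_scale(x)  # unscaled: row SDs vary, no centring
describe_scale(z)  # row-scaled: every row SD equals 1 and negative values appear
```

Row or column standard deviations that all sit at 1 point to standardization along that dimension. Negative values rule out raw counts, but they do not by themselves show whether a log2 transform was applied first.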

LiNk-NY commented 1 year ago

Hi John (@bblodfon), thanks for reporting. Good catch; we are looking at a way to provide the log2 RPM miRNA values through the pipeline. Best, Marcel

bblodfon commented 1 year ago

Hi @LiNk-NY,

I also checked the RNASeq2GeneNorm data, which in the documentation are described as "Upper quartile normalized RSEM TPM gene expression values" but they are count data as well. Could you also have a look at that?

LiNk-NY commented 1 year ago

Thanks, we are in the upload stage of the process. We will have the data available shortly via curatedTCGAData in devel. I will update when that change is ready.

LiNk-NY commented 1 year ago

This should be resolved in data version 2.1.0 or higher (package version 1.23.5).