waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

check patient counts and overlaps for COAD #3

Closed lwaldron closed 3 years ago

lwaldron commented 7 years ago

Please check by hand that this is correct: RNASeqGene has only 10 samples for COAD? The assay also seems to have very few samples, or to be missing entirely, in other datasets.

> library(MultiAssayExperiment)
> library(RaggedExperiment)
> coad <- readRDS("coadMAEO.rds")
> coad <- updateObject(coad)
> experiments(coad)
ExperimentList class object of length 12: 
 [1] RNASeqGene: ExpressionSet with 20502 rows and 10 columns 
 [2] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 191 columns 
 [3] miRNASeqGene: ExpressionSet with 705 rows and 221 columns 
 [4] CNASNP: RaggedExperiment with 457535 rows and 914 columns 
 [5] CNVSNP: RaggedExperiment with 90062 rows and 914 columns 
 [6] CNAseq: RaggedExperiment with 40530 rows and 136 columns 
 [7] Methylation: SummarizedExperiment with 485577 rows and 333 columns 
 [8] mRNAArray: ExpressionSet with 17814 rows and 172 columns 
 [9] RPPAArray: ExpressionSet with 208 rows and 360 columns 
 [10] Mutations: RaggedExperiment with 62530 rows and 154 columns 
 [11] gistica: SummarizedExperiment with 24776 rows and 448 columns 
 [12] gistict: SummarizedExperiment with 24776 rows and 448 columns 
> upsetSamples(coad[, , c("CNASNP", "RNASeq2GeneNorm", "RNASeqGene")])

[Figure: upsetSamples plot of sample overlap among CNASNP, RNASeq2GeneNorm, and RNASeqGene for COAD]

lgeistlinger commented 6 years ago

Yes, I also see sample numbers that are too low and that disagree with the sample counts reported by Broad Firehose itself, e.g. for SARC:

http://firebrowse.org/?cohort=SARC&download_dialog=true

> x <- curatedTCGAData(diseaseCode = "SARC", assays = "RNASeqGene", FALSE)
Error in curatedTCGAData(diseaseCode = "SARC", assays = "RNASeqGene",  : 
  Cancer and data type combination(s) not available
> x <- curatedTCGAData(diseaseCode = "SARC", assays = "RNASeqGeneNorm", FALSE)
Error in .searchFromInputs(assays, assaysAvail) : 
  No matches found, modify search criteria

LiNk-NY commented 3 years ago

It looks like RNASeqGene was returning the Level_3__gene_expression__data files when we intended to provide Level_3__RSEM_genes__data. That accounts for the discrepancy.

Both RNASeqGene and RNASeq2Gene are now available as options in RTCGAToolbox and in curatedTCGAData with version = "2.0.0".

See data release version 2.0.0 in devel version 1.13.1 (commit 5da7bea5d6e2b316f50f8ba957d0dc98e0c596c8)
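
For anyone wanting to re-check the counts against the new release, here is a minimal sketch of how the two assays might be requested from curatedTCGAData. It assumes a Bioconductor installation with curatedTCGAData available and network access to ExperimentHub. The `"RNASeq*"` glob is an illustrative choice, not something taken from this thread, and the exact assay names it matches may differ between data releases.

```r
# Sketch only: needs Bioconductor's curatedTCGAData and network access.
library(curatedTCGAData)

# dry.run = TRUE lists the matching resources without downloading them.
# The "RNASeq*" pattern is an assumption meant to catch both the
# RNASeqGene and RNASeq2Gene families of assays.
available <- curatedTCGAData(
    diseaseCode = "COAD",
    assays = "RNASeq*",
    version = "2.0.0",
    dry.run = TRUE
)
available

# dry.run = FALSE downloads the data and returns a MultiAssayExperiment.
coad <- curatedTCGAData(
    diseaseCode = "COAD",
    assays = "RNASeq*",
    version = "2.0.0",
    dry.run = FALSE
)

# Per-assay dimensions, for comparison with the Firehose sample counts.
MultiAssayExperiment::experiments(coad)
```

Running the dry run first is a cheap way to confirm which assay names exist for a cohort before downloading anything.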

> getFirehoseData("COAD", RNASeqGene=TRUE, clinical = FALSE)
trying URL 'http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/COAD/20160128/gdac.broadinstitute.org_COAD.Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz'
Content type 'application/x-gzip' length 3593305 bytes (3.4 MB)
==================================================
downloaded 3.4 MB

gdac.broadinstitute.org_COAD.Merge_rnaseq__illuminaga_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0
RNAseq data will be imported! This may take a while!
Start: 2020-11-30 14:31:58
Done: 2020-11-30 14:31:58
COAD FirehoseData objectStandard run date: 20160128 
Analysis running date: NA 
Available data types: 
  RNASeqGene: A matrix of count or normalized data, dim:  20502 x 10 
To export data, use the 'getData' function.

##

> getFirehoseData("COAD", RNASeq2Gene=TRUE, clinical = FALSE)
trying URL 'http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/COAD/20160128/gdac.broadinstitute.org_COAD.Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz'
Content type 'application/x-gzip' length 49751986 bytes (47.4 MB)
==================================================
downloaded 47.4 MB

gdac.broadinstitute.org_COAD.Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0
RNAseq2 data will be imported! This may take a while!
Start: 2020-11-30 14:32:45
Done: 2020-11-30 14:32:47
Using locally cached version of /tmp/RtmpiJLElh/20160128-COAD-RNAseq2Gene.txt
RNAseq2 data will be imported! This may take a while!
Start: 2020-11-30 14:32:53
Done: 2020-11-30 14:32:55
COAD FirehoseData objectStandard run date: 20160128 
Analysis running date: NA 
Available data types: 
  RNASeq2Gene: A matrix of count or scaled estimate data, dim:  20501 x 191 
To export data, use the 'getData' function.

See #38 for details