waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData
41 stars 7 forks source link

Download counts (TPM) data? #38

Closed mherberg closed 3 years ago

mherberg commented 3 years ago

Is there a function to download "rnaseqv2illuminahiseq_rnaseqv2__unc_eduLevel_3__RSEM_genes__data"?

For example when I run

curatedTCGAData(diseaseCode = "HNSC", assays = "RNASeq*", dry.run = TRUE)

I get back "HNSC_RNASeq2GeneNorm-20160128" and "HNSC_RNASeqGene-20160128". RNASeq2GeneNorm appears to be the quantile normalized RSEM. However RNASeqGene has less specimens/patients than the RNASeq2GeneNorm, so its not what I want.

Should it download or is there a way to download "RNASeq2Gene" (i.e. I believe "rnaseqv2illuminahiseq_rnaseqv2__unc_eduLevel_3__RSEM_genes__data") which I think may be the non-quantiled TPM data.

Thanks.

LiNk-NY commented 3 years ago

Hi @mherberg

The TPM data should be in HNSC_RNASeqGene-20160128. If that's not what you're looking for, you may have to obtain the data directly from the GDAC Firehose. You may be able to obtain that data from RTCGAToolbox though I am not sure. But once you get the data, you should be able to easily integrate it into MultiAssayExperiment using the c function.

Best, Marcel

mherberg commented 3 years ago

Thanks. My understanding is this that there are two platforms that produce the following gene results:

RNASeq _illuminahiseq_rnaseq-geneexpression | Gene expression values

RNASeqV2 _illuminahiseq_rnaseqv2-RSEMgenes | Gene expression values estimated by RSEM algorithm _illuminahiseq_rnaseqv2-RSEM_genesnormalized | Normalized gene expression values (quantile)

When I download via curatedTCGAData the RNASeqGene always has less specimens. For example in HNSC or LUAD:

[10] LUAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 576 columns
[11] LUAD_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 162 columns

[9] HNSC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 566 columns
[10] HNSC_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 294 columns

So it appears the RNASeq pipeline has less specimens, or am I completely misunderstanding? Wouldn't the more useful non-quantiled data be the _illuminahiseq_rnaseqv2-RSEMgenes, which I assume would have the same number of specimens as the RNASeq2GeneNorm.

I'm happy to post this RTCGAToolbox to request the feature. Unfortunately it does not allow easy import of the _illuminahiseq_rnaseqv2-RSEMgenes data but I can edit the source code to produce.

mherberg commented 3 years ago

Unfortunately this is a limit of RTCGAToolbox and will have to be changed in that source code.

LiNk-NY commented 3 years ago

Hi @mherberg Can you please open an issue at https://github.com/mksamur/RTCGAToolbox/issues ? If you have some code that provides more of the data, I'd be happy to integrate it into RTCGAToolbox. Thanks! Best, Marcel

LiNk-NY commented 3 years ago

Hi @mherberg

This is not a limit of RTCGAToolbox, I've checked the source code and it downloads the appropriate resource with the right number of samples R/getFirehoseData.R:Line#325. The number of samples are the same.

This is how the data from the GDAC Firehose pipeline is served. http://firebrowse.org/?cohort=HNSC&download_dialog=true image

Best, Marcel

mherberg commented 3 years ago

Are you sure. When I look at line 325: links <- .getLinks("Level_3__gene_expression__data.Level_3","*.Merge_rnaseq__.*._rnaseq__.*.tar[.]gz$",NULL,doc)

It is pulling "illuminahiseq_rnaseq-gene_expression". It is NOT pulling "illuminahiseq_rnaseqv2-RSEM_genes" which is the RNASeqV2 equivalent.

The normalized data is RNASeqV2. See line 345: plinks <- .getLinks("Level_3__RSEM_genes_normalized__data.Level_3","*.Merge_rnaseqv2__.*._rnaseqv2__.*.tar[.]gz$",NULL,doc)

The RNASeq data has different number of samples from the RNASeqV2 data for HNSC. You can look at the code yourself by importing the files (I used fread).

library(data.table)
rnaseq <- fread("HNSC.rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.data.txt")
rnaseqv2 <- fread("HNSC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt")

The structure of the rnaseq is 294 patients with three columns each raw_counts, median_length_normalized, and RPKM

The structure of rnaseqv2 is 556 patients with raw_count, scaled_estimate, and transcript_id. FYI scaled_estimate (multiplied by 1e6) is TPM.

Please re-review and let me know if you see the issue. It's a simple code fix, but then you'd probably need to adjust the MultiAssayExperiment.TCGA data so that curatedTCGA is correct. Currently the way it works now for HNSC (and I assume some other cancers) is that the raw count is from the RNASeq and the normalized data is from the RNASeqV2 pipeline.

mherberg commented 3 years ago

I've modified RTCGAToolbox so that it can download RNASeqV2 data, both as a raw_count or as a scaled_estimate (which x 1e6 = TPM).

I will post an issue there and then can upload my code changes for review. We can keep this issue open in curatedTCGAData if you feel it is warranted to add RNASeqV2 raw_count and TPM data, which I think you will is very useful for researchers (more so than RNASeq raw_count).

LiNk-NY commented 3 years ago

Thanks Matt! @mherberg I am looking forward to your PR :)

LiNk-NY commented 3 years ago

Resolved in data release version 2.0.0 5da7bea5d6e2b316f50f8ba957d0dc98e0c596c8