Closed mherberg closed 3 years ago
Hi @mherberg
The TPM data should be in HNSC_RNASeqGene-20160128
. If that's not what you're looking
for, you may have to obtain the data directly from the GDAC Firehose.
You may be able to obtain that data from RTCGAToolbox
though I am not sure.
But once you get the data, you should be able to easily integrate it into MultiAssayExperiment
using the c
function.
Best, Marcel
Thanks. My understanding is this that there are two platforms that produce the following gene results:
RNASeq _illuminahiseq_rnaseq-geneexpression | Gene expression values
RNASeqV2 _illuminahiseq_rnaseqv2-RSEMgenes | Gene expression values estimated by RSEM algorithm _illuminahiseq_rnaseqv2-RSEM_genesnormalized | Normalized gene expression values (quantile)
When I download via curatedTCGAData the RNASeqGene always has less specimens. For example in HNSC or LUAD:
[10] LUAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 576 columns
[11] LUAD_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 162 columns
[9] HNSC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 566 columns
[10] HNSC_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 294 columns
So it appears the RNASeq pipeline has less specimens, or am I completely misunderstanding? Wouldn't the more useful non-quantiled data be the _illuminahiseq_rnaseqv2-RSEMgenes, which I assume would have the same number of specimens as the RNASeq2GeneNorm.
I'm happy to post this RTCGAToolbox to request the feature. Unfortunately it does not allow easy import of the _illuminahiseq_rnaseqv2-RSEMgenes data but I can edit the source code to produce.
Unfortunately this is a limit of RTCGAToolbox and will have to be changed in that source code.
Hi @mherberg
Can you please open an issue at https://github.com/mksamur/RTCGAToolbox/issues ?
If you have some code that provides more of the data, I'd be happy to integrate it into RTCGAToolbox
.
Thanks!
Best,
Marcel
Hi @mherberg
This is not a limit of RTCGAToolbox, I've checked the source code and it downloads the appropriate resource with the right number of samples R/getFirehoseData.R:Line#325
. The number of samples are the same.
This is how the data from the GDAC Firehose pipeline is served.
http://firebrowse.org/?cohort=HNSC&download_dialog=true
Best, Marcel
Are you sure. When I look at line 325:
links <- .getLinks("Level_3__gene_expression__data.Level_3","*.Merge_rnaseq__.*._rnaseq__.*.tar[.]gz$",NULL,doc)
It is pulling "illuminahiseq_rnaseq-gene_expression". It is NOT pulling "illuminahiseq_rnaseqv2-RSEM_genes" which is the RNASeqV2 equivalent.
The normalized data is RNASeqV2. See line 345:
plinks <- .getLinks("Level_3__RSEM_genes_normalized__data.Level_3","*.Merge_rnaseqv2__.*._rnaseqv2__.*.tar[.]gz$",NULL,doc)
The RNASeq data has different number of samples from the RNASeqV2 data for HNSC. You can look at the code yourself by importing the files (I used fread).
library(data.table)
rnaseq <- fread("HNSC.rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.data.txt")
rnaseqv2 <- fread("HNSC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt")
The structure of the rnaseq is 294 patients with three columns each raw_counts, median_length_normalized, and RPKM
The structure of rnaseqv2 is 556 patients with raw_count, scaled_estimate, and transcript_id. FYI scaled_estimate (multiplied by 1e6) is TPM.
Please re-review and let me know if you see the issue. It's a simple code fix, but then you'd probably need to adjust the MultiAssayExperiment.TCGA data so that curatedTCGA is correct. Currently the way it works now for HNSC (and I assume some other cancers) is that the raw count is from the RNASeq and the normalized data is from the RNASeqV2 pipeline.
I've modified RTCGAToolbox so that it can download RNASeqV2 data, both as a raw_count or as a scaled_estimate (which x 1e6 = TPM).
I will post an issue there and then can upload my code changes for review. We can keep this issue open in curatedTCGAData if you feel it is warranted to add RNASeqV2 raw_count and TPM data, which I think you will is very useful for researchers (more so than RNASeq raw_count).
Thanks Matt! @mherberg I am looking forward to your PR :)
Resolved in data release version 2.0.0
5da7bea5d6e2b316f50f8ba957d0dc98e0c596c8
Is there a function to download "rnaseqv2illuminahiseq_rnaseqv2__unc_eduLevel_3__RSEM_genes__data"?
For example when I run
I get back "HNSC_RNASeq2GeneNorm-20160128" and "HNSC_RNASeqGene-20160128". RNASeq2GeneNorm appears to be the quantile normalized RSEM. However RNASeqGene has less specimens/patients than the RNASeq2GeneNorm, so its not what I want.
Should it download or is there a way to download "RNASeq2Gene" (i.e. I believe "rnaseqv2illuminahiseq_rnaseqv2__unc_eduLevel_3__RSEM_genes__data") which I think may be the non-quantiled TPM data.
Thanks.