lawrenson-lab / CaCTS

TCGA data/code availability #4

Open j-andrews7 opened 3 years ago

j-andrews7 commented 3 years ago

What are the chances the TCGA data used in the README or the code used to collect it from TCGAbiolinks could be made available?

mabraao commented 2 years ago

Hello @j-andrews7, sorry, I received the dataset already normalized and annotated with the subtypes. https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/subtypes.html

Please check the file SuppTable1-34-TCGAID.txt in the data folder of this repository to see if it helps you retrieve the subtypes.

Thanks, and sorry for the delay in answering.

yangmqglobe commented 2 years ago

Similar question: it seems like none of the supplementary tables are available on the article web page. Did I miss something?

rashindrie commented 2 years ago

Hi,

I'm having trouble creating the TCGA.RNA.Rda object. I downloaded all the data from the site and am trying to combine them into one Rda object, but noticed that the symbol column is missing.

The SummarizedExperiment object downloaded from here does not actually contain a "gene symbol" column.

library(TCGAbiolinks)

## Gene expression aligned against hg38
query <- GDCquery(
  project = "TCGA-GBM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "HTSeq - FPKM-UQ",
  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
)
GDCdownload(query)          # download the matching files
data <- GDCprepare(query)   # build the SummarizedExperiment
data                        # print the object summary

class: RangedSummarizedExperiment 
dim: 56602 2 
metadata(1): data_release
assays(1): HTSeq - FPKM-UQ
rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
  ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
  original_ensembl_gene_id
colnames(2): TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
colData names(105): barcode patient ...
  paper_Telomere.length.estimate.in.blood.normal..Kb.
  paper_Telomere.length.estimate.in.tumor..Kb.

The site does say the following, so I think the symbol information is no longer available: "Unfortunately, some of the updates changes/remove gene symbols, change coordinates, etc. Which might introduce some loss of data. For example, if the gene was removed we cannot map it anymore and that information will be lost in the SummarizedExperiment."
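A possible workaround, sketched under assumptions: the `external_gene_name` column listed under `rowData names` in the output above normally carries the gene symbols, so it may be enough to copy it into a symbol column. The toy object below stands in for the result of `GDCprepare(query)`. Its expression values are made up, and the column name `symbol` is hypothetical, since the thread does not say what name the CaCTS code expects.

```r
library(SummarizedExperiment)

# Toy stand-in for the object returned by GDCprepare(query).
# Expression values are invented for illustration only.
expr <- matrix(
  c(1.5, 2.0, 0.0, 3.2),
  nrow = 2,
  dimnames = list(
    c("ENSG00000000003", "ENSG00000000005"),
    c("sampleA", "sampleB")
  )
)
genes <- DataFrame(
  ensembl_gene_id    = rownames(expr),
  external_gene_name = c("TSPAN6", "TNMD")
)
data <- SummarizedExperiment(assays = list(fpkm_uq = expr), rowData = genes)

# Copy the symbols into a column called "symbol" (hypothetical name).
rowData(data)$symbol <- rowData(data)$external_gene_name

# Build an expression matrix keyed by symbol, dropping genes without one.
keep <- !is.na(rowData(data)$symbol) & rowData(data)$symbol != ""
mat <- assay(data)[keep, , drop = FALSE]
rownames(mat) <- rowData(data)$symbol[keep]
```

Several Ensembl IDs can share one symbol, so a real matrix built this way may contain duplicated row names that need to be collapsed or filtered before use.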

Would you be able to give us directions on how to create the TCGA.RNA.Rda object?

Thanks, Rashindrie

mabraao commented 2 years ago

Hello Rashindrie, thanks for bringing it to my attention.

You can download the R object containing the RNA-seq expression and sample annotation from the following link: https://cedars.app.box.com/v/RNA-TCGA-Pancancer

Let me know if it works!
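Once the file is downloaded, a small helper like the one below can show what it contains before anything is added to the workspace. This is only a sketch: `inspect_rda` is a hypothetical helper, and the names of the objects stored in the shared file are not stated in this thread.

```r
# Hypothetical helper: load an .Rda file into its own environment and
# report the class of each object it contains, leaving the global
# workspace untouched.
inspect_rda <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  lapply(mget(obj_names, envir = env), class)
}
```

For example, `inspect_rda("TCGA.RNA.Rda")` would list the stored objects, assuming the file was saved under that name in the working directory.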

rashindrie commented 2 years ago

It works, thanks @mabraao!