Questions of pan-cancer TCGA gene expression data

xiw588 commented 1 year ago

Hi Shixiang,

I am wondering why the R code you provided online gives me only 6000+ identifiers (genes), while the dataset page on the official Xena website lists more than 20000: https://xenabrowser.net/datapages/?dataset=EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https://pancanatlas.xenahubs.net

Can you please clarify the difference between them? Thanks

# Load R package
library('UCSCXenaTools')

# Generate dataset(s) information
dataset_query <- structure(
  list(
    hosts = "https://pancanatlas.xenahubs.net",
    datasets = "EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena",
    url = "https://pancanatlas.xenahubs.net/download/EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz",
    browse = "https://xenabrowser.net/datapages/?dataset=EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https://pancanatlas.xenahubs.net"
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -1L)
)

# Download dataset(s)
dl <- XenaDownload(dataset_query,
                   destdir = './', # Download to the current working directory
                   download_probeMap = TRUE,
                   trans_slash = TRUE)

# Load dataset(s) into R
datasets <- XenaPrepare(dl)
# Check data
datasets

ShixiangWang commented 1 year ago

The only explanation I can think of is that you got an incomplete file. Showing the console output you get when you run XenaDownload() would clarify this.
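One quick way to test for an incomplete file is to count its rows and compare that with the 20000+ identifiers listed on the dataset page. This is a minimal sketch in base R: count_lines() is a hypothetical helper, and the demo file is a tiny made-up stand-in, not the real dataset.

```r
# Count the lines of a (possibly gzipped) text file in chunks, so the whole
# expression matrix never has to sit in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Self-contained demo on a tiny stand-in file (header + 3 made-up gene rows)
demo_path <- tempfile(fileext = ".gz")
con <- gzfile(demo_path, open = "wt")
writeLines(c("sample\tS1\tS2",
             "GENE_A\t1.0\t2.0",
             "GENE_B\t0.5\t0.0",
             "GENE_C\t3.2\t1.1"), con)
close(con)

n_genes <- count_lines(demo_path) - 1L  # subtract the header line
n_genes
```

For the real download, point count_lines() at the .gz file that XenaDownload() saved. A row count far below the number on the dataset page suggests the download was cut short.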

For such a big dataset, I recommend adding the options described at https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html#how-to-resume-file-from-breakpoint so that an interrupted download can be resumed.

You can also download it with the wget command in a terminal:

wget -c https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz
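The same resumable download can also be started from within R. This is a minimal sketch using base R's download.file() with the wget method, assuming wget is installed and on the PATH. download_with_resume() is a hypothetical helper name, and the URL is the one from the command above.

```r
# Hypothetical helper: let wget continue a partially downloaded file ("-c")
# instead of starting over. Requires wget to be installed on the system.
download_with_resume <- function(url, destfile) {
  status <- utils::download.file(url, destfile = destfile,
                                 method = "wget", extra = "-c", mode = "wb")
  # download.file() returns 0 (invisibly) on success
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}

pancan_url <- "https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz"

# Not run here because it needs network access:
# download_with_resume(pancan_url,
#                      "EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz")
```

If the call is interrupted, running it again with the same destfile should pick up where the partial file ends.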

xiw588 commented 1 year ago

Hi Shixiang,

Thank you so much for your help! I have a follow-up question about the normalized pan-cancer gene expression data. Have you noticed that it contains some negative values? The official documentation says the values are log2(RSEM + 1), which, as far as I understand, should not produce any negative values.

Thanks in advance!

ShixiangWang commented 1 year ago

Hi, the Xena dataset page (https://xenabrowser.net/datapages/?dataset=EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https://pancanatlas.xenahubs.net) says the unit is log2(norm + 1). So the negative values are due to some of the operations described at https://www.synapse.org/#!Synapse:syn4976363
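To make the arithmetic concrete: log2(x + 1) is non-negative only when x >= 0. Negative entries therefore mean that either the input to the transform fell below zero or the values were adjusted afterwards. Which of these applies to this dataset is documented on the Synapse page, not here. The sketch below uses made-up numbers and a hypothetical helper, summarize_negatives(), only to show how the negatives in a loaded matrix could be quantified.

```r
# log2(x + 1) cannot be negative for x >= 0 ...
raw <- c(0, 1, 3, 1023)
log2(raw + 1)

# ... but it becomes negative as soon as the input lies in (-1, 0)
log2(-0.5 + 1)

# Hypothetical helper: count the negative entries of a numeric matrix,
# ignoring missing values.
summarize_negatives <- function(mat) {
  neg <- !is.na(mat) & mat < 0
  list(
    n_negative = sum(neg),
    fraction   = sum(neg) / sum(!is.na(mat)),
    min_value  = if (any(neg)) min(mat[neg]) else NA_real_
  )
}

# Made-up example matrix (genes x samples), not real TCGA values
toy <- matrix(c(2.1, -0.3, 0, NA, 5.4, -1.2), nrow = 2,
              dimnames = list(c("GENE_A", "GENE_B"), c("S1", "S2", "S3")))
res <- summarize_negatives(toy)
res
```

To apply this to the data returned by XenaPrepare(), the expression columns would first need converting to a numeric matrix. For example, as.matrix(datasets[, -1]) would work, assuming the first column holds the gene identifiers.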

openbiox / UCSCXenaShiny

Questions of pan-cancer TCGA gene expression data #251