ropensci / UCSCXenaTools

:package: An R package for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq https://cran.r-project.org/web/packages/UCSCXenaTools/
https://docs.ropensci.org/UCSCXenaTools
GNU General Public License v3.0

how to get log(TPM+1) values #44

Open sunta3iouxos opened 3 weeks ago

sunta3iouxos commented 3 weeks ago

Thank you for this tool. I am new to TCGA data, but I want to do some analysis and would like to download TPM-normalised values so that I can combine them with my own RNA-seq data. Since I want to run GSVA or singscore, I think TPM should be more appropriate than percentile ranking. Following some tutorials, I got values that look more scaled than TPM-normalised. Is there a way to get TPM values with UCSCXenaTools? This is the code (taken from https://github.com/XSLiuLab/tumor-immunogenicity-score):

library(UCSCXenaTools)
library(dplyr)
xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
#Create data queries and download them:
# download_xena_pancan, eval=FALSE
xe_clinical.query <- XenaQuery(xe_clinical)
xe_clinical.download <- XenaDownload(xe_clinical.query,
  destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE
)

xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
  destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
)
# hide_download_pancan, include=FALSE
if (!dir.exists("UCSC_Xena")) {
  xe_clinical.query <- XenaQuery(xe_clinical)
  xe_clinical.download <- XenaDownload(xe_clinical.query,
    destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE
  )

  xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
  xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
    destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
  )
}

The author of the code mentions: The RNASeq data we downloaded are pancan normalized. For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: [here](https://docs.google.com/document/d/1q-7Tkzd7pci4Rz-_IswASRMRzYrbgx1FTTfAWOyHbmk/edit?usp=sharing)

Do you have any recommendations on this? Theodoros

github-actions[bot] commented 3 weeks ago

Thanks for reporting, Shixiang will reply as soon as possible:)

ShixiangWang commented 3 weeks ago

Hi, for simple datasets, you can find the count data in the gdc hub, and transform it into TPM format.
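
The count-to-TPM step suggested above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the maintainer's own method: it assumes the Xena GDC count matrices are stored as log2(count + 1), and it needs a gene length for every row, which has to come from the annotation used for counting. The function names and the toy numbers are hypothetical.

```r
# Assumption: Xena GDC count matrices are stored as log2(count + 1),
# so this undoes the transform to recover raw counts.
xena_log2_to_counts <- function(log2_counts) {
  2^log2_counts - 1
}

# Convert a gene x sample matrix of raw counts to TPM.
# gene_length_bp: one length (in base pairs) per row of `counts`
# (hypothetical input; take it from the annotation used for counting).
counts_to_tpm <- function(counts, gene_length_bp) {
  stopifnot(nrow(counts) == length(gene_length_bp), all(gene_length_bp > 0))
  rpk <- counts / (gene_length_bp / 1000)  # reads per kilobase, row-wise
  t(t(rpk) / colSums(rpk)) * 1e6           # scale each sample to sum to 1e6
}

# Toy data: 3 genes x 2 samples (made-up numbers).
counts <- matrix(c(10, 20, 30, 0, 5, 15), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(1000, 2000, 3000)

tpm <- counts_to_tpm(counts, gene_length_bp)
log_tpm <- log2(tpm + 1)  # log2(TPM + 1), a common input scale for GSVA
```

Every column of `tpm` sums to one million, which is a quick sanity check after the conversion.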

sunta3iouxos commented 3 weeks ago

Thank you for this, but it seems that I cannot download the counts:

library(UCSCXenaTools)
XE <- XenaGenerate(subset = XenaHostNames == "gdcHub")
XE %>% XenaFilter(filterDatasets = "clinical") -> XE_clinical
XE %>% XenaFilter(filterDatasets = "htseq_counts") -> XE_rna_counts
#download gdc
#download clinical information, this one works
XE_clinical.query <- XenaQuery(XE_clinical)
XE_clinical.download <- XenaDownload(XE_clinical.query,
                                     destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE, force = TRUE
)
#try to download the counts
XE_rna_counts.query <- XenaQuery(XE_rna_counts)
XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                       destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
)
if (!dir.exists("UCSC_Xena")) {
    XE_clinical.query <- XenaQuery(XE_clinical)
    XE_clinical.download <- XenaDownload(XE_clinical.query,
                                         destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE
    )

    # Use the counts object defined above (XE_rna_pancan is not defined in this script)
    XE_rna_counts.query <- XenaQuery(XE_rna_counts)
    XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                           destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
    )
}

Downloading of all GDC counts fails:

Downloading TCGA-LAML.htseq_counts.tsv.gz
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #2
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #3
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
Tried 3 times but failed, please check your internet connection!

This is what the query looks like:

> head(XE_rna_counts.download)
                     hosts                       datasets
1 https://gdc.xenahubs.net     TCGA-BLCA.htseq_counts.tsv
2 https://gdc.xenahubs.net     TCGA-LUSC.htseq_counts.tsv
3 https://gdc.xenahubs.net     TCGA-ESCA.htseq_counts.tsv
4 https://gdc.xenahubs.net     TARGET-RT.htseq_counts.tsv
5 https://gdc.xenahubs.net MMRF-COMMPASS.htseq_counts.tsv
6 https://gdc.xenahubs.net     TCGA-MESO.htseq_counts.tsv
                                                                  url                         fileNames
1     https://gdc.xenahubs.net/download/TCGA-BLCA.htseq_counts.tsv.gz     TCGA-BLCA.htseq_counts.tsv.gz
2     https://gdc.xenahubs.net/download/TCGA-LUSC.htseq_counts.tsv.gz     TCGA-LUSC.htseq_counts.tsv.gz
3     https://gdc.xenahubs.net/download/TCGA-ESCA.htseq_counts.tsv.gz     TCGA-ESCA.htseq_counts.tsv.gz
4     https://gdc.xenahubs.net/download/TARGET-RT.htseq_counts.tsv.gz     TARGET-RT.htseq_counts.tsv.gz
5 https://gdc.xenahubs.net/download/MMRF-COMMPASS.htseq_counts.tsv.gz MMRF-COMMPASS.htseq_counts.tsv.gz
6     https://gdc.xenahubs.net/download/TCGA-MESO.htseq_counts.tsv.gz     TCGA-MESO.htseq_counts.tsv.gz
                                                       destfiles
1     UCSC_Xena/TCGA/counts_RNAseq/TCGA-BLCA.htseq_counts.tsv.gz
2     UCSC_Xena/TCGA/counts_RNAseq/TCGA-LUSC.htseq_counts.tsv.gz
3     UCSC_Xena/TCGA/counts_RNAseq/TCGA-ESCA.htseq_counts.tsv.gz
4     UCSC_Xena/TCGA/counts_RNAseq/TARGET-RT.htseq_counts.tsv.gz
5 UCSC_Xena/TCGA/counts_RNAseq/MMRF-COMMPASS.htseq_counts.tsv.gz
6     UCSC_Xena/TCGA/counts_RNAseq/TCGA-MESO.htseq_counts.tsv.gz

sunta3iouxos commented 3 weeks ago

How can I get those counts using UCSCXenaTools?

This is what I am looking for, RSEM and log(TPM+1): https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
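
For the dataset linked above, one possible route is sketched below. It rests on assumptions that are worth checking on the dataset page: that the Toil hub is exposed in UCSCXenaTools under the host name "toilHub", and that tcga_RSEM_gene_tpm is stored as log2(TPM + 0.001). The helpers `relog_tpm` and `download_toil_tpm` are hypothetical names; the download helper is defined but not called here because it needs network access.

```r
# Re-express log2(TPM + old_pseudo) values as log2(TPM + new_pseudo).
# Assumption: the Toil dataset uses a pseudocount of 0.001.
relog_tpm <- function(x, old_pseudo = 0.001, new_pseudo = 1) {
  tpm <- pmax(2^x - old_pseudo, 0)  # clamp tiny negatives from rounding
  log2(tpm + new_pseudo)
}

# Hypothetical helper: query, download and load the Toil TPM matrix.
# Requires the UCSCXenaTools package and an internet connection.
download_toil_tpm <- function(destdir = "UCSC_Xena/Toil") {
  xe <- UCSCXenaTools::XenaGenerate(subset = XenaHostNames == "toilHub")
  xe <- UCSCXenaTools::XenaFilter(xe, filterDatasets = "^tcga_RSEM_gene_tpm$")
  query <- UCSCXenaTools::XenaQuery(xe)
  dl <- UCSCXenaTools::XenaDownload(query, destdir = destdir, trans_slash = TRUE)
  UCSCXenaTools::XenaPrepare(dl)
}

# Toy check with made-up TPM values of 0, 1 and 99.
toy <- log2(c(0, 1, 99) + 0.001)
relog_tpm(toy)
```

The conversion goes back to linear TPM before adding the new pseudocount, because simply adding a constant on the log scale would not give log2(TPM + 1).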

ShixiangWang commented 3 weeks ago

Hi @sunta3iouxos , please rerun the code with the latest version from GitHub

remotes::install_github("ropensci/UCSCXenaTools")

Also, change XE <- XenaGenerate(subset = XenaHostNames == "gdcHub") to XE <- XenaGenerate(subset = XenaHostNames == "gdcHubV18"), as UCSC Xena updated the data source.

sunta3iouxos commented 2 weeks ago

I will do so and report back.

sunta3iouxos commented 1 week ago

This one works. Could you please help with this: "For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here." As I understand it, TCGA normalises the data with the EB++ algorithm to avoid batch effects, but they also state that if you need to add your own dataset, it may be better to normalise by percentile ranking. Any clues on how to do this? I have never normalised data using that approach.

Is this approach something related to this: https://www.nature.com/articles/s41598-020-72664-6#Sec2

ShixiangWang commented 1 week ago

Check https://www.r-bloggers.com/2024/03/mastering-quantile-normalization-in-r-a-step-by-step-guide/ and see more at https://www.google.com/search?q=percentile+normalization+in+r&sca_esv=5487afd26f79d4e0&sxsrf=ADLYWIL88t2cjXP4xQNDR8JUUzRTbtmP2g%3A1731485684107&source=hp&ei=9F80Z9nEBKrh0-kPja2O0Qc&iflsig=AL9hbdgAAAAAZzRuBHEtAsgdwPxbLON8SrenTMM22rhN&ved=0ahUKEwjZjojp7tiJAxWq8DQHHY2WI3oQ4dUDCBY&uact=5&oq=percentile+normalization+in+r&gs_lp=Egdnd3Mtd2l6Ih1wZXJjZW50aWxlIG5vcm1hbGl6YXRpb24gaW4gcjIFECEYoAFI4TdQAFilNnAAeACQAQCYAeABoAH-KKoBBjAuMjYuNbgBA8gBAPgBAvgBAZgCF6ACuh_CAgUQABiABMICCBAAGIAEGMsBwgIEEAAYHsICCBAAGAUYChgewgIGEAAYBRgewgIGEAAYCBgewgIIEAAYgAQYogSYAwCSBwYwLjE4LjWgB_F9&sclient=gws-wiz
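
A per-sample percentile ranking can be sketched in base R as below. Note that this is a different technique from the quantile normalization covered in the linked post: here each sample's genes are ranked independently and the ranks are rescaled to 0-100. Whether this matches exactly how the Xena "percentile" datasets were produced is an assumption; the function name and toy matrix are hypothetical.

```r
# Rank genes within each sample (column) and rescale ranks to 0-100.
# Ties get their average rank; NA values stay NA.
# Needs at least two non-missing genes per sample.
percentile_rank <- function(expr) {
  apply(expr, 2, function(x) {
    r <- rank(x, ties.method = "average", na.last = "keep")
    (r - 1) / (sum(!is.na(x)) - 1) * 100
  })
}

# Toy gene x sample matrix (made-up values).
expr <- matrix(c(5, 1, 3, 9,
                 0, 0, 2, 8),
               nrow = 4,
               dimnames = list(paste0("g", 1:4), c("s1", "s2")))

pr <- percentile_rank(expr)
```

Applying the same function to both the TCGA matrix and your own matrix, restricted to the genes they share, puts every sample on the same 0-100 scale before combining them.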