openbiox / UCSCXenaShiny

📊 An R package for interactively exploring UCSC Xena https://xenabrowser.net/datapages/; Book: https://lishensuo.github.io/UCSCXenaShiny_Book; App online: https://shiny.hiplot.cn/ucsc-xena-shiny/, https://shiny.zhoulab.ac.cn/UCSCXenaShiny
https://openbiox.github.io/UCSCXenaShiny/
GNU General Public License v3.0

TCGA: Molecule-Molecule correlation - Gene methylation and mRNA metrics #263

Closed quiquemedina closed 1 year ago

quiquemedina commented 1 year ago

Hi there UCSCXena Shiny developers!

I have two questions on this functionality.

Q1 There seems to be an inconsistency in the mRNA metric values as compared with the values from the UCSCXena browser.

Example: Gene: RCAN2, PanCan function:

vis_gene_cor(
  Gene1 = "RCAN2",
  Gene2 = "RCAN2",
  data_type1 = "methylation",
  data_type2 = "mRNA",
  use_regline = TRUE,
  purity_adj = TRUE,
  alpha = 0.5,
  color = "#E8211E",
  filter_tumor = TRUE
)

Output: See attachment

For example, in ESCA, for patient_ID TCGA-2H-A9GF-01, the mean CpG methylation is 0.4444. This value matches the value from the UCSCXena browser (attached df). However, the mRNA value is 2.578, which is quite different from the value in the UCSCXena browser: RCAN2 [log2(norm_count + 1)] = 7.968.

How, then, is expression estimated in the molecule-molecule correlation in the Shiny app?

Q2 I notice that the functions that deal with the gene methylation attribute are based on the mean of CpG values. I would like to be able to exclude specific CpG sites from the metric and apply the third quartile (Q3) to the remaining sites. Could you please provide the underlying code snippet that estimates methylation across the gene's CpGs? That snippet would be instrumental.

Result output file from UCSCXena browser: RCAN2 has 25 CpG sites: cg21115430 cg19452802 cg21088534 cg08701952 cg19517238 cg08392126 cg16229376 cg06818823 cg02183231 cg01380710 cg06335741 cg03185843 cg00220575 cg23053977 cg10994263 cg10852698 cg00588393 cg04652496 cg19083007 cg00782811 cg06665622 cg17142149 cg18116815 cg08250135 cg26677394

Regards,

Enrique Medina-Acosta, a mega satisfied user!

TCGA - Molecule -Molecule correlation - Gene methylation and mRNA metrics.pdf

quiquemedina commented 1 year ago

Here is the df RCAN2 methylayion 450k problems UCSCXena mean vs Q3.xlsx

ShixiangWang commented 1 year ago

@quiquemedina Thanks for your feedback, as always. I will handle this in my free time and respond in detail ASAP. :)

ShixiangWang commented 1 year ago

@quiquemedina For Q1, this may be due to the dataset used to get the gene expression. UCSCXenaShiny uses the pan-cancer TPM dataset (TcgaTargetGtex_rsem_gene_tpm) at https://xenabrowser.net/datapages/?dataset=TcgaTargetGtex_rsem_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
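
As a rough illustration of why the two numbers in Q1 cannot be compared directly, the sketch below back-transforms both reported values. The Xena browser value is stated above as log2(norm_count + 1). That the toil TPM dataset is stored as log2(tpm + 0.001) is an assumption based on the unit listed for that dataset, and it should be checked on the dataset page.

```r
# Values reported in this issue for RCAN2 in TCGA-2H-A9GF-01
shiny_value   <- 2.578  # from UCSCXenaShiny (toil TPM dataset)
browser_value <- 7.968  # from the Xena browser, log2(norm_count + 1)

# Assumption: the TPM dataset is stored as log2(tpm + 0.001)
tpm        <- 2^shiny_value - 0.001
norm_count <- 2^browser_value - 1

# The two quantities are measured on different scales, so their log values differ
round(c(tpm = tpm, norm_count = norm_count), 2)
```

Under these assumptions the two values describe different quantities (TPM versus normalized counts), so a mismatch between them is expected rather than a sign of a wrong lookup.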

ShixiangWang commented 1 year ago

Also, I found the code in Q1 raised an error:

> d = vis_gene_cor(
+   Gene1 = "RCAN2",
+   Gene2 = "RCAN2",
+   data_type1 = "methylation",
+   data_type2 = "mRNA",
+   use_regline = TRUE,
+   purity_adj = TRUE,
+   alpha = 0.5,
+   color = "#E8211E",
+   filter_tumor = TRUE
+ )
More info about dataset please run following commands:
  library(UCSCXenaTools)
  XenaGenerate(subset = XenaDatasets == "jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena") %>% XenaBrowse()
More info about dataset please run following commands:
  library(UCSCXenaTools)
  XenaGenerate(subset = XenaDatasets == "TcgaTargetGtex_rsem_gene_tpm") %>% XenaBrowse()
Error in data.frame(sample = t2$sample, tissue = t2$tissue, type2 = t2$type2,  : 
  arguments imply differing number of rows: 8808, 18102

It was due to a bug when joining data from the two molecular types. I have fixed it; you can install the latest version from GitHub. The result I got:

(image attachment)
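
For context, an error of the form "arguments imply differing number of rows" typically appears when two per-sample vectors of different lengths are put into one data frame without first matching them by sample. The following is a minimal sketch of matching by shared sample ID. It is not the fix that was committed to the package; the sample IDs and values are made up for illustration.

```r
# Toy per-sample values; sample IDs and numbers are made up for illustration
methylation <- c("S1" = 0.44, "S2" = 0.31, "S3" = 0.52, "S4" = 0.27)
expression  <- c("S2" = 3.1, "S3" = 1.9, "S4" = 3.6, "S5" = 2.2, "S6" = 4.0)

# Keep only samples present in both data types before building the data frame
shared <- intersect(names(methylation), names(expression))
df <- data.frame(
  sample      = shared,
  methylation = unname(methylation[shared]),
  expression  = unname(expression[shared])
)

# Correlation computed on the matched samples only
rho <- cor(df$methylation, df$expression, method = "spearman")
rho
```

Matching on the intersection of sample IDs guarantees that both columns have the same length and that each row refers to a single sample.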

ShixiangWang commented 1 year ago

@quiquemedina

For Q2, the following code can help query the data values. However, there are currently no functions to easily analyze specific CpG sites. For gene-level analysis, the average across all CpG sites is used.

# Hub URLs: see UCSCXenaTools:::.xena_hosts
z = UCSCXenaShiny:::try_query_value(
  "https://pancanatlas.xenahubs.net",
  "jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena",
  identifiers = c("cg21115430", "cg19452802"),  # query individual CpG probes
  samples = NULL,
  use_probeMap = FALSE
)
head(z[, 1:5])
> head(z[, 1:5])
           TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J4-01 TCGA-OR-A5J5-01
cg21115430          0.2383         0.05912         0.05925          0.2141         0.07447
cg19452802          0.4664         0.10220         0.07793          0.3513         0.83050
ShixiangWang commented 1 year ago

To query more detailed data from UCSC Xena, I recommend reading https://shixiangwang.github.io/home/en/tools/ucscxenatools-api/. All of the API functions are implemented in UCSCXenaTools.
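
Given a probe-by-sample matrix like `z` above, excluding selected CpG sites and summarizing the rest per sample can be sketched in base R as follows. This is only an illustration on a toy matrix: the first two rows copy values from the output above, while `cg_toy_1` and `cg_toy_2` are hypothetical probes with made-up values.

```r
# Probe-by-sample beta values (rows = CpG probes, columns = samples)
beta <- matrix(
  c(0.2383, 0.05912,
    0.4664, 0.10220,
    0.6000, 0.20000,   # hypothetical probe, made-up values
    0.8000, 0.40000),  # hypothetical probe, made-up values
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("cg21115430", "cg19452802", "cg_toy_1", "cg_toy_2"),
    c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01")
  )
)

# Drop the CpG sites to exclude
exclude <- c("cg21115430")
kept <- beta[!rownames(beta) %in% exclude, , drop = FALSE]

# Per-sample mean and third quartile (Q3) across the remaining sites
gene_mean <- colMeans(kept, na.rm = TRUE)
gene_q3   <- apply(kept, 2, quantile, probs = 0.75, na.rm = TRUE, names = FALSE)
```

Note that `quantile()` uses its default interpolation method here (type 7), so Q3 values may differ slightly from tools that use another quantile definition.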

quiquemedina commented 1 year ago

@ShixiangWang, many thanks for the prompt response and actions.

For Q2, could you please share the underlying R code snippet that filters the CpG sites in a given gene and estimates the mean of the values? For example: Gene: RCAN2, PanCan function:

vis_gene_cor(
  Gene1 = "RCAN2",
  Gene2 = "RCAN2",
  data_type1 = "methylation",
  data_type2 = "mRNA",
  use_regline = TRUE,
  purity_adj = TRUE,
  alpha = 0.5,
  color = "#E8211E",
  filter_tumor = TRUE
)

Thank you.

ShixiangWang commented 1 year ago

@quiquemedina No CpG sites were filtered out in vis_gene_cor(). In the backend, we directly queried the mean value of gene methylation from the UCSC Xena server.

If I understand correctly, when analyzing RCAN2, for example, you want to remove some CpG sites before calculating the mean methylation level and then compute the correlation with gene expression? If so, I may take some time to add an option to this function to meet that need.

quiquemedina commented 1 year ago

@ShixiangWang, thank you for your prompt response. I'd appreciate it if you could implement the feature to filter specific CpG sites and allow for estimating not only mean values but also Q3 metrics. This enhancement will enable more precise data analysis, allowing researchers to better understand the distribution and skewness of their data, especially when outliers might influence the mean. Having the Q3 metrics available can provide a more comprehensive view of the data spread and is essential for certain statistical evaluations.

ShixiangWang commented 1 year ago

@quiquemedina Got it. I will try to enhance the data query and analysis of methylation data.

ShixiangWang commented 1 year ago

@quiquemedina Please check if this fits your needs. You can install the latest version from GitHub and test it locally. I will take some time to see how to modify the caller and the web UI.

z = UCSCXenaShiny:::try_query_value(
  "https://pancanatlas.xenahubs.net",
  "jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena",
  identifiers = "RCAN2",
  rule_out = c("cg21115430", "cg19452802"),  # CpG sites to exclude
  aggr = "Q75",  # quantile 0.75; available options: mean, Q0, Q25, Q50, Q75, Q100
  samples = NULL)
z[, 1:5, drop = FALSE]

> z[, 1:5, drop = F]
                 TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J4-01 TCGA-OR-A5J5-01
aggr_methy_value          0.4664          0.2349          0.2413          0.6791          0.7915

Code: https://github.com/openbiox/UCSCXenaShiny/commit/22963201b18981117da8a20b6aea961b5b499927
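
To make the meaning of the `aggr` labels concrete, here is a minimal sketch of how such a label could map onto a summary statistic, assuming that `Q0` to `Q100` denote quantiles from 0 to 1 as the comment above suggests. `aggregate_sites` is a hypothetical helper written for this illustration; it is not the package's implementation (see the linked commit for that).

```r
# Hypothetical helper: map an aggregation label to a summary of CpG values.
# Not the package's code; it only illustrates what the labels are assumed to mean.
aggregate_sites <- function(values, aggr = "mean") {
  if (aggr == "mean") {
    return(mean(values, na.rm = TRUE))
  }
  if (!grepl("^Q[0-9]+$", aggr)) {
    stop("Unsupported aggregation label: ", aggr)
  }
  # "Q75" -> 0.75, "Q0" -> 0 (minimum), "Q100" -> 1 (maximum)
  prob <- as.numeric(sub("^Q", "", aggr)) / 100
  if (prob > 1) {
    stop("Quantile label out of range: ", aggr)
  }
  quantile(values, probs = prob, na.rm = TRUE, names = FALSE)
}

# Made-up beta values for the CpG sites of one sample
site_values <- c(0.10, 0.20, 0.40, 0.80)
labels <- c("mean", "Q0", "Q25", "Q50", "Q75", "Q100")
summary_values <- sapply(labels, function(a) aggregate_sites(site_values, a))
summary_values
```

Comparing `mean` with `Q50` and `Q75` on the same sites gives a quick sense of how skewed the per-site values are for a sample.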

quiquemedina commented 1 year ago

@ShixiangWang, the snippet worked! I appreciate all the effort and commitment; you always respond to me so promptly.

ShixiangWang commented 1 year ago

I am closing this now. @lishensuo is working on the corresponding UI; it should be available in the near future.