quiquemedina commented 1 month ago

Dear UCSCXena developers,

While using the tcga_surv_get function to assess the prognostic value of the CNV feature of gene signatures in the latest versions (UCSCXenaShiny 2.1.0 and 2.2.0), both in the browser and in the local Shiny app, we found that the distribution of the effect classes (normal, duplicated, and deleted) differs from that in the previous version (UCSCXenaShiny v1.1.8).

Specifically, for all genes and gene signatures tested across all 33 cancer types, the number of samples in the normal class is unexpectedly lower than in the other classes, or the class is absent altogether. This discrepancy seems to be limited to the CNV feature.

To illustrate the observed discrepancy, we present an extreme case where no normal control samples are plotted in the latest version, whereas a significant number of normal samples are plotted in the older UCSCXenaShiny v1.1.8 version.

Details:
• Item: (A2M + AGTR1 + CCND3 + CD163 + CYP1B1 + EDN1 + GHSR + GPT + IL15RA + MUC4 + PRL + SDC2 + TIMP2 + TMC8 + YBX3 + ZFP36L2)
• Cancer type: BRCA
• Survival metric: OS
• Cutoff: optimal
• Samples: 1060

Plots:

1. Plot using the online UCSCXenaShiny OLD version:

2. Plot using the online UCSCXenaShiny NEW version:

3. Plot using the RStudio app_run of the UCSCXenaShiny NEW version:

4. Plot using the RStudio code snippet of the UCSCXenaShiny NEW version:

Thank you for your attention to this matter.

R code snippet

# Loading necessary libraries
library(UCSCXenaShiny)
library(survival)
library(dplyr)

# 1. Obtaining data for the item gene signature in the BRCA cohort with CNV profile
data <- tcga_surv_get(
  item = "(A2M + AGTR1 + CCND3 + CD163 + CYP1B1 + EDN1 + GHSR + GPT + IL15RA + MUC4 + PRL + SDC2 + TIMP2 + TMC8 + YBX3 + ZFP36L2)", # Gene or protein identifier
  TCGA_cohort = "BRCA", # TCGA BRCA cohort (breast cancer)
  profile = "cnv", # Molecular profile (in this case, CNV)
  TCGA_cli_data = dplyr::full_join(load_data("tcga_clinical"), load_data("tcga_surv"), by = "sample")
)

# 2. (Optional) Filter the data if necessary. It may not be necessary in this case.

# 3. Generating the Kaplan-Meier (K-M) plot
tcga_surv_plot(
  data = data, # The subset of data returned by tcga_surv_get
  time = "OS.time", # Time column
  status = "OS", # Status column
  cutoff_mode = "Auto", # Automatic cutoff mode
  # cutpoint = c(50, 50), # Cut points (percentile)
  cnv_type = c("Duplicated", "Normal", "Deleted"), # Types of CNV
  profile = "cnv" # Molecular profile (CNV)
)

# Same plot, stored in b so that the plotted data can be inspected
b <- tcga_surv_plot(
  data = data, # The subset of data returned by tcga_surv_get
  time = "OS.time", # Time column
  status = "OS", # Status column
  cutoff_mode = "Auto", # Automatic cutoff mode
  # cutpoint = c(50, 50), # Cut points (percentile)
  cnv_type = c("Duplicated", "Normal", "Deleted"), # Types of CNV
  profile = "cnv" # Molecular profile (CNV)
)

# c <- b[["plot"]][["data"]]
# export(c, "c.xlsx")

ShixiangWang commented 1 month ago

@lishensuo take a look

lishensuo commented 1 month ago

Thank you for your question. The difference comes from the default CNV dataset. In the old version, the v1 Shiny module used the TCGA pan-cancer gene-level copy number (gistic2_thresholded) dataset for KM survival analysis. In the new version, the v2 Shiny module uses the TCGA pan-cancer gene-level copy number (gistic2) dataset instead.
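As a rough illustration of why the dataset choice can empty the normal class, here is a minimal base-R sketch with simulated values (not TCGA data). The grouping rule in `classify_cnv` is an assumption made for this example, not code taken from the package: values above 0 are called duplicated, values below 0 deleted, and only an exact 0 normal.

```r
# Toy illustration with simulated values; this is not real TCGA data.
set.seed(1)

# Assumed grouping rule (hypothetical helper, not the package's own code):
# > 0 is "Duplicated", < 0 is "Deleted", exactly 0 is "Normal".
classify_cnv <- function(x) {
  ifelse(x > 0, "Duplicated", ifelse(x < 0, "Deleted", "Normal"))
}

cnv_levels <- c("Deleted", "Normal", "Duplicated")

# Thresholded calls are integers in {-2, -1, 0, 1, 2}, so exact zeros are common.
thresholded <- sample(-2:2, size = 1000, replace = TRUE)

# Continuous estimates are real-valued, so an exact 0 almost never occurs.
continuous <- rnorm(1000, mean = 0, sd = 0.3)

thresholded_counts <- table(factor(classify_cnv(thresholded), levels = cnv_levels))
continuous_counts <- table(factor(classify_cnv(continuous), levels = cnv_levels))

print(thresholded_counts)
print(continuous_counts)
```

Under this assumed rule, the integer-valued input keeps a populated normal class, while the real-valued input leaves it empty. The same reasoning would apply to a signature that sums several genes, since a sum of real-valued estimates rarely equals exactly 0.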

Although the default datasets cannot be changed in the v2 Shiny TPC modules, users can easily select a dataset of interest for the downstream analysis using R code in RStudio.

With your example data, the expected code is below.

opt_pancan = .opt_pancan
opt_pancan$toil_cnv$use_thresholded_data = TRUE

data <- tcga_surv_get(
  item = "(A2M + AGTR1 + CCND3 + CD163 + CYP1B1 + EDN1 + GHSR + GPT + IL15RA + MUC4 + PRL + SDC2 + TIMP2 + TMC8 + YBX3 + ZFP36L2)", # Gene or protein identifier
  TCGA_cohort = "BRCA", # TCGA BRCA cohort (breast cancer)
  profile = "cnv", # Molecular profile (in this case, CNV)
  TCGA_cli_data = dplyr::full_join(load_data("tcga_clinical"), load_data("tcga_surv"), by = "sample"),
  opt_pancan = opt_pancan
)

tcga_surv_plot(
  data = data, # The subset of data returned by tcga_surv_get
  time = "OS.time", # Time column
  status = "OS", # Status column
  cutoff_mode = "Auto", # Automatic cutoff mode
  cnv_type = c("Duplicated", "Normal", "Deleted"), # Types of CNV
  profile = "cnv" # Molecular profile (CNV)
)

PS: The minor difference between the Shiny app and the R code is due to the filtering step in the Shiny app, which discards samples without eligible clinical metadata.
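To see how such a filter can change the sample count, here is a small base-R sketch on a made-up data frame. The column names OS and OS.time follow the code above; the filter itself (dropping rows with missing survival status or time) is a hypothetical stand-in, since the exact eligibility rule used by the Shiny app is not described here.

```r
# Toy data frame standing in for survival data; all values are made up.
toy <- data.frame(
  sample = paste0("S", 1:6),
  value = c(-1, 0, 1, 2, 0, -2),
  OS = c(1, 0, NA, 1, 0, 1),
  OS.time = c(120, 300, 450, NA, 800, 95)
)

# Hypothetical eligibility filter: keep samples with complete survival information.
eligible <- toy[!is.na(toy$OS) & !is.na(toy$OS.time), ]

# Number of samples lost to the filter.
n_dropped <- nrow(toy) - nrow(eligible)

cat("Samples before filtering:", nrow(toy), "\n")
cat("Samples after filtering:", nrow(eligible), "\n")
cat("Samples dropped:", n_dropped, "\n")
```

Comparing row counts before and after a filter like this on the real data would be one way to check whether the gap between the app and the R code comes from incomplete clinical records.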

ShixiangWang commented 1 month ago

Set an option for both thresholded and non-thresholded CNV in the app?

lishensuo commented 1 month ago

Set an option for both thresholded and non-thresholded CNV in the app?

I considered this idea before and decided it was not necessary. In the TPC modules, we aim to provide quick and easy analysis with a fixed selection of datasets. Users who need more flexibility can switch datasets in the TPC pipelines of the Shiny app or with R code in a local RStudio session; both are supported.

quiquemedina commented 1 month ago

Dear @lishensuo,

Thank you very much for the code snippet to set the default CNV dataset.

For our downstream iterative analysis in RStudio, the snippet below resolved the issue:

opt_pancan = .opt_pancan
opt_pancan$toil_cnv$use_thresholded_data = TRUE

However, I strongly believe that browser users would benefit from being able to select the default CNV dataset. We have discussed this issue previously, and while I understand your point about providing quick, preset results, my perspective is that the outcomes from the "vis_unicox_tree" (risk/protection) and "tcga_surv_get" K-M plot (bad/good prognosis) functions should almost always be congruent. Ensuring congruence requires users to select the same dataset for both. This is why I advocate allowing browser users to choose either the default gistic2 or the gistic2 thresholded dataset in the T-P-C modules of UCSCXenaShiny v2. Additionally, I noted that this feature has already been implemented in the T-P-C pipelines.

Also see: https://github.com/openbiox/UCSCXenaShiny/issues/286

Regards,

Enrique

lishensuo commented 1 month ago

Thanks for your suggestion. We have modified .opt_pancan so that the gistic2 thresholded dataset is the default CNV source. You can download the latest version from GitHub now, or visit the online Shiny app after waiting about a day.

quiquemedina commented 1 month ago

Great! Many thanks!

openbiox / UCSCXenaShiny

Discrepancy in CNV Effect Class Distribution in UCSCXenaShiny Versions 2.1.0 and 2.2.0 versus UCSCXenaShiny v1.1.8 #347
