openbiox / UCSCXenaShiny

📊 An R package for interactively exploring UCSC Xena https://xenabrowser.net/datapages/; Book: https://lishensuo.github.io/UCSCXenaShiny_Book; App online: https://shiny.hiplot.cn/ucsc-xena-shiny/, https://shiny.zhoulab.ac.cn/UCSCXenaShiny
https://openbiox.github.io/UCSCXenaShiny/
GNU General Public License v3.0
86 stars 31 forks source link

TCGA:Molecular Profile Cox analysis #225

Closed zxhu1987 closed 2 years ago

zxhu1987 commented 3 years ago

First, thanks very much to the authors, this is a nice program. Second, I realize the HR_log and p value are different from GEPIA platform or TCGAbiolinks. I want to ask how do you set the cutoff for grouping and which method do you use to calculate HR?HR_log is log2HR or log10HR? PS: It would be awesome to upload a document to describe the functions and parameters (more details than "https://rdrr.io/cran/UCSCXenaShiny/" ). Thanks!

Byronxy commented 3 years ago

Hi, Thanks for your feedback!

  1. we use log (natural logarithms) here. plz refer to: https://github.com/openbiox/UCSCXenaShiny/blob/90bb3c05df51ab6b0f42c2ee05acc3c34973ae7d/R/vis_pancan_value.R#L270-L275
  2. we are plan to update the document to avoid confusion.
  3. The result could be slightly different across different platforms (I hope it's generally consistent). Please refer our documents for the data source.
zxhu1987 commented 3 years ago

Thanks for your reply. I compare the survival and cox analysis between your platform and GEPIA. Here are some issues:

  1. Both of survival and cox analysis are based on gene expression. Most sample numbers are the same from both analyses. Why is the sample number of BRCA from TCGA survival analysis 1084, while the number of TCGA molecular profile Cox analysis is 1098? I guess you use the same dataset. Or you use different datasets for each function analysis?
  2. Let’s take TP53 as an example. I guess you also take RSEM to split the groups. Though the sample number from GEPIA has a tiny difference from yours, you both use the same method to analyze survival and cox. Could you give any clue to explain why the readout is totally different? , e.g, BRCA. Thanks!
ShixiangWang commented 3 years ago

@zxhu1987 Thanks for your question, we will dig into it.

To illustrate how the survival (KM) and Cox are performed in TCGA BRCA cohort with gene expression data.

@Byronxy Please check the datasets used in Cox analysis and how it is performed. @longfei8533 Please check the datasets used in Survival analysis and how it is performed.

Paste the result below so we discuss if they are proper and consistent.

Byronxy commented 2 years ago

Hi, @zxhu1987 I try to reproduce your issue. I check the expression data of TP53 in BRCA and download the table image

I found that in expression profile the tumor sample number was 1099. image In survival analysis the sample number was 1098 due to the missing value. The result was consistent. Could you please provide more details? Thanks

ShixiangWang commented 2 years ago

他指的是生存模块的样本量少了,下面这个少了一些。

需要检查下是过滤的问题还是说用的数据集有点不同

image

ShixiangWang commented 2 years ago

KM分析:https://github.com/openbiox/UCSCXenaShiny/blob/90bb3c05df51ab6b0f42c2ee05acc3c34973ae7d/R/tcga_surv.R#L32

数据

临床:

TCGA_cli_data = dplyr::full_join(
                            load_data("tcga_clinical"),
                            load_data("tcga_surv"),
                            by = "sample"
                          )

表达等数据获取:

gd <- query_pancan_value(item, data_type = profile)

@Byronxy 与你的操作有什么不同的吗?

Byronxy commented 2 years ago

KM分析:

https://github.com/openbiox/UCSCXenaShiny/blob/90bb3c05df51ab6b0f42c2ee05acc3c34973ae7d/R/tcga_surv.R#L32

数据

临床:

TCGA_cli_data = dplyr::full_join(
                            load_data("tcga_clinical"),
                            load_data("tcga_surv"),
                            by = "sample"
                          )

表达等数据获取:

gd <- query_pancan_value(item, data_type = profile)

@Byronxy 与你的操作有什么不同的吗?

https://github.com/openbiox/UCSCXenaShiny/blob/90bb3c05df51ab6b0f42c2ee05acc3c34973ae7d/R/vis_pancan_value.R#L220-L231 我把正常的样本去除了

ShixiangWang commented 2 years ago

KM数据本身也没有什么问题,我发现实际可用的样本应该是1097个,有一个样本的时间数据是NA,@Byronxy 你抽空看看代码里有没有做这种处理。我们这边统一保留生存分析中实际会用到的数据。

data <- tcga_surv_get("TP53", "BRCA")
p <- tcga_surv_plot(data, time = "OS.time", status = "OS")
nrow(attr(p, "data"))
> nrow(attr(p, "data"))
[1] 1097

我在检查下上面这个KM图的问题。

ShixiangWang commented 2 years ago

少了10几个样本的原因是 stage 中存在 X这个阶段,但Shiny界面中没有

@Byronxy 临床上这个 X 阶段有意义吗?stage信息是从ajcc_pathologic_tumor_stage提取到的。

Browse[2]> table(data$stage)

      I      II     III      IV Unknown       X 
    182     622     250      20      11      13 
Byronxy commented 2 years ago

ajcc_pathologic_tumor_stage

stage x代表级别不是很确定的,这个样本分析中可以剔除。

ShixiangWang commented 2 years ago

Thanks for your reply. I compare the survival and cox analysis between your platform and GEPIA. Here are some issues:

  1. Both of survival and cox analysis are based on gene expression. Most sample numbers are the same from both analyses. Why is the sample number of BRCA from TCGA survival analysis 1084, while the number of TCGA molecular profile Cox analysis is 1098? I guess you use the same dataset. Or you use different datasets for each function analysis?
  2. Let’s take TP53 as an example. I guess you also take RSEM to split the groups. Though the sample number from GEPIA has a tiny difference from yours, you both use the same method to analyze survival and cox. Could you give any clue to explain why the readout is totally different? , e.g, BRCA. Thanks!

Hi @zxhu1987, for the first question, we have confirmed they are consistent. However, for survival analysis, multiple filter operations are supported in Shiny. Some cases may be filtered out in the process. In the example above, ~13 cases marked as stage X (stage X is not supported in Shiny UI, so the cases are dropped).

As for the second one, we directly obtain data from UCSC Xena Toil hub in real time. You can read the detail of this project at here. In brief,

The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. Using Toil, UCSC's pipeline architecture, we ran this recompute in a little over 3 days on a large AWS cluster for a cost of $1.30 per sample.

I don't know what data GEPIA used. I guess they don't use log transform data or TMP format data, so it can explain the data value difference.

zxhu1987 commented 2 years ago

感谢各位的答复。 对于第一个问题,我重新分析了xena的数据。我目前就检测了几种癌症类型,发现的些许细节报告如下:

你们在提取数据的时候,都过滤掉正常组织,这是常规操作。 但是分析的时候有的癌症把metastasis和recurrent tumor去掉了(比如ESCA, 去掉1个metastasis正好是185, 你们接着又剔除3个TX(CLINICAL_T),然后就剩下182个样本,这个和你们显示出来的样本量匹配),可是有的却保留了(如COAD, xena过滤后掉正常组织和没有检测tp53的组织后正好是288,没有剔除metastasis和recurrent tumor)。

BRCA(1247个原始样本)剔除正常组织和未检测TP53的组织还剩下1104个样品,如果再剔除14个stage X样品,应该还剩下1090个样品。暂时不清楚cox analysis用的1098个样品和survial用的1084个样品怎么算出来的。剔除标准不一致导致些许的结果差异是小问题,可能是你们数据调用出现些许误差。

关于第二个问题,等我用几个不同的R包分析完了再和你们商量一下。 总之,我觉得这个package整体用起来很不错,偶尔有些bug更新一下就行。非常感谢你们的付出! cox analysis UCSC_COX

ESCA survival ESCA_survival

COAD survival COAD_survival

BRCA survival BRCA_survival

ShixiangWang commented 2 years ago

@zxhu1987

首先不同cancer type的处理肯定是一致的,底层是同一个代码,并没有cancer type specific的处理。Cox和survival是作为2个不同的功能模块实现的,使用的数据一致,但处理的代码略有不同,Cox是利用已有的全部数据,并没有任何的数据过滤。Cox report的表格样本量没有除去不能用于Cox分析的(例如最上面的例子中BRCA其实只有1097个样本,有1个样本的OS.time是NA),而survival分析中自带多个筛选器,上面也提到的,针对数据的筛选主要是有个 stage X的一些样本会被过滤掉。所以Cox的样本量显示会 >= Survival显示的。

你感兴趣的话可以查看下相关的源代码:

surv: https://github.com/openbiox/UCSCXenaShiny/blob/master/R/tcga_surv.R

cox: https://github.com/openbiox/UCSCXenaShiny/blob/169e77382fc4b02ac212cefd5f1a7c07eec916f9/R/vis_pancan_value.R#L215

zxhu1987 commented 2 years ago

@ShixiangWang 你说的这些我可以理解。但是从ESCA和COAD的样本数量来看,它们的处理方式明显有区别。如果从sample_type来看, ESCA剔除了metastasis和recurrent tumor,而COAD没有。我相信你们用的同一底层代码,但是现在的结果表明肯定那个环节存在问题导致我看到filter设置不一致的结果。

ShixiangWang commented 2 years ago

@Byronxy 你有空检查下 @zxhu1987 反馈的 metastasis和recurrent tumor 过滤不一致情况,确认下是否在分析函数内还是是之前生成样本type label存在问题。

ShixiangWang commented 2 years ago

查了下用来过滤的tcga_gtex数据,发现本身可能就把一些提到的样本过滤了(ESCA只有182)。@Byronxy 你这个数据集之前是怎么生成的,data_raw没找到相关的记录。回头看看这个数据集生成是否存在一些问题,是否可以更新一下。

tcga_gtex <- load_data("tcga_gtex")
dplyr::count(tcga_gtex, tissue) %>% dplyr::filter(tissue %in% c("ESCA", "COAD"))
  tissue   n
1   COAD 290
2   ESCA 182
ShixiangWang commented 2 years ago

@Byronxy 尽快check下哈,我准备这周发布一个新的版本。

Byronxy commented 2 years ago
> dplyr::count(mRNA_exprSet, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 329
2 ESCA 195
> mRNA_exprSet_sub <- mRNA_exprSet %>%
+   filter(subtype != "11")
> dplyr::count(mRNA_exprSet_sub, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 288
2 ESCA 182

把正常组织,barcode结尾为11的去掉之后有182个

Byronxy commented 2 years ago
> dplyr::count(mRNA_exprSet, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 329
2 ESCA 195
> mRNA_exprSet_sub <- mRNA_exprSet %>%
+   filter(subtype != "11")
> dplyr::count(mRNA_exprSet_sub, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 288
2 ESCA 182

把正常组织,barcode结尾为11的去掉之后有182个 TCGA预处理的文件以及代码 链接:https://pan.baidu.com/s/14T49oUsBK9FAAxoOrGj--A 提取码:vzax --来自百度网盘超级会员V4的分享

Byronxy commented 2 years ago
> dplyr::count(mRNA_exprSet, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 329
2 ESCA 195
> mRNA_exprSet_sub <- mRNA_exprSet %>%
+   filter(subtype != "11")
> dplyr::count(mRNA_exprSet_sub, type) %>% dplyr::filter(type %in% c("ESCA", "COAD"))
  type   n
1 COAD 288
2 ESCA 182

把正常组织,barcode结尾为11的去掉之后有182个 TCGA预处理的文件以及代码 链接:https://pan.baidu.com/s/14T49oUsBK9FAAxoOrGj--A 提取码:vzax --来自百度网盘超级会员V4的分享

image

ShixiangWang commented 2 years ago

okay