Open jingchunzhu opened 5 years ago
@jingchunzhu Can I work on this issue?
Hi @jingchunzhu, I'm a junior at Tongji University, China, and I'm applying for Xena GSoC 2019. I have been going through the Xena project and learning the GDC APIs to prepare for the "Update GDC data ingestion pipeline and run" idea.
For this issue, I tried to calculate the stats using R; here is my report. My R results are not identical to the results on Xena, and I'm not sure whether I did it correctly. Could you kindly review my report below? Please let me know if I missed anything. Thanks in advance.
wget https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz
wget https://pancanatlas.xenahubs.net/download/Survival_SupplementalTable_S1_20171025_xena_sp.gz
#!/usr/bin/env Rscript
#' @author: Xiaoding Yuan
#' @R version: 3.5.1
############ Load Package ####################
library('data.table')
library('survival')
############ Function Definition ####################
getSurvInfo = function(cnv,clin,duration='OS',query_gene='AMY1A',tumor_type='P')
{
    # extract the query gene's row and transpose it to one value per sample (column V1)
    cnv_info <- t(cnv[Sample==query_gene,-1])
    cnv_info <- data.table(sample=rownames(cnv_info),cnv_info)
    # select samples based on the tumor type
    if(!(tumor_type %in% c('P','M','A'))) stop("tumor_type must be 'A' (all samples, duplicates included), 'P' (primary) or 'M' (metastatic).")
    clin <- switch(tumor_type,
        'P' = clin[grep('01$',sample)],
        'M' = clin[grep('06$',sample)],
        'A' = clin
    )
    # merge data
    setkey(cnv_info,sample)
    setkey(clin,sample)
    cnv_clin <- as.data.frame(merge(cnv_info,clin)[,c(duration,paste(duration,'time',sep='.'),'V1'),with=FALSE])
    cnv_clin <- na.omit(cnv_clin)
    # curate surv obj: Surv(time, event)
    sur <- Surv(cnv_clin[,2], cnv_clin[,1])
    # dichotomize at the median: 1 if >= median, 0 otherwise
    cnv_clin[,3] <- ifelse(cnv_clin[,3]>=median(cnv_clin[,3]),1,0)
    # log-rank test
    statistics <- survdiff(sur~., data=cnv_clin[,3,drop=F],rho=0)
    return(statistics)
}
############ Main ####################
# load data
cnv <- fread('Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz')
clin <- fread('Survival_SupplementalTable_S1_20171025_xena_sp.gz')
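For reference, with rho=0 and two groups, survdiff performs the standard log-rank test: at each event time it compares the observed events in one group with the number expected if both groups shared the same hazard. Below is a minimal base-R sketch of that calculation on made-up toy data. The function name logrank_two_group and the toy vectors are illustrative only and are not part of the script above.

```r
# Two-group log-rank test computed by hand (a sketch, not the survival package's code).
# time:   follow-up time per sample
# status: 1 = event, 0 = censored
# group:  0/1 group label per sample
logrank_two_group <- function(time, status, group) {
  # distinct times at which at least one event occurred
  event_times <- sort(unique(time[status == 1]))
  O1 <- 0; E1 <- 0; V <- 0
  for (tj in event_times) {
    at_risk <- time >= tj
    n  <- sum(at_risk)                                 # at risk, both groups
    n1 <- sum(at_risk & group == 1)                    # at risk, group 1
    d  <- sum(time == tj & status == 1)                # events, both groups
    d1 <- sum(time == tj & status == 1 & group == 1)   # events, group 1
    O1 <- O1 + d1
    E1 <- E1 + d * n1 / n
    # hypergeometric variance; it is zero when only one sample is at risk
    if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
  }
  chisq <- (O1 - E1)^2 / V
  list(observed = O1, expected = E1, chisq = chisq,
       p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# toy data: four samples, all with events, alternating groups
res <- logrank_two_group(time = c(1, 2, 3, 4),
                         status = c(1, 1, 1, 1),
                         group = c(0, 1, 0, 1))
print(res$chisq)
```

For two groups, the chisq value here corresponds to the (O-E)^2/V column in the survdiff output.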
Overall Survival:
All samples:
> getSurvInfo(cnv=cnv,clin=clin,duration='OS',query_gene='AMY1A',tumor_type='A')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4927 1485 1597 7.87 14.5
V1=1 5819 2006 1894 6.63 14.5
Chisq= 14.5 on 1 degrees of freedom, p= 1e-04
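The two groups above have clearly different sizes (4927 vs 5819). With a ">= median" split, that can only happen when many samples share the median value exactly, so the way ties are assigned changes group membership. Whether Xena assigns ties the same way is not stated in this thread, so treat it only as one possible source of the discrepancy. A small sketch on made-up values (split_by_median and the vector x are hypothetical):

```r
# Dichotomize a numeric vector at its median.
# ties_high = TRUE  puts values equal to the median in the high group (>=), as the script does
# ties_high = FALSE puts them in the low group (>)
split_by_median <- function(x, ties_high = TRUE) {
  m <- median(x)
  if (ties_high) as.integer(x >= m) else as.integer(x > m)
}

# toy copy-number values with several ties at the median (0)
x <- c(-0.2, 0, 0, 0, 0, 0.1, 0.3)

high_ge <- split_by_median(x, ties_high = TRUE)
high_gt <- split_by_median(x, ties_high = FALSE)

print(table(high_ge))
print(table(high_gt))
```

Running the log-rank test under both rules would show how sensitive the statistics are to this choice.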
Primary tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='OS',query_gene='AMY1A',tumor_type='P')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4701 1353 1475 10.08 18.8
V1=1 5514 1837 1715 8.67 18.8
Chisq= 18.8 on 1 degrees of freedom, p= 1e-05
Metastatic tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='OS',query_gene='AMY1A',tumor_type='M')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 171 91 88.4 0.0778 0.15
V1=1 181 94 96.6 0.0711 0.15
Chisq= 0.2 on 1 degrees of freedom, p= 0.7
Disease Specific Survival:
All samples:
> getSurvInfo(cnv=cnv,clin=clin,duration='DSS',query_gene='AMY1A',tumor_type='A')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4675 978 1104 14.5 26.7
V1=1 5537 1435 1309 12.2 26.7
Chisq= 26.7 on 1 degrees of freedom, p= 2e-07
Primary tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='DSS',query_gene='AMY1A',tumor_type='P')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4536 908 1037 16.2 30
V1=1 5330 1338 1209 13.9 30
Chisq= 30 on 1 degrees of freedom, p= 4e-08
Metastatic tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='DSS',query_gene='AMY1A',tumor_type='M')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 169 82 80.4 0.0335 0.0652
V1=1 177 85 86.6 0.0310 0.0652
Chisq= 0.1 on 1 degrees of freedom, p= 0.8
Disease Free Interval:
All samples:
> getSurvInfo(cnv=cnv,clin=clin,duration='DFI',query_gene='AMY1A',tumor_type='A')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 2549 508 514 0.0761 0.144
V1=1 2825 585 579 0.0676 0.144
Chisq= 0.1 on 1 degrees of freedom, p= 0.7
Primary tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='DFI',query_gene='AMY1A',tumor_type='P')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 2549 508 514 0.0761 0.144
V1=1 2825 585 579 0.0676 0.144
Chisq= 0.1 on 1 degrees of freedom, p= 0.7
Metastatic tumors: the number of samples is insufficient.
Progression Free Interval:
All samples:
> getSurvInfo(cnv=cnv,clin=clin,duration='PFI',query_gene='AMY1A',tumor_type='A')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4843 1646 1796 12.6 23.5
V1=1 5723 2221 2071 10.9 23.5
Chisq= 23.5 on 1 degrees of freedom, p= 1e-06
Primary tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='PFI',query_gene='AMY1A',tumor_type='P')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 4702 1543 1683 11.7 22
V1=1 5511 2053 1913 10.3 22
Chisq= 22 on 1 degrees of freedom, p= 3e-06
Metastatic tumors:
> getSurvInfo(cnv=cnv,clin=clin,duration='PFI',query_gene='AMY1A',tumor_type='M')
Call:
survdiff(formula = sur ~ ., data = cnv_clin[, 3, drop = F], rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
V1=0 172 128 130 0.0388 0.0753
V1=1 181 143 141 0.0359 0.0753
Chisq= 0.1 on 1 degrees of freedom, p= 0.8
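R prints the p-values above rounded to one significant digit, which makes a comparison with the values shown on Xena coarse. A minimal sketch that recomputes them from the (O-E)^2/V values reported in the tables above, using 1 degree of freedom. The inputs are themselves rounded to about three significant digits, so the results are approximate; the vector names are only labels.

```r
# chi-square statistics copied from the (O-E)^2/V column of the survdiff outputs above
reported_chisq <- c(
  OS_all  = 14.5,  OS_primary  = 18.8,  OS_metastatic  = 0.15,
  DSS_all = 26.7,  DSS_primary = 30,    DSS_metastatic = 0.0652,
  DFI_all = 0.144, DFI_primary = 0.144,
  PFI_all = 23.5,  PFI_primary = 22,    PFI_metastatic = 0.0753
)

# upper-tail probability of a chi-square distribution with 1 degree of freedom
p_values <- pchisq(reported_chisq, df = 1, lower.tail = FALSE)

print(signif(p_values, 3))
```

When the survdiff object is still available, applying the same pchisq call to its chisq component avoids the rounding of the printed table.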
Double-check the KM stats reported here using R: https://xenabrowser.net/?bookmark=ac5d07ff142ac9467b08faae2545659d