ucscXena / ucsc-xena-client

Functional genomics browser
Apache License 2.0
57 stars 42 forks source link

double check KM stats using R #408

Open jingchunzhu opened 5 years ago

jingchunzhu commented 5 years ago

double check the KM stats reported here: https://xenabrowser.net/?bookmark=ac5d07ff142ac9467b08faae2545659d using R

Gauravsahadev commented 5 years ago

@jingchunzhu Can I work on this issue?

XD7479 commented 5 years ago

Hi,@jingchunzhu I'm a junior student in Tongji University, China. I'm applying for Xena GSoc 2019. I have been going through the xena project and learning GDC APIs to prepare for the idea about Update GDC data ingestion pipeline and run.

For this issue, I was trying to calculate the stats using R, here is my report. Somehow my results using R is not identical with the results on Xena, I'm not sure whether I did it correctly. Could you kindly review my report below? Please let me know if I missed anything, thanks in advance.

  1. Download required data from Xena to local PC
wget https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz .
wget https://pancanatlas.xenahubs.net/download/Survival_SupplementalTable_S1_20171025_xena_sp.gz . 
  1. Calcualte the log-rank stats
#!/usr/bin/env Rscript
#' @author: Xiaoding Yuan
#' @R version: 3.5.1

############ Load Package ####################
library('data.table')
library('survival')
############ Function Definition ####################
getSurvInfo = function(cnv,clin,duration='OS',query_gene='AMY1A',tumor_type='P')
{
  cnv_info <- t(cnv[Sample==query_gene,-1])
  cnv_info <- data.table(sample=rownames(cnv_info),cnv_info)
  # select samples based on the tumor type
  if(!(tumor_type %in% c('P','M','A'))) stop('A(all, included duplicates), P(primary) and M(metastasis)are available for tumor_type.')
  clin <- switch(tumor_type,
                'P' = clin[grep('01$',sample)],
                'M' = clin[grep('06$',sample)],
                'A' = clin
                )
  # merge data
  setkey(cnv_info,sample)
  setkey(clin,sample)
  cnv_clin <- as.data.frame(merge(cnv_info,clin)[,c(duration,paste(duration,'time',sep='.'),'V1'),with=F])
  cnv_clin <- na.omit(cnv_clin)
  # curate surv obj
  sur <- Surv(cnv_clin[,2], cnv_clin[,1])
  cnv_clin[,3] <- sapply(cnv_clin[,3], function(x) ifelse(x>=median(cnv_clin[,3]),1,0))
  # log-rank test
  statistics <- survdiff(sur~., data=cnv_clin[,3,drop=F],rho=0)
  return(statistics)
}

############ Main ####################
# load data
cnv <- fread('Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz')
clin <- fread('Survival_SupplementalTable_S1_20171025_xena_sp.gz')