kassambara / survminer

Survival Analysis and Visualization
https://rpkgs.datanovia.com/survminer/
495 stars 160 forks source link

surv_cutpoint() doesn't show multiple comparisons adjusted p-value #359

Open tjbencomo opened 5 years ago

tjbencomo commented 5 years ago

I'm concerned that the cutpoint returned by surv_cutpoint() does not provide an adjusted p-value from the maxstat package. This leads users to use this cutpoint with the surv_pvalue() function which does not account for multiple comparisons. This will cause an unadjusted p-value overestimating significance to be reported. The tutorial detailing how to determine optimal cutpoints does not mention the need to adjust for multiple comparisons nor does it state that the p-value plotted with ggsurvplot(..., pval=TRUE,...) does not adjust for multiple comparisons.

surv_cutpoint() withinsurv_cutpoint.R sets maxstat to not compute a p-value at all:

for (i in 1:nvar){
    var_i <- variables[i]
    surv_data$var <- data[, var_i]
    max_stat_i <- maxstat::maxstat.test(survival::Surv(time, event) ~ var, data = surv_data,
                                      smethod = "LogRank", pmethod="none",
                                      minprop = minprop, maxprop = 1-minprop,
                                      alpha = alpha)
    res[[var_i]] <- max_stat_i
    if(progressbar) utils::setTxtProgressBar(pb, i)
  }

The maxstat package provides several different methods to approximate the adjusted p-value that is described on pages 4-5 of the maxstat vignette. Monte Carlo simulations can also be used to calculate the exact conditional p-value with pmethod='condMC'.

Expected behavior

surv_cutpoint() should return a dataframe containing the cutpoint, statistic, and adjusted p-value from the maxstat package.

Actual behavior

surv_cutpoint() only returns the cutpoint and statistic without any warning that the p-value needs to be adjusted for multiple comparisons. Documentation detailing how to use this functionality does not warn users about this issue and instead suggests using the unadjusted log rank test to compute the p-value after a cutpoint has been determined.

Steps to reproduce the problem

The difference in p-values can be seen by using the TCGA data from the tutorial on determining cutpoints mentioned above.

survminer's ggsurvplot reports p-value=.051. Maxstat reports p-value ~ 0.473.

library(RTCGA.clinical) # survival times
library(RTCGA.rnaseq) # genes' expression

survivalTCGA(BRCA.clinical, 
             HNSC.clinical) -> BRCA_HNSC.surv

expressionsTCGA(
  BRCA.rnaseq, HNSC.rnaseq,
  extract.cols = c("ABCD4|5826")) %>%
  rename(cohort = dataset,
         ABCD4 = `ABCD4|5826`) %>%
  filter(substr(bcr_patient_barcode, 14, 15) == "01") %>% 
  # only cancer samples
  mutate(bcr_patient_barcode = 
           substr(bcr_patient_barcode, 1, 12)) -> BRCA_HNSC.rnaseq

BRCA_HNSC.surv %>%
  left_join(BRCA_HNSC.rnaseq,
            by = "bcr_patient_barcode") ->
  BRCA_HNSC.surv_rnaseq

BRCA_HNSC.surv_rnaseq <- BRCA_HNSC.surv_rnaseq %>%
  filter(!is.na(cohort))
BRCA_HNSC.surv_rnaseq.cut <- surv_cutpoint(
  BRCA_HNSC.surv_rnaseq,
  time = "times",
  event = "patient.vital_status",
  variables = c("ABCD4")
)
summary(BRCA_HNSC.surv_rnaseq.cut)
plot(BRCA_HNSC.surv_rnaseq.cut, "ABCD4", palette = "npg")
BRCA_HNSC.surv_rnaseq.cat <- surv_categorize(BRCA_HNSC.surv_rnaseq.cut)

fit <- survfit(Surv(times, patient.vital_status) ~ ABCD4,
               data = BRCA_HNSC.surv_rnaseq.cat)
plot = ggsurvplot(
  fit,                     # survfit object with calculated statistics.
  risk.table = TRUE,       # show risk table.
  pval = TRUE,             # show p-value of log-rank test.
  conf.int = TRUE,         # show confidence intervals for 
  # point estimaes of survival curves.
  xlim = c(0,3000),        # present narrower X axis, but not affect
  # survival estimates.
  break.time.by = 1000,    # break X axis in time intervals by 500.
  ggtheme = theme_RTCGA(), # customize plot and risk table with a theme.
  risk.table.y.text.col = T, # colour risk table text annotations.
  risk.table.y.text = FALSE # show bars instead of names in text annotations
  # in legend of risk table
)

print(plot)
print(maxstat.test(Surv(times, patient.vital_status) ~ ABCD4, data=BRCA_HNSC.surv_rnaseq, smethod="LogRank", pmethod="condMC", B=100000))

session_info()

Session info ----------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.3 (2017-03-06)
 system   x86_64, darwin13.4.0        
 ui       RStudio (1.0.136)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2019-01-08                  

Packages --------------------------------------------------------------------------------------------------------
 package        * version      date       source                
 assertthat       0.2.0        2017-04-11 CRAN (R 3.3.2)        
 backports        1.1.2        2017-12-13 CRAN (R 3.3.2)        
 base           * 3.3.3        2017-03-07 local                 
 base64enc        0.1-3        2015-07-28 CRAN (R 3.3.0)        
 bindr            0.1          2016-11-13 CRAN (R 3.3.2)        
 bindrcpp       * 0.2          2017-06-17 CRAN (R 3.3.2)        
 broom            0.4.3        2017-11-20 CRAN (R 3.3.2)        
 cellranger       1.1.0        2016-07-27 CRAN (R 3.3.0)        
 cli              1.0.0        2017-11-05 CRAN (R 3.3.2)        
 cmprsk           2.2-7        2014-06-17 CRAN (R 3.3.0)        
 colorspace       1.3-2        2016-12-14 CRAN (R 3.3.2)        
 crayon           1.3.4        2017-09-16 CRAN (R 3.3.2)        
 data.table     * 1.10.4-3     2017-10-27 CRAN (R 3.3.2)        
 datasets       * 3.3.3        2017-03-07 local                 
 devtools         1.13.5       2018-02-18 CRAN (R 3.3.3)        
 digest           0.6.15       2018-01-28 CRAN (R 3.3.3)        
 dplyr          * 0.7.4        2017-09-28 CRAN (R 3.3.2)        
 evaluate         0.10.1       2017-06-24 CRAN (R 3.3.2)        
 exactRankTests   0.8-29       2017-03-01 CRAN (R 3.3.2)        
 forcats        * 0.3.0        2018-02-19 CRAN (R 3.3.3)        
 foreign          0.8-69       2017-06-21 CRAN (R 3.3.2)        
 ggplot2        * 2.2.1        2016-12-30 CRAN (R 3.3.2)        
 ggpubr         * 0.1.6        2017-11-14 CRAN (R 3.3.2)        
 ggsci            2.8          2017-09-30 CRAN (R 3.3.2)        
 ggthemes         3.4.0        2017-02-19 CRAN (R 3.3.2)        
 glue             1.2.0        2017-10-29 CRAN (R 3.3.2)        
 graphics       * 3.3.3        2017-03-07 local                 
 grDevices      * 3.3.3        2017-03-07 local                 
 grid             3.3.3        2017-03-07 local                 
 gridExtra        2.3          2017-09-09 CRAN (R 3.3.2)        
 gtable           0.2.0        2016-02-26 CRAN (R 3.3.0)        
 haven            1.1.0        2017-07-09 CRAN (R 3.3.2)        
 hms              0.4.1        2018-01-24 CRAN (R 3.3.3)        
 htmltools        0.3.6        2017-04-28 CRAN (R 3.3.2)        
 httr             1.3.1        2017-08-20 cran (@1.3.1)         
 jsonlite         1.5          2017-06-01 cran (@1.5)           
 km.ci            0.5-2        2009-08-30 CRAN (R 3.3.0)        
 KMsurv           0.1-5        2012-12-03 CRAN (R 3.3.0)        
 knitr            1.20         2018-02-20 CRAN (R 3.3.3)        
 labeling         0.3          2014-08-23 CRAN (R 3.3.0)        
 lattice          0.20-35      2017-03-25 CRAN (R 3.3.2)        
 lazyeval         0.2.1        2017-10-29 CRAN (R 3.3.2)        
 lubridate        1.7.1        2017-11-03 CRAN (R 3.3.2)        
 magrittr       * 1.5          2014-11-22 CRAN (R 3.3.0)        
 Matrix           1.2-12       2017-11-15 CRAN (R 3.3.2)        
 maxstat        * 0.7-25       2017-03-02 CRAN (R 3.3.2)        
 memoise          1.1.0        2017-04-21 CRAN (R 3.3.2)        
 methods        * 3.3.3        2017-03-07 local                 
 mnormt           1.5-5        2016-10-15 CRAN (R 3.3.0)        
 modelr           0.1.1        2017-07-24 CRAN (R 3.3.2)        
 munsell          0.4.3        2016-02-13 CRAN (R 3.3.0)        
 mvtnorm          1.0-6        2017-03-02 CRAN (R 3.3.2)        
 nlme             3.1-131      2017-02-06 CRAN (R 3.3.2)        
 pacman         * 0.4.6        2017-05-14 CRAN (R 3.3.2)        
 pander           0.6.1        2017-08-06 CRAN (R 3.3.2)        
 parallel         3.3.3        2017-03-07 local                 
 pbdZMQ           0.3-2        2018-01-31 CRAN (R 3.3.3)        
 pkgconfig        2.0.1        2017-03-21 CRAN (R 3.3.2)        
 plyr             1.8.4        2016-06-08 CRAN (R 3.3.0)        
 psych            1.7.8        2017-09-09 CRAN (R 3.3.3)        
 purrr          * 0.2.4        2017-10-18 CRAN (R 3.3.2)        
 R6               2.2.2        2017-06-17 cran (@2.2.2)         
 Rcpp             0.12.15      2018-01-20 CRAN (R 3.3.3)        
 readr          * 1.1.1        2017-05-16 CRAN (R 3.3.2)        
 readxl           1.0.0        2017-04-18 CRAN (R 3.3.2)        
 reshape2         1.4.3        2017-12-11 CRAN (R 3.3.2)        
 rlang            0.2.0        2018-02-20 CRAN (R 3.3.3)        
 rmarkdown        1.9          2018-03-01 CRAN (R 3.3.3)        
 rprojroot        1.3-2        2018-01-03 CRAN (R 3.3.2)        
 rstudioapi       0.7          2017-09-07 CRAN (R 3.3.2)        
 RTCGA          * 1.4.0        2016-10-18 Bioconductor (R 3.3.1)
 RTCGA.clinical * 20151101.4.0 2019-01-08 Bioconductor (R 3.3.3)
 RTCGA.rnaseq   * 20151101.4.0 2019-01-08 Bioconductor (R 3.3.3)
 rvest            0.3.2        2016-06-17 CRAN (R 3.3.0)        
 scales           0.5.0        2017-08-24 CRAN (R 3.3.2)        
 splines          3.3.3        2017-03-07 local                 
 stats          * 3.3.3        2017-03-07 local                 
 stringi          1.1.6        2017-11-17 cran (@1.1.6)         
 stringr        * 1.3.0        2018-02-19 CRAN (R 3.3.3)        
 survival       * 2.41-3       2017-04-04 CRAN (R 3.3.2)        
 survminer      * 0.4.2        2018-01-31 CRAN (R 3.3.3)        
 survMisc       * 0.5.4        2016-11-23 CRAN (R 3.3.2)        
 tibble         * 1.3.4        2017-08-22 CRAN (R 3.3.2)        
 tidyr          * 0.7.2        2017-10-16 CRAN (R 3.3.2)        
 tidyselect       0.2.4        2018-02-26 CRAN (R 3.3.3)        
 tidyverse      * 1.2.1        2017-11-14 CRAN (R 3.3.3)        
 tools            3.3.3        2017-03-07 local                 
 utils          * 3.3.3        2017-03-07 local                 
 viridis          0.5.0        2018-02-02 CRAN (R 3.3.3)        
 viridisLite      0.3.0        2018-02-01 CRAN (R 3.3.3)        
 withr            2.1.1        2017-12-19 CRAN (R 3.3.2)        
 XML              3.98-1.10    2018-02-19 CRAN (R 3.3.3)        
 xml2             1.1.1        2017-01-24 CRAN (R 3.3.2)        
 xtable           1.8-2        2016-02-05 CRAN (R 3.3.0)        
 yaml             2.1.16       2017-12-12 CRAN (R 3.3.2)        
 zoo              1.8-1        2018-01-08 CRAN (R 3.3.2)  
MarcinKosinski commented 5 years ago

Thanks @tjbencomo for the issue. If you look again on the tutorial you've mentioned, which is actually a blog post, you can find first comment from 9 months ago that asks about the same thing.

surv_cutpoint seeks for the cutpoint based on the biggest/lowest value of the logrank statistics and not any statistical test. surv_pvalue as intended to return the pvalue of the single test, since it doesn't know how many tests have you performed in your analysis at all.

One can calculate the pvalue he/she needs and put the direct value to the ggsurvplot function.

On the other hand scientist tend to put the output of summary(lm) or summary(glm) not mentioning those p values are also there not adjusted for multiple tests since there are multiple variables.

tjbencomo commented 5 years ago

Hi @MarcinKosinski thanks for the response. Two questions: 1) The comment refers to multiple biomarkers to test. Doesn't this mean testing multiple variables simultaneously to evaluate if any of the variables are significant biomarkers? I'm referring to making multiple tests on a single variable. 2) So what pvalue would you recommend using? Isn't it necessary to use maxstat's pvalue as maxstat finds the cutpoint with the best logrank statistic. Without maxstat's pvalue approximation, the logrank test will be biased as it doesn't know multiple tests were performed to find the cutpoint being compared.

MarcinKosinski commented 5 years ago

About 2) you don't need any test to find the value that would maximise the logrank test statistic.

About 1) the person asks about multiple biomarkers but mentions multiple testing at the cutpoint selection stage. I would adjust p-value for logrank tests if there are multiple of them performed because you perform one for every considered biomarker.

I wouldn't adjust the single pvalue for a single variable just because the cutpoint came from a method that maximises the logrank statistic but actually doesn't deal the test as the test is not needed in my opinion for the cutpoint selection.

Very interesting discussion

kassambara commented 5 years ago

Hi @tjbencomo,

As stated by @MarcinKosinski, surv_cutpoint() seeks only for the optimal cutpoint.

There is no need to compute any p-value approximations to select the optimal outpoint. This is also why the default value of the pmethod option in the function maxtat() is "none".

I would'n also adjust the single pvalue.

tjbencomo commented 5 years ago

Hi @kassambara, I think my confusion comes down to how maxstat and the log rank test are different. Maxstat's vignette explains maximally select rank statistics as:

The functional relationship between a quantitative or ordered predictor X and a quantitative, ordered or censored response Y is unknown. As a simple model one can assume that an unknown cutpoint µ in X determines two groups of observations regarding the response Y : the first group with X-values less or equal µ and the second group with X-values greater µ. A measure of the difference between two groups with respect to µ is the absolute value of an appropriate standardized two-sample linear rank statistic of the responses. The hypothesis of independence of X and Y can be formulated as H0 : P(Y ≤ y|X ≤ µ) = P(Y ≤ y|X > µ) for all y and µ ∈ R. This hypothesis can be tested as follows. For every reasonable cutpoint µ in X (e.g. cutpoints that provide a reasonable sample size in both groups), the absolute value of the standardized two-sample linear rank statistic |Sµ| is computed. The maximum of the standardized statistics of all possible cutpoints is used as a test statistic for the hypothesis of independence above. The cutpoint in X that provides the best separation of the responses into two groups, i.e. where the standardized statistics take their maximum, is used as an estimate of the unknown cutpoint.

To me, it sounds like maxstat is testing whether the variable X is related to the censored survival variable Y. The null hypothesis is that the values of Y (survival time) are equivalent for the two groups determined by the cutpoint. If the pvalue maxstat reports is significant, then there exists a survival difference between the groups. This sounds very similar to the log rank test which according to Wikipedia:

The logrank test statistic compares estimates of the hazard functions of the two groups at each observed event time ... the null hypothesis (of the two groups having identical survival and hazard functions)

I interpreted this to mean maxstat and the log rank test are comparing very similar (if not the same) hypotheses. Maxstat mentions it uses the absolute value of an appropriate standardized two-sample linear rank statistic of the responses to measure the difference between the groups. I believe this is the log rank statistic as specified by smethod=LogRank in the survminer code. The pvalue reported by maxstat is reporting whether there is a significant survival difference between the two groups stratified by the maximum cutpoint. According to my interpretation, Maxstat's pvalue should be used because computing a test statistic for each cutpoint inflates your Type-I error, increasing the chance of false positives.

Does my interpretation make sense? Although you may not need to compute a pvalue to find the optimal cutpoint, to make any conclusions regarding the hypothesis, it seems you need to account for all the test statistics you calculated to find the maximum test statstic.

MarcinKosinski commented 5 years ago

But the test statistic is a formula. From the formula one can guess the optimal cutpoint. Taking all cutpoints to verify which gives the biggest value of the tests statistic is the blind brute force.

That's still an interesting story whether one should adjust the p-value for a test statistic the moment there have been other statistics calculated. If they were calculated.

None the less, you know you shouldn't dichotomize the continuous variable, and instead take it as a continuous one into the proper statistical model? Problems caused by the categorization are widely spread http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

I would recommend putting proper functional form of the variable into the Cox model, if assumptions are met, or into a simple parametric Accelerated Failure Time model, instead of treating the estimates of the survival curve as the final research. That's just the start into the analysis.

sob., 12 sty 2019, 05:30: Tomas Bencomo notifications@github.com napisał(a):

I think my confusion comes down to how maxstat and the log rank test are different. Maxstat's vignette explains maximally select rank statistics as:

The functional relationship between a quantitative or ordered predictor X and a quantitative, ordered or censored response Y is unknown. As a simple model one can assume that an unknown cutpoint µ in X determines two groups of observations regarding the response Y : the first group with X-values less or equal µ and the second group with X-values greater µ. A measure of the difference between two groups with respect to µ is the absolute value of an appropriate standardized two-sample linear rank statistic of the responses. The hypothesis of independence of X and Y can be formulated as H0 : P(Y ≤ y|X ≤ µ) = P(Y ≤ y|X > µ) for all y and µ ∈ R. This hypothesis can be tested as follows. For every reasonable cutpoint µ in X (e.g. cutpoints that provide a reasonable sample size in both groups), the absolute value of the standardized two-sample linear rank statistic |Sµ| is computed. The maximum of the standardized statistics of all possible cutpoints is used as a test statistic for the hypothesis of independence above. The cutpoint in X that provides the best separation of the responses into two groups, i.e. where the standardized statistics take their maximum, is used as an estimate of the unknown cutpoint.

To me, it sounds like maxstat is testing whether the variable X is related to the censored survival variable Y. The null hypothesis is that the values of Y (survival time) are equivalent for the two groups determined by the cutpoint. If the pvalue maxstat reports is significant, then there exists a survival difference between the groups. This sounds very similar to the log rank test which according to Wikipedia:

The logrank test statistic compares estimates of the hazard functions of the two groups at each observed event time ... the null hypothesis (of the two groups having identical survival and hazard functions)

I interpreted this to mean maxstat and the log rank test are comparing very similar (if not the same) hypotheses. Maxstat mentions it uses the absolute value of an appropriate standardized two-sample linear rank statistic of the responses to measure the difference between the groups. I believe this is the log rank statistic as specified by smethod=LogRank in the survminer code. The pvalue reported by maxstat is reporting whether there is a significant survival difference between the two groups stratified by the maximum cutpoint. According to my interpretation, Maxstat's pvalue should be used because computing a test statistic for each cutpoint inflates your Type-I error, increasing the chance of false positives.

Does my interpretation make sense? Although you may not need to compute a pvalue to find the optimal cutpoint, to make any conclusions regarding the hypothesis, it seems you need to account for all the test statistics you calculated to find the maximum test statstic.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kassambara/survminer/issues/359#issuecomment-453718717, or mute the thread https://github.com/notifications/unsubscribe-auth/AGdazkVR4R6j7vfEhab6qyTmf7OmNyX9ks5vCWTTgaJpZM4Z1HfX .

tjbencomo commented 5 years ago

The maxstat documentation states that they are not guessing the optimal cutpoint, but finding a cutpoint through brute force methods that examine every cutpoint. The hypothesis test conducted by maxstat (and its reported p-value) are critical to conclude whether or not the variable X has any effect on Y. Has survminer been published in a peer reviewed journal you could point me towards to better understand your methodology?

I agree that more extensive survival analysis should be used to make final conclusions, I'm just afraid that the current survminer implementation does not notify users that they need to account for multiple comparisons when using the optimal cutpoint.

pdelangen commented 2 years ago

Hi, I was trying to find a tool to perform to identify this maximal separation cut point, and I arrived to the same conclusions as the author of this issue. The way most users of survminer are going to use this is to first find the cutpoint then compute the pvalue. However this will inflate the test statistic, and it will not have the expected 5% of false positives under a random distribution for a p-value cutoff of 0.05, because you are testing multiple cutpoints (with multiple p-values) and selecting the best one. I strongly advise the authors and the users of this package to use the maxstat pvalue which uses methodologies for correcting the p-value adapted to this particular case. Edit : To illustrate the issue I compared the distribution of p-values from maxstat with the condMC method vs selecting the cutpoint first then computing the logrank p-value on a random dataset. (This is in python for convenience using rpy2) random_dist_surv.pdf As you can see on the jupyter report the condMC method correctly returns an uniform distribution, and the p-values are completely incorrect when selecting the cutpoint first then computing the p-value without any adjustment.

import numpy as np
import pandas as pd
from rpy2.robjects import r, pandas2ri
import rpy2.robjects as ro
from lifelines.statistics import logrank_test
from rpy2.robjects.packages import importr
pandas2ri.activate()
maxstat = importr("maxstat")
survival = importr("survival")
survminer = importr("survminer")

n = 1000
allPvalsMaxRank = []
allPvalsLogRank = []
for i in range(n):
    # Random values, time to event, and events
    df = pd.DataFrame()
    df["Val"] = np.random.normal(0.0,1.0,100)
    df["TTE"] = np.random.randint(1,1000,100)
    df["Event"] = np.random.randint(0,2,100).astype(bool)
    r_dataframe = ro.conversion.py2rpy(df)
    fml = ro.r(f"Surv(TTE, Event) ~ Val")
    # Compute cutoff point and max rank p-value
    mstat = maxstat.maxstat_test(fml, data=r_dataframe, smethod="LogRank", pmethod="condMC")
    pval = mstat.rx2('p.value')[0]
    allPvalsMaxRank.append(pval)
    # P-value after cutoff selection
    gr1 = df["Val"] < mstat.rx2("estimate")[0]
    gr2 = np.logical_not(gr1)
    allPvalsLogRank.append(logrank_test(df["TTE"][gr1], df["TTE"][gr2], df["Event"][gr1], df["Event"][gr2]).p_value)
import matplotlib.pyplot as plt
plt.figure(dpi=300)
plt.title("Distribution of Monte carlo max rank p-values on a random dataset")
plt.hist(allPvalsMaxRank, 20, density=True)
plt.show()
plt.figure()
plt.figure(dpi=300)
plt.title("Distribution of p-values computed after cutoff selection on a random dataset")
plt.hist(allPvalsLogRank, 20, density=True)
plt.show()