Feature request: Smooth survival curve

agdamsbo commented 9 months ago

Is your feature request related to a problem? Please describe. I am working with registry based health data and stepwise survival curves are cosidered sensitive data, and I am therefor not allowed to publish such curves. Publishing a smooth plot would be the solution

Describe the solution you'd like A clear and concise description of what you want to happen. I am aware, that in most cases you would prefer not to smooth the survival line, but there are some valid cases, where this approach is desirable.

Describe alternatives you've considered The only solutions I have found was a StackOverflow with some suggested solutions, that wouldn't work.

Additional context Below is a working example of the solution I have come up with to be able to plot the line and confidence intervals.

As an example, I have included the two standard methods used for smoothing in ggplot2::geom_smooth(), but I wanted to perform the calculations separately to have control over the method selection and to use the ggplot2::geom_ribbon() for confidence intervals.

library(tidyverse)
library(survival)
library(purrr)
library(ggsurvfit)

## Data
x <- survfit2(Surv(time, status) ~ surg, data = df_colon)
df <-
  tidy_survfit(x, type = "survival") %>% dplyr::mutate(survfit = c(list(x),
                                                                   rep_len(list(), dplyr::n() - 1L)))
method <- "gam"

df_split <- split(df,df$strata)

df_smoothed <- purrr::reduce(lapply(c("estimate","conf.low", "conf.high"), function(j) {
  do.call(rbind,
          lapply(seq_along(df_split), function(i) {
            nms <- names(df_split)[i]
            if (method=="loess"){
              y <-
                predict(loess(as.formula(paste0(
                  j[[1]], " ~ time"
                )), data = df_split[[i]]))
            } else if (method=="gam"){
              y <-
                predict(mgcv::gam(as.formula(paste0(
                  j[[1]], " ~ s(time, bs = 'cs')"
                )), data = df_split[[i]]))
            }
            df <- data.frame(df_split[[i]]$time, y, nms)
            names(df) <- c("time", paste0(j[[1]], ".smooth"), "strata")
            df
          }))
}),dplyr::full_join) |> full_join(df)
#> Joining with `by = join_by(time, strata)`
#> Joining with `by = join_by(time, strata)`
#> Joining with `by = join_by(time, strata)`

## Plotting
ggplot(data=df_smoothed) + 
  geom_line(aes(x=time, y=estimate.smooth, color = strata))+
  geom_ribbon(aes(x=time, ymin = conf.low.smooth, ymax = conf.high.smooth, fill = strata), alpha = 0.50)

^{Created on 2023-09-15 with reprex v2.0.2}

ddsjoberg commented 9 months ago

Hi @agdamsbo ! Thank you for the post with the thoughtful examples!

I am certainly sympathetic to this need. But trying to think of a way to get this done with no impact for current users while not adding too many/any arguments to existing functions.

Question: if you can't show steps in your results, are you allowed to show a risktable? The risktable seems to be far more explicit with the information that a few tick marks.

agdamsbo commented 9 months ago

Hi @ddsjoberg. So, the thing is, that the regulations are quite tight (as you might have guessed). I think, there are two different use cases: data evaluation and publication ready plots, right! In the first case, everything is fine, as I am allowed to look at all the data I have access to. Regarding creating publication ready plots (or tables for that matter as with your gtsummary), I am not allowed to publish anything categorised as micro data, which includes anything, where you could pinpoint groups smaller than five. In a risk table, that would mean that eg. cumulative events smaller than five, would have to be relabeled "<5". Same with subjects at risk, and the same goes for the difference between timepoints. If 145 are at risk at two years and 142 are at risk at three years, I can't publish that. This is just to illustrate, that I would do all I can to avoid a regular risk table in my publication, and instead do a more manual less granular table instead. Having the option to create the smoothed plot would be nice though, and useful for others I would think. I realise that (one of) ggsurvfits claim to fame is the neat combination of having everything together, which this would not live up to.

And regarding the smoothing, it may create a smoothed line that slope upwards in some sections, which I have not been able to correct (it is much more profound with "loess" than "gam"). Maybe using a different smoothing function in "gam" would help.

My ultimate wish would be for a "no microdata mode" for both ggsurvfit and gtsummary. I'll just let that hang here and work on the modifications I need and share them, when ready.

Ok, this was the long answer, I think.

agdamsbo commented 9 months ago

So, I had to look into that slope direction. Turns out there's a package, cobs, which can handle this. Thanks to this and this. The possibilities with the cobs package are quite many, so there is room for a lot of tweaking, which may or may not be good for this use-case.

Updated example:

library(tidyverse)
library(survival)
library(purrr)
library(ggsurvfit)
library(cobs)

## Data
plot.type <- "survival"

x <- survfit2(Surv(time, status) ~ surg, data = df_colon)
df <-
  tidy_survfit(x, type = plot.type) %>% dplyr::mutate(survfit = c(list(x),
                                                                  rep_len(list(), dplyr::n() - 1L)))
method <- "cobs"

df_split <- split(df,df$strata)

df_smoothed <- purrr::reduce(lapply(c("estimate","conf.low", "conf.high"), function(j) {
  do.call(rbind,
          lapply(seq_along(df_split), function(i) {
            nms <- names(df_split)[i]
            x = df_split[[i]]$time
            if (method=="loess"){
              y <-
                predict(loess(as.formula(paste0(
                  j[[1]], " ~ time"
                )), data = df_split[[i]]))
            } else if (method=="gam"){
              y <-
                predict(mgcv::gam(as.formula(paste0(
                  j[[1]], " ~ s(time, bs = 'cs')"
                )), data = df_split[[i]]))
            } else if (method=="cobs") {

              if (plot.type=="survival"){
                ## This will make the plot start in (0,1)
                con <- rbind(c( 0,min(x),1))
                ## This ensures a monotonic decreasing slope 
                ## for the estimate, not the CIs
                if (j[[1]]=="estimate"){
                  direction="decrease"
                } else {direction="none"}

              } else if (plot.type=="risk"){
                con <- rbind(c( 0,min(x),0))
                if (j[[1]]=="estimate"){
                  direction="increase"
                } else {direction="none"}
              }

              m <- cobs(x,df_split[[i]][[j]], 
                        constraint=direction, 
                        nknots = 4,
                        pointwise= con,
                        degree = 2,)
              y <- predict(m, x)[, 'fit']
            }

            df <- data.frame(x, y, nms)
            names(df) <- c("time", paste0(j[[1]], ".smooth"), "strata")
            df
          }))
}),dplyr::full_join) |> full_join(df)

## Plotting
ggplot(data=df_smoothed) + 
  geom_line(aes(x=time, y=estimate.smooth, color = strata))+
  geom_ribbon(aes(x=time, ymin = conf.low.smooth, ymax = conf.high.smooth, fill = strata), alpha = 0.50)

I updated the example from my initial post, to avoid setting constraints on the CIs.

^{Created on 2023-09-22 with reprex v2.0.2}

ddsjoberg commented 9 months ago

Hello! Thanks for the update and the examples! They are super!

I don't know if you're involved at all with the pharmaceutical industry, but there is an emerging standard called Analysis Results Model, where all stats are calculated in advanced and and a standardized "Analysis Results Data" object or ARD, can be passed along to create figures.

I am developing this now for tables, with the next thing to tackle being KM figures (and figures generally). I don't have a timeline for you, but I think this will be the best place to introduce this type of smoothing. Adding native support at this moment, I feel, will open some mini flood gates of suggestions for various smoothing methods and which smoothing parameters are best, etc. Using the ARD will allow users to do whatever they want to the data first, then pass it along to ggsurvfit(). This will mean that no additional package dependencies would be added to ggsurvfit, which from a maintenance point of view, is huge.

I am actively speaking with CDISC about the ARD model for complex figures, and hopefully we'll have a more satisfying solution for you soon.

agdamsbo commented 9 months ago

Sure. Yeah, I saw you're involved with the pharmaverse. I'm doing clinical/epidemiological research at Aarhus University, Denmark. ARD and the methods around that really sounds interesting and like something I will try to keep track of. Thank you.

I perfectly understand your hesitancy. It's perfectly fair.

Now the examples are "out" for anyone to use and possibly lookup. I will think about adding them to a small extension-package of my own.

SinomeM commented 5 months ago

Hi Daniel! Tagging along this issue as I have a similar need (I also work with Danish data). I tried being very lazy and changed the geom_step to geom_smooth in the .construct_ggplot function (see here https://github.com/SinomeM/ggsurvfit/commit/ab3288124365428782953fab1dd60717d940c87a) and it seems to work. Of course is not a proper solution for the package but do you think it is OK if I use it for my needs? thanks, Simone

ddsjoberg commented 5 months ago

Looks great to me @SinomeM !

ddsjoberg commented 5 months ago

FYI @agdamsbo @SinomeM

In the last release of ggsurvfit, I exported some of the helper functions used to construct the plots. Here's an example of how you can smooth the survival curves, and bin any counts less than 5 to obscure the true numbers. Hope it helps somewhat!

library(ggsurvfit)
#> Loading required package: ggplot2

sf <- survfit2(Surv(time, status) ~ sex, data = df_lung) 

# build primary plot, smoothed
gg_survfit <-
  tidy_survfit(sf) |> 
  ggplot(aes(x = time, y = estimate, color = strata)) +
  geom_smooth(se = FALSE, method = 'loess', formula = 'y ~ x') +
  scale_x_continuous(breaks = seq(0, 30, by = 6))

# construct risktable, and hide low counts
gg_risktable <-
  tidy_survfit(sf, times = seq(0, 30, by = 6)) |> 
  dplyr::select(time, n.risk, strata) |> 
  # mask low counts
  dplyr::mutate(
    n.risk = ifelse(n.risk < 5, "<5", n.risk)
  ) |> 
  ggplot(aes(x = time, y = strata, label = n.risk)) +
  geom_text() +
  labs(y = NULL) +
  theme_classic()

# combine plots
ggsurvfit_align_plots(list(gg_survfit, gg_risktable)) |> 
  patchwork::wrap_plots(ncol = 1, heights = c(0.85, 0.15))

^{Created on 2024-01-18 with reprex v2.0.2}

SinomeM commented 5 months ago

Thank you Daniel!

agdamsbo commented 5 months ago

This really is great! Thank you very much!

pharmaverse / ggsurvfit

Feature request: Smooth survival curve #164