Feature request: Allow/retain strata fill mapping when only one level is present

RiversPharmD commented 1 year ago

Is your feature request related to a problem? Please describe.

I often loop over subsets of data to generate survival plots for sub-populations. When only one level of the stratifying variable is present in a sub-population, the formatting for that variable is lost. Because it is lost when the survival curve is fitted, changing the scale_fill_*() or scale_color_*() values does not solve the problem.

Describe the solution you'd like

A perfect solution would check whether any levels of the factor used to stratify the survival plot are missing from the survfit()/survfit2() call. If any are missing, it would automatically rebuild the fill and color mappings, as well as the legend, to contain all levels of the factor. This could be controlled by a boolean argument to the ggsurvfit() function, such as ggsurvfit(..., drop_unused_factors = TRUE)
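The check described above can be sketched in base R. This is only an illustration under stated assumptions: `unobserved_levels()` is a hypothetical helper, not part of ggsurvfit or survival, and it inspects the original factor rather than the fitted object.

```r
# Hypothetical helper (not part of ggsurvfit): list the levels that are
# declared on a stratifying factor but absent from the data at hand.
unobserved_levels <- function(x) {
  stopifnot(is.factor(x))
  observed <- as.character(unique(x[!is.na(x)]))
  setdiff(levels(x), observed)
}

# A subset in which only one of the two declared levels occurs
variant <- factor(c("TP53", "TP53", "TP53"), levels = c("TP53", "DPYD"))
missing_levels <- unobserved_levels(variant)
missing_levels
```

A non-empty result would be the signal to re-add the missing levels before the scales and legend are built.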

Describe alternatives you've considered

One alternative would be to modify the raw data directly to include a single patient with 0 survival time, but this runs the risk of contaminating downstream analyses. Another place to address the missing factor levels might be the output of the survfit() call, but I'm not familiar enough with its structure to say. Because the "problem" is associated with the visualization function, I think the cleanest solution is one that sits close to the source of the problem.

Additional context

Here's a reprex with dummy data to illustrate what I'm experiencing:

library(tidyverse)
library(ggsurvfit)

pat_id <- 1:100
vec_var <- c("TP53", "DPYD")
vec_state <- c("VT", "MN")

dat <- tibble(pat_id = pat_id) |>
  mutate(
    # First 75 patients carry TP53, the remaining 25 carry DPYD
    cat_variant = factor(if_else(pat_id <= 75, vec_var[1], vec_var[2]), levels = vec_var),
    # First 50 patients are in VT, the remaining 50 in MN
    cat_state = factor(if_else(pat_id <= 50, vec_state[1], vec_state[2]), levels = vec_state),
    # Simulated follow-up time and event indicator, drawn once per patient
    days_survived = runif(n(), 100, 500),
    status = rbinom(n(), 1, 0.25)
  )

surv_out <- list()
for (i in seq_along(vec_state)) {
  state <- vec_state[i]
  dat_loop <- dat |>
    filter(cat_state == state)

  surv_loop <- survfit2(Surv(days_survived, status) ~ cat_variant, data = dat_loop)
  surv_plot <- surv_loop |>
    ggsurvfit() +
    add_risktable(
      risktable_stats = c("n.risk", "cum.censor", "cum.event"),
      risktable_group = "risktable_stats"
    ) +
    add_risktable_strata_symbol() +
    scale_y_continuous(
      limits = c(0, 1),
      labels = scales::percent,
      expand = c(0.01, 0)
    ) +
    add_quantile(y_value = 0.5) +
    theme_minimal() +
    theme(legend.position = c(0.85, 0.85)) +
    guides(color = guide_legend(ncol = 1))
  surv_out[[i]] <- surv_plot
}
surv_out[[1]]

surv_out[[2]]

Created on 2023-06-21 with reprex v2.0.2

ddsjoberg commented 1 year ago

Thanks for the post, @RiversPharmD.

As you noted, it's a bit tricky because the survfit() function removes the unobserved levels before any function from the ggsurvfit package is called.

Can you provide more details on what you want to see when there are unobserved levels? Are you looking to have the colors match when there is only one level and when there is more than one level? Do you expect the unobserved levels to appear in the legend? In the risktable?

RiversPharmD commented 1 year ago

Of course! I don't necessarily expect this to be fixed by y'all, but I would love it.

My naive exploration of this issue suggests two phenotypes, one that occurs when we go from n levels to n-1 levels where n>2, and one where we go from two levels to one. In the general case, I think the end user can maintain the level formatting in the legend using scale_*_discrete calls with drop = FALSE and/or by playing with/inserting breaks. I think that use case is relatively trivial for someone who's worked with ggplot, and don't think this package needs to address it.
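For the general case, one way to keep colors and legend entries stable is a named palette plus explicit limits. The following is a minimal sketch with plain ggplot2 on made-up data, not on a ggsurvfit object; the palette colors and the `dat_sub` data frame are assumptions for illustration, and whether the same scale behaves identically on top of ggsurvfit() is not verified here.

```r
library(ggplot2)

# One fixed colour per level, keyed by level name (colours are arbitrary)
pal <- c(TP53 = "#1B9E77", DPYD = "#D95F02")

# Made-up subset in which only one of the two levels is observed
dat_sub <- data.frame(
  time    = c(0, 100, 200, 300),
  surv    = c(1, 0.9, 0.7, 0.6),
  variant = factor(rep("TP53", 4), levels = names(pal))
)

p <- ggplot(dat_sub, aes(time, surv, colour = variant)) +
  geom_step() +
  # Named values pin each level to its colour; limits keep every level
  # in the scale even when it is absent from the data
  scale_colour_manual(values = pal, limits = names(pal), drop = FALSE)

# Colour assigned to the drawn layer
used_colours <- unique(layer_data(p)$colour)
```

Because the values are matched by name, TP53 keeps the same color whether or not DPYD is present in a given subset.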

For the specific case where we go from 2 levels to 1, here is what I would selfishly most like solved, in rank order:
1) Consistent color matching and legend labels regardless of which levels are absent or present in the figure
2) The ability to add a 0 row to the risk table, without formatting/color appearing for it
3) Formatting retained for the risk table

Happy to find time to chat if that's easier.

ddsjoberg commented 1 year ago

I think this kind of update would be a lot of work with many edge cases that would need to be accounted for.

Just an FYI, I don't have immediate plans to look into this, but I'll leave the issue open for review in the future.

Couple of notes for myself:

We would need the user to use survfit2() so we can always access the initial data frame.
We would need to add an argument to ggsurvfit() to not drop unused levels (which, in reality, would mean adding the unobserved levels of the factor to the strata column returned by the tidier).
There are MANY places in the package internals that do a quick predicate check of whether the model is stratified by looking for a strata column. Every check would need to be updated to also check whether the strata column has more than one level, when appropriate.
The same updates would be needed for ggcuminc().
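The two central notes (re-adding unobserved levels to the strata column, and tightening the "is stratified" check) might look roughly like the base R sketch below. Every name here is hypothetical: `restore_strata_levels()` and `is_stratified()` are not ggsurvfit functions, and `tidied` merely stands in for a tidied survfit data frame with a strata column.

```r
# Hypothetical helpers; none of these names exist in ggsurvfit.

# Re-attach the full set of factor levels to the strata column
restore_strata_levels <- function(tidied, all_levels) {
  tidied$strata <- factor(as.character(tidied$strata), levels = all_levels)
  tidied
}

# Treat the model as stratified only if more than one stratum is observed,
# not merely because a strata column exists
is_stratified <- function(tidied) {
  "strata" %in% names(tidied) &&
    length(unique(stats::na.omit(as.character(tidied$strata)))) > 1
}

# Stand-in for tidier output where only one level was observed
tidied <- data.frame(
  time     = c(10, 20, 30),
  estimate = c(0.9, 0.8, 0.7),
  strata   = "TP53"
)
tidied <- restore_strata_levels(tidied, c("TP53", "DPYD"))

levels(tidied$strata)   # both levels are declared
is_stratified(tidied)   # only one of them is observed
```

Separating "levels declared" from "levels observed" is the design point: the scales could use the former while risk tables and other internals keep relying on the latter.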

RiversPharmD commented 1 year ago

No stress @ddsjoberg, I appreciate you taking the time to consider this feedback/feature.

karl-an commented 10 months ago

I have a request similar to the one here. I have two grouping variables (with two levels each) that I want to use. This is possible in the formula, but I cannot assign different aesthetics to the grouping variables later (e.g., color and linetype) because everything gets merged in the tidy step. I ended up building a ggplot from scratch, but it would be nice to be able to do this out of the box.
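One possible workaround when building the plot by hand is to split the merged strata label back into one column per grouping variable. The sketch below is base R only and rests on an assumption: that the labels have the form "name=value" pairs separated by ", " (as in the names of a survfit object's strata component). `split_strata()` is a hypothetical helper, and the labels produced by ggsurvfit's tidier may be formatted differently.

```r
# Hypothetical helper: split labels such as "sex=1, arm=B" into one column
# per grouping variable. Assumes values contain neither "=" nor ", ".
split_strata <- function(labels) {
  pairs <- strsplit(labels, ", ", fixed = TRUE)
  rows <- lapply(pairs, function(p) {
    kv <- strsplit(p, "=", fixed = TRUE)
    vals <- vapply(kv, `[`, character(1), 2)         # right-hand sides
    names(vals) <- vapply(kv, `[`, character(1), 1)  # variable names
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

parts <- split_strata(c("sex=1, arm=A", "sex=1, arm=B", "sex=2, arm=A"))
parts
```

The resulting columns could then be bound to the tidied data and mapped separately, for example one to color and the other to linetype.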

pharmaverse / ggsurvfit

Feature request: Allow/retain strata fill mapping when only one level is present #155