sinhrks / ggfortify

Define fortify and autoplot functions to allow ggplot2 to handle some popular R packages.
Other
527 stars 65 forks source link

Error with autoplot on a survfit object with multiple stratifying variables #214

Closed KellenBrosnahan closed 3 years ago

KellenBrosnahan commented 3 years ago

System information

Problem

When using autoplot on a survfit object with multiple stratifying variables, the function returns the following error:

Error in `levels<-`(`*tmp*`, value = as.character(levels)) : 
  factor level [3] is duplicated

I think this is a bug because ?autoplot.survfit does not list any restrictions on number of stratifying variables. See the Cause section for why I think it's a bug in ggfortify.

Reprex

# reprex.R
# Reproduction of autoplot bug.

# Package installations
# install.packages("survival")
# install.packages("ggplot2")
# install.packages("ggfortify")
# install.packages("asympTest")

# Load packages
library(survival)       # Survival analysis
library(ggplot2)        # Prereq for ggfortify
library(ggfortify)      # Package in question
library(asympTest)      # Contains dataset

# Load dataset
dsNames <- data(DIGdata)
dig <- DIGdata
dig$KLevelGrp <- dig$KLEVEL > 4.3

# Create survival objects
survObject <- Surv(time = dig$DWHFDAYS, event = dig$DWHF)
survCurve <- survfit.formula(survObject ~ TRTMT + KLevelGrp, data = dig)

# Reproduce bug
autoplot(survCurve)
# Error in `levels<-`(`*tmp*`, value = as.character(levels)) : 
  # factor level [3] is duplicated

Cause

I've done some digging, and I believe that I've found the source of the bug. Here's my understanding.\ When autoplot.survfit calls fortify.survfit, lines 33-34 (line numbers from getS3method("fortify", "survfit")) read:

groupIDs <- gsub(".*=", "", names(model$strata))
groupIDs <- factor(rep(groupIDs, model$strata), levels = groupIDs)

The default naming of strata by the survfit.formula function for two stratification variables is "var1=value1,var2=value2". This gsub call replaces two substrings. First, it replaces the substring "var1=" with the empty string. Then, it replaces the substring "value1,var2=" with the empty string. This causes the entire stratum name to be reduced to "value2". For multiple stratification, this causes the stratum names to overlap, e.g. "var1=0,var2=0" and "var1=1,var2=0" would both be reduced to "0". This leads to duplicate values for groupIDs, which means it is not a valid levels argument to factor.

Solution

If I understand the cause correctly, the issue is that the gsub takes the pattern ".*=", which replaces any substring before the equals sign. A potential solution would be to change the pattern to not allow commas. I believe that the R regular expression syntax for that is "[^,]*=. This would cause multiply stratified names to become comma-separated, e.g. "var1=value1,var2=value2" would reduce to "value1,value2".

terrytangyuan commented 3 years ago

Would you like to submit a PR to help fix it?

KellenBrosnahan commented 3 years ago

Hello @terrytangyuan, Yes, I would like to do that. However, I'm unfamiliar with the process, and I'm having some trouble. I was able to clone the repo and create a local branch, but when I tried to push the branch with changes (git push origin autoplotSurvfitMultipleStratification), I received the following error message:

ERROR: Permission to sinhrks/ggfortify.git denied to KellenBrosnahan.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Is this the way I'm supposed to be doing things, or am I completely lost? Thanks, Kellen

PS. The contributing guidelines mentioned unit tests, but I'm not sure what those are or how to do them. Could you explain them please?

terrytangyuan commented 3 years ago

Perhaps this might help: https://www.dataschool.io/how-to-contribute-on-github/