jkropko / coxed

Duration-Based Quantities of Interest and Simulation Methods for the Cox Proportional Hazards Model
23 stars 6 forks source link

when only one time-varying categorical variable is involved, a problem appear in the simulation #3

Closed yellowbridge closed 3 years ago

yellowbridge commented 3 years ago

Dear Prof. Kropko, Jonathan,

Thank you very much for developing this fantastic R package to simulate survival data. I am studying the proportion hazard assumption and extended Cox regression with time-dependent variable.

After learning your tutorial 'https://cran.r-project.org/web/packages/coxed/vignettes/simulating_survival_data.html', I try to build a scenario as below. In this scenario, a dataset with 1, 000 observations and one time-varying categoriacal variable are expected.

beta.out <- data.frame(beta1 = c(rep(-0.5, 50), rep(0.5, 50)))
X <- data.frame(group=sample(c(0,1),prob=c(0.3,0.7),replace=TRUE,1000))
simdata <-
  sim.survdata(
    X = X,    T = 100,
    type = "tvbeta",    beta = beta.out,
    num.data.frames = 1,    xvars = 1
  )

However, I can not pass the error as below. the error implies that data <- dplyr::mutate(data, y = lifetimes) encounter probleme. Estimated Y do not have the same length as the row of the data.

  Error: Problem with `mutate()` input `y`.
  x Input `y` can't be recycled to size 1000.
  i Input `y` is `lifetimes`.
  i Input `y` must be size 1000 or 1, not 100000.
  Run `rlang::last_error()` to see where the error occurred.

So I review the source code and rewrite related steps based on the aforementioned code.

#parameters

N <- 1000
beta.out <- data.frame(beta1 = c(rep(-0.5, 50), rep(0.5, 50)))
X <- data.frame(group=sample(c(0,1),prob=c(0.3,0.7),replace=TRUE,N))

baseline <- baseline.build(T = 100, 
                           knots = 8, spline = TRUE)
T <- max(baseline$time)

beta.mat1 <- data.frame(time = 1:T, one = 1)
beta.mat2 <- data.frame(t(beta.out), one = 1)
beta.mat <- merge(beta.mat1, beta.mat2, by = "one")
beta.mat <- dplyr::select(beta.mat, -one)
colnames(beta.mat) <- gsub(pattern = "X", replacement = "beta", 
                           colnames(beta.mat))
beta.mat <- dplyr::mutate(beta.mat, beta1 = beta1 * 
                            log(time))
beta.mat2 <- dplyr::select(beta.mat, -time)
XB <- apply(as.matrix(beta.mat2), 1, FUN = function(b) {
  as.matrix(X) %*% b
})
survival <- t(apply(XB, 1, FUN = function(x) {
  baseline$survivor^exp(x)
}))
survival <- cbind(1, survival)
lifetimes <- apply(survival, 1, FUN = function(x) {
  z <- diff(x < runif(1))
  r <- ifelse(all(z == 0), T, which.max(z))
  return(r)
})
data <- data.frame(X)
data <- dplyr::mutate(data, y = lifetimes)

data$failed <- !(runif(N) < 0.3)

The same error appears. After checking each steps, I find the beta.mat2 <- dplyr::select(beta.mat, -time) produce a matrix with 100 rows and 100 columns, and the XB <- apply(as.matrix(beta.mat2), 1, FUN = function(b) { as.matrix(X) %*% b}) generate a matrix with 100,000 rows which was greater than the expected value 1000.

I'm stuck in this problem and I sincerely hope to get your help.

Merry Christmas and Best wishes to you!

jkropko commented 3 years ago

Issue appears to be resolved as of version 0.3.7 with the rewrite of the type="tvbeta" functionality of sim.survdata()