jgellar / pcox

Penalized Cox regression models
1 stars 0 forks source link

Singularities #17

Closed jgellar closed 9 years ago

jgellar commented 9 years ago

I just tried to fit the nonlinear historical Cox model for the first time, i.e., $\log h_i(t) = \log h_0(t) + \int \beta(s, t, X_i(s))ds$. Note that in mgcv syntax, this term is converted to: s(smat, tmat, xmat, by=L). The pcox syntax for fitting the model is the following: pcox(Surv(time,event) ~ male + hf(myX, sind = sind2, dbug=TRUE, linear = F). Note that by default mgcv uses a thin-plate regression spline basis with dimension 110 for the 3-dimensional $\beta(s,t,x)$.

The model fit, but I got warning about singularities: X matrix deemed to be singular; variable 102 103 104 105 106 107. I then tried lowering the basis dimension to 50, but a similar result: X matrix deemed to be singular; variable 42 43 44 45 46 47. I tried 25 for one last shot, but got another 6 columns deemed singular. Note that since I'm using thin-plate regression splines, the columns of the design matrix aren't really interpretable.

Do you think this means that there just isn't enough information to estimate the model? Any suggestions?

One thing I was thinking was to try some transformations of the X_i(t) data, similar to what was done in the FGAM paper with the quantile transformation. Do you think this would be helpful? If so, I'd like to discuss some of the mechanics. The FGAM paper does the transformation at every t. We have two time indices (s and t), so the analogous method for our model would be to do the transformation conditional on both indices in order to "fill the whole space". The problem is there is a lot less data high values of t, so this would cause issues.

fabian-s commented 9 years ago

could you post a reproducible example pls? (see https://gist.github.com/hadley/270442)

jgellar commented 9 years ago

Here ya go. This is copied/pasted from pcoxTesting2.R.

library(devtools)
dev_mode()
load_all() # or however you want to load pcox
library(survival)
library(mgcv)
library(plyr)
library(reshape2)

N <- 500
J <- 100
sind <- 1:J
X <- genX(N, seq(0,1,length=J))
beta <- makeBetaMat(J, genBeta1)

eta  <- sapply(1:J, function(j) {
  .75*male + X[,1:j,drop=F] %*% beta[j,1:j]
})
Xdat <- data.frame(myX=I(X), male=male)
data5 <- simTVSurv(eta, Xdat)
sind2 <- 1:ncol(data5$myX)

fit5.2 <- pcox(Surv(time,event) ~ male + hf(myX, sind = sind2, dbug=TRUE, linear = F),
                                      data=data5)
fabian-s commented 9 years ago

example not working for me. please do post something actually reproducible next time to avoid time wasting. :(

male is undefined, if i add

male <- rbinom(N, size = 1, prob=.5)

after finding that line in pcox.testing2.R i get

> fit5.2 <- pcox(Surv(time,event) ~ male + hf(myX, sind = sind2, dbug=TRUE, linear = F),
+   data=data5)
Error in pcox(Surv(time, event) ~ male + hf(myX, sind = sind2, dbug = TRUE,  : 
  could not find function "modify_call"

so i look for "modify_call" and find that i need to load library("pryr").

then after i do that i get

> fit5.2 <- pcox(Surv(time,event) ~ male + hf(myX, sind = sind2, dbug=TRUE, linear = F),
+  data=data5)
  Error in uniquecombs(X) : NA/NaN/Inf in foreign function call (arg 1) 
> traceback()
10 uniquecombs(X) 
9 ExtractData(object, data, knots) 
8 smooth.construct3(object, data, knots) 
7 smoothCon(eval(as.call(newcall)), data = evaldat, knots = NULL, 
    absorb.cons = TRUE) at pcoxTerm.R#96
6 pcoxTerm(data, limits, linear, tv, basistype, sind, integration, 
    standardize, domain, basisargs, method, eps, smooth, t) at create.tt.func.R#78
5 (tt[[i]])(mf[[timetrans$var[i]]], Y[, 1], strats, weights) 
4 coxph(formula = Surv(time, event) ~ male + tt(term2), data = pcoxdata, 
    x = TRUE, tt = tt.funcs, na.action = function (object, ...) 
    object, iter.max = 100, outer.max = 50) 
3 eval(expr, envir, enclos) 
2 eval(newcall) at pcox.R#246
1 pcox(Surv(time, event) ~ male + hf(myX, sind = sind2, dbug = TRUE, 
    linear = F), data = data5) 
jgellar commented 9 years ago

Oops, forgot about male. I thought I put library(pryr) at the top, but did plyr by mistake. Both of these were already loaded in my workspace, and I was trying to avoid closing it out to make sure the MWE worked.

As for the last bug, i fixed this but forgot to update the repo. Re-pull it and it should work.

Sorry about the bugs!

jgellar commented 9 years ago

Btw, the last bug is because there were NA's in the X matrix - these are all outside the "limits". I just replaced these with 0's because they don't matter.

Also note that the model takes about 5 mins to fit (on my laptop).

jgellar commented 9 years ago

Here is another case, where I am fitting a Cox model with a smooth, time-varying effect of a scalar variable (i.e., $\beta(x_i, t)$.

set.seed(1235)
N <- 500
J <- 200
x <- runif(N, 0, 2*pi)
male <- rbinom(N, size = 1, prob=.5)
ftrue <- sin(x)
eta <- matrix(ftrue + .75*male, nrow=N, ncol=J)
Xdat <- data.frame(x=x, male=male)
data2 <- simTVSurv(eta, Xdat)
fit2.2 <- pcox(Surv(time, event) ~ p(x, linear=FALSE, tv=T) +
                 male, data=data2)

Note that the true model is not time-varying, but I'm fitting it as such.

This model will fit faster than the historical model above, only about 5 seconds. I'm only getting one singularity here, but it's not going away with decreasing k.

fabian-s commented 9 years ago

seems to me like the fitting algorithm called by coxpenal.fit might be a little overzalous in declaring rank-deficiency of X, I get

evx <- svd(fit2.2$x)$d^2
evx[1]/tail(evx,1)
## [1] 362772.1
qr(fit2.2$x)$rank
## [1] 30

so a quite high, but not terrible condition number and no rank deficiency detected by the QR.

Taking the penalty into account and looking at (X'X + penalty)-for the p()-part I get:

sm <- fit2.2$pcox$smooth[[1]][[1]]
ev <- eigen(crossprod(sm$X) + sm$S[[1]])$values
ev[1]/tail(ev,1)
## [1] 218168.6

so as expected the penalty somewhat improves the condition of the problem.

misc. thoughts:

jgellar commented 9 years ago

Thanks for digging into it a little. Sorry for the delayed reply - busy couple days over here without much time for pcox.

I haven't even gotten a chance to give this a thorough look-over to see how well the coefficient function is being estimated in these cases: i'll do that soon. It was just concerning that I was getting all these warning messages and they weren't going away when I reduced the basis dimension.

I'm not really familiar with what gam.side(), nat.reparam(), etc. do... i'll look into these at some point to see if there's any way we can use these ideas to improve the design matrix.

jgellar commented 9 years ago

I'm going back through some of the old issues to try to clean some stuff up...

nat.param() and absorption of identifiability constrains should already be being taken care of - these are called within smoothCon() and some of the individual smooth.construct.XX.smooth.spec functions.

I don't think implementing the functionality of gam.side() would be possible for us. It's purpose is to apply constraints to allow over-parameterized models such as s(x) + s(z) + s(x,z) to be fit. The problem is that it requires the whole list of all smooth objects used in fitting the model, and would have to be called between the time that all the smooths are created and the time that the model is fit. Since some of the smooths are created within coxph() as tt terms, this would only be possible (as far as I can figure out) from within coxph() itself, which of course we don't have access to. So unless you can think of something else, I don't think we'll have a way of fitting this type of model formula.

If you can figure out a way to do this, re-open the issue and I'd be willing to try it out.

fabian-s commented 9 years ago

Well damn. I don't see how, other than using trace to meddle with coxph internally, but that seems quite daunting as well.

I'm worried that ignoring this might still bite us, because it might be that as soon as you have multiple time-varying terms you are in a setting where you'd need something like gam.side().....

jgellar commented 9 years ago

I tried fitting the model with two time-varying coefficients, and it fit fine (and the estimates looked pretty good).