jgellar / pcox

Penalized Cox regression models

A thought... simplifying the logic using tt functions #7

Closed jgellar closed 9 years ago

jgellar commented 9 years ago

Hi guys,

I've been playing around with the historical Cox model. I've realized that the code used in a lot of the post-estimation steps (extracting the coefficient function, getting predictions, etc.) is very similar to the code contained in the tt function. In particular, all of these functions require setting up the data in exactly the same way, so that it's interpretable by mgcv. The main differences are:

There are probably some other differences that we'll run into, but there are a lot more commonalities. It has occurred to me that we could save ourselves a lot of if/then logic in the methods if, instead of the methods having to determine the type of term (e.g., whether it's a historical Cox term, a scalar linear time-varying term, a concurrent time-varying covariate, etc.), we just save the tt functions themselves. The tt constructor function can have an argument sm, an optional smooth object that defaults to NULL. Or it can just do sm <- NULL somewhere in the construction. If is.null(sm), then the tt function will call smoothCon() and then pterm(). If a smooth object is supplied, then the tt function calls PredictMat() instead. After we create the tt function and use it to fit the model, we can edit its environment to add the appropriate smooth object.
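To illustrate the mechanic (a toy example, not pcox code): the environment of a closure can be edited after the function is created, which is what lets us swap the smooth object in after fitting:

f <- local({
  sm <- NULL
  function() if (is.null(sm)) "fitting branch" else "prediction branch"
})
f()                          # "fitting branch"
environment(f)$sm <- list()  # stand-in for the smooth object
f()                          # "prediction branch"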

This makes designing the methods MUCH simpler. All you do is find the appropriate tt function, and then feed it the correct x.var and t.var. We don't have to figure out what type of term it is, because that is already built into the constructor function. In fact, we could do the same thing for terms that don't even require a tt term. For example, for a simple scalar term in the model, the tt constructor would just create a function like:

tt.func <- function(x.var, ...) x.var

That doesn't even use the t.var argument. It's obviously not necessary to do this when fitting the model, but again it would eliminate the need for the post-estimation methods to decide between tt and non-tt terms.
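Wrapped in the same constructor pattern, that could look something like this (create.tt.scalar is a hypothetical name):

create.tt.scalar <- function() {
  # Identity "tt" function: ignores t.var, so scalar terms expose the
  # same interface as real time-transform terms
  function(x.var, t.var, ...) x.var
}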

Here's an example of the constructor function I'm using for the historical Cox terms:

create.tt.func <- function(divide.by.t=FALSE, limits=NULL,
                           method=c("aic", "caic", "epic"),
                           integration=c("trapezoidal", "simpson", "riemann"),
                           eps=1e-6, dbug=FALSE) {
  method <- match.arg(method)
  integration <- match.arg(integration)

  if (is.null(limits))
    # Default
    limits <- "s<=t"

  lag <- Inf  # full window by default; overwritten below if limits gives a numeric lag
  if (!is.function(limits)) {
    if (is.numeric(limits)) {
      # limits specifies a lag for inclusion
      lag <- limits
      limits <- function(s, t) {s<=t & s>=(t-lag)}
    } else if (limits == "s<t") {
      limits <- function(s, t) {s < t}
    } else if (limits == "s<=t") {
      limits <- function(s, t) {(s <= t)}
    } else {
      stop("supplied <limits> argument unknown")
    }
  }
  sm.in <- NULL

  myfunc <- function(x.var, t.var, ...) {
    # Expand t.var and the within-function index 1:J into n x J matrices,
    # the layout mgcv expects for the smooth over (s, t)
    n <- nrow(x.var)
    J <- ncol(x.var)
    tmat <- matrix(t.var, nrow=n, ncol=J)
    smat <- matrix(1:J, nrow=n, ncol=J, byrow=TRUE)
    # Mask out the (s, t) pairs excluded by the limits function
    mask <- t(outer(smat[1,], tmat[,1], limits))
    #L <- pcox:::getL(smat, integration=integration, n.int = tmat[,1])
    L <- pcox:::getL3(smat, integration=integration, mask=mask)
    x.var[is.na(x.var)] <- 0
    rownames(x.var) <- NULL
    LX <- L*x.var  # quadrature weights times the covariate
    if (divide.by.t)
      # Normalize by the width of the integration window, min(t, lag)
      LX <- LX/matrix(sapply(t.var, min, lag), nrow=n, ncol=J)
    pdat <- data.frame(tmat=I(tmat), smat=I(smat), LX=I(LX))

    if (is.null(sm.in)) {
      # Fitting: construct the smooth and turn it into a penalized term
      sm <- smoothCon(s(tmat, smat, by=LX), data=pdat,
                      knots=NULL, absorb.cons=TRUE)[[1]]
      sm.out <<- sm
      pterm(sm, method=method, eps=eps)
    } else {
      # Prediction: reuse the stored smooth to build a matching design matrix
      PredictMat(sm.in, data=pdat)
    }
  }
  if (dbug) {
    debug(myfunc)
  } else if (isdebugged(myfunc))
    undebug(myfunc)
  myfunc
}

It's still a bit rough - yeah, I am still using <<- at the moment to extract the smooth object, but that's because this isn't built into pcox yet; I am feeding the tt function directly to coxph manually. That will eventually change. For the version in pcox, there will also have to be a lot more logic in the constructor function to choose the type of term. But I think this has the basic idea: if we keep track of a tt function for each term, then we don't have to worry in the methods about how each term sets up the data - it's all built into the function.
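For reference, a rough sketch of the manual use (hedged: it assumes survival and mgcv are loaded, dat holds the data with functional covariate matrix X, and that coxph accepts the pterm() output; the exact call may differ):

library(survival)
library(mgcv)
tt.func <- create.tt.func(method = "aic")
fit <- coxph(Surv(time, event) ~ tt(X), data = dat, tt = tt.func)
# sm.out was captured via <<- in the smoothCon() branch; binding it into
# the closure makes subsequent calls take the PredictMat() branch
environment(tt.func)$sm.in <- sm.out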

What do you think?

fabian-s commented 9 years ago

Every model term basically carries around the function that creates/created its design matrix, yeah? And if we need the design matrix (for new data) at some point, we simply call that function? I think that's an excellent idea if you can pull it off.

Some pitfalls to watch out for, since functions that are returned by other functions have some peculiarities:
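For example, one classic pitfall of function factories in R is lazy evaluation of the factory's arguments - a minimal illustration:

# Factory arguments are promises: without force(), x is only evaluated
# when the returned function is first called, which can surprise you if
# the variable passed in has changed in the meantime
make_multiplier <- function(x) {
  force(x)  # evaluate the argument now, not at first call
  function(y) x * y
}
times3 <- make_multiplier(3)
times3(2)  # 6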

I'm a little skeptical that this will mean less "if/then logic" -- it seems to me this would just change where the conditionals happen (inside the function factory create.tt.func instead of in the code calling predict etc.), not how many you'd have to write. Or would we have a different create.tt.func for each type of term?

jgellar commented 9 years ago

Thanks for the comments and the things to watch out for. As of right now, I don't think the create.tt.func functions carry around any large objects - mostly little flags and scalar parameters that tell the tt function what to do. The only exception would be if I add the smooth object (sm.in) - I just checked the size of one of these, and it is about 12 MB. You're right that we don't want to save these objects twice, so it just comes down to how to save them. Do we want them saved separately as a list (e.g., object$pcox$smooth or object$smooth), or does it make sense to match each one with its term as part of the saved tt function? I can't think of a use for this object other than as something bound to the tt function, so perhaps the second option is better.

In terms of the if/then logic, I think it saves some of this because without this approach we would have to go through the logic twice: once when we create the term, and then again in all the methods - and it would have to be embedded somehow in every method. With this proposal, we only go through the logic once, when the term is created.

I am leaning towards different create.tt.func's for each type of term, or maybe for each "group" of terms. If there was just one, then more flags/logic would need to be built directly into the tt function to determine the type of term and how it is modeled. That means more "overhead" every time it is called, more things to store, etc. (though none of this will likely impact performance, because they're all little things). Small differences between terms (e.g., time-varying vs. time-fixed) could be handled in the same create.tt.func, but larger differences (e.g., a simple scalar term vs. a historical time-varying covariate term) would get separate ones. I also think this would be easier for us to read/understand/debug. Also, most of the logic already exists based on the parsing of the formula: if hlf() is used, we could call create.tt.hlf() when we parse the formula; if tv(s()) is used, then maybe we call create.tt.s(tv=TRUE), or something like that.
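Sketching that dispatch step (the constructor names and classes are placeholders matching the ones above):

# During formula parsing, map each special to its own factory;
# create.tt.hlf() and create.tt.s() are hypothetical term-specific factories
make.tt <- function(term) {
  if (inherits(term, "hlf")) {
    create.tt.hlf()                      # historical terms
  } else if (inherits(term, "tv")) {
    create.tt.s(tv=TRUE)                 # time-varying smooth terms
  } else {
    function(x.var, t.var, ...) x.var    # plain scalar terms
  }
}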

The only negative about multiple functions is that it means there are more functions - it isn't as clean, there is more to document, etc.

What do you think about the idea of using tt terms even when they aren't needed (e.g., for a simple linear effect of a scalar term)? I think this simplifies the coding on our end because there are fewer decisions to make among term types. The overhead of calling a simple "identity" function is negligible, right?

fabian-s commented 9 years ago

The only exception would be if I add the smooth term (sm.in) - I just checked the size of one of these, and it is about 12 Mb. You're right that we don't want to save these objects twice, so then it just comes down to how to save them. Do we want them to be saved separately as a list (e.g. object$pcox$smooth or object$smooth), or does it make sense to just match it with the term as the saved tt function? I can't think of a use for this object other than as something bound to the tt function, so perhaps the second option is better.

sounds reasonable. i'm not really all that clear on which circumstances would trigger a copy operation -- i think it happens if the same object is referenced under two different names or from two different environments, and then one of the bindings to the object is modified, or something like that.
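for what it's worth, base R's tracemem() makes the copy-on-modify behaviour visible:

x <- runif(1e6)
tracemem(x)  # start reporting copies of x
y <- x       # no copy yet: both names bind the same vector
y[1] <- 0    # modifying one binding triggers the duplication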

In terms of the if/then logic, [...] we only have to go through the logic once, when the term is created.

yeah, right. excellent!

I am leaning towards different create.tt.func's for each type of term, [...]. I also thing this would be easier for us to read/understand/debug.

yes, definitely. encapsulation is the way to go!

The only negative about multiple functions is that it means there are more functions - it isn't as clean, there is more to document, etc.

won't they be internal anyway (so: they wouldn't require extra help files)? could maybe also solve this via an S3 generic create.tt.func that dispatches to the different methods for s, hlf etc. objects, and then just export/document the generic and keep the methods invisible - or are we using hlf(), s() etc. only as formula terms and not as constructors of s- or hlf-class objects?
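i.e., roughly (a sketch; assumes hlf() and s() return classed objects):

create.tt.func <- function(term, ...) UseMethod("create.tt.func")
create.tt.func.default <- function(term, ...) {
  # plain scalar term: identity tt function
  function(x.var, t.var, ...) x.var
}
# term-specific methods (create.tt.func.hlf, create.tt.func.s, ...) would
# build their tt functions as in the prototype above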

What do you think about the idea of using tt terms even when they aren't needed (e.g., for a simple linear effect of a scalar term)?

Makes sense to me if it results in a cleaner code structure.