Changes to Formula Syntax

jgellar / pcox

Penalized Cox regression models

1 stars 0 forks source link

Changes to Formula Syntax #9

Closed jgellar closed 9 years ago

jgellar commented 10 years ago

I am considering making a few changes to the syntax for how particular types of terms are specified in the pcox formula.

The first involves concurrent functional terms, e.g. $\beta Z_i(t)$ for functional predictor $Z_i(t)$. Recall that my previous plan was to have these terms indicated just as a matrix in the model formula, i.e. Surv(time,event) ~ Z for a matrix Z. However, this doesn't let us specify the time points that correspond to the columns of Z - it requires making an assumption on how these columns relate to the time variable (i.e., there needs to be one column for each time point). It also doesn't allow for any other options regarding how Z is processed - I can't think of any of these right now that we might want, but it may come up.

A more flexible way of specifying these terms would be as a function, e.g. clf(Z) for a "concurrent linear function" and caf(Z) for a "concurrent additive function". This would correspond to how we allow for historical terms with hlf() and haf(). This also makes it so we don't have to check if Z is a matrix or not to differentiate between scalar and concurrent functional terms.

Another thing I was thinking of doing is dropping the "l" and "a" for both the historical and concurrent functional terms. This would leave us with just hf() and cf(), with an argument additive that defaults to FALSE.

We could also go one step further and combine hf() and cf() into one function with an argument to differentiate between the two. Perhaps call that function hf(), with an argument concurrent that defaults to FALSE? Here, the idea is that both term types are "historical" because the functional domain is time, but we choose to either treat it as concurrent or non-concurrent. Or maybe f() (or ff()?) with an argument like ftype or something general like that, with options historical and concurrent. I think the amount and difficulty of coding is pretty similar with all of these options, it's mostly from the user perspective that I was thinking of.

Finally, for the tv terms: I was thinking of phasing out the tv(hf(X)) syntax and instead giving an argument to hf (or ff or whatever we go with) called tv. Perhaps it should default to TRUE for historical terms and FALSE for concurrent terms? This could be confusing though. We could also do this with other terms: lf(tv=TRUE), af(tv=TRUE), s(tv=TRUE), etc. Doing this would require writing a pcox version of all these functions, which calls the refund/mgcv versions of them within. If we do this, the only reason we would still need a tv() function is for the regular scalar terms: $\beta(t) x$, because these terms otherwise wouldn't be contained in a function.

jgellar commented 10 years ago

Here's another option I was thinking about. We've talked about allowing for three types of functional predictors: baseline functional terms, historical functional terms, and concurrent functional terms. For baseline functional predictors, I was originally planning on using lf(), af(), and lf.vd() to express the terms, in order to maintain consistency with refund syntax. It was also for ease of programming, since we already have these functions and wouldn't have to rewrite them.

I'm now thinking of rewriting them. Instead of allowing for lf, af, and lf.vd terms, how about we use bf() (for baseline function) to cover all three of them? We would then have arguments for the following options: additive, tv (time-varying), and vd (variable domain). Then we would use hf() for historical functional terms, and cf() for concurrent functional terms.

The advantage of this is I think it is cleaner, and easier for users to pick up (especially if they are not familiar with the refund syntax). The disadvantages are that it isn't consistent with the refund syntax (which I originally wanted, but now I don't think is that big of a deal), and it involves writing some new code. In particular, the variable-domain stuff is a bit of a pain - if we want to allow for all of the options that I allowed for in lf.vd(). These include reparameterizations of the coefficient function, which in some cases results in multiple "smooth" objects for a single term. For example, if we want $\beta(s,t) = \beta_1(s) + t*\beta_2(s)$ in a linear interaction model. These features are probably best left for a future version of the software - though I am interested in trying them out on my data so I have a real reason to want to implement it.

fabian-s commented 10 years ago

Another thing I was thinking of doing is dropping the "l" and "a" for both the historical and concurrent functional terms. This would leave us with just hf() and cf(), with an argument additive that defaults to FALSE.

We could also go one step further and combine hf() and cf() into one function with an argument to differentiate between the two.

Instead of allowing for lf, af, and lf.vd terms, how about we use bf() (for baseline function) to cover all three of them?

I think all 3 are a good idea -- how about something like having

tf(Z) = tf(Z, tv = FALSE, limits="s<=t", additive=FALSE)

be the default (i.e. a static historical linear term of a _t_ime-varying _f_unctional covariate), analogously to

bf(Z) = bf(Z,  tv = FALSE, limits="all", additive=FALSE )

specifying a static linear term integrated over its whole domain for a _b_aseline _f_unctional covariate? limits="t" or NULL could be used for concurrent effects.

I don't think changing this syntax will require lots of changes in the unerlying code -- bf() will simply call one of lf(), af(), lf.vd() and then send that on to tv() (or not). Same for tf().

jgellar commented 10 years ago

So would the only difference between tf() and bf() be the default for “limits”?

I thought about combining the historical and baseline functional covariates into one function, because the only real difference is in the “limits” argument. But I didn’t want to do that because the default for the “limits” is different between the two, and specifying it might be a little tricky for users.

I like your idea of having two different functions, with different defaults, but we could just make one of them call the other one. e.g., bf <- function(Z, …) { tv(Z, limits=“all", …) }

Similarly, cf <- function(Z, …) { tv(Z, limits=“t", …) }

Then we just let tv() work through everything. At least I think that was what you were getting at…

One potential issue is that the code for concurrent functions is a bit different from the other function types (if tv=FALSE and additive=FALSE, there is no smooth at all, whereas that’s not the case for the other functional terms), but there is enough overlap that it might be worthwhile.

Jonathan Gellar PhD Candidate Department of Biostatistics Johns Hopkins University Email: jgellar1@jhu.edu Phone: (213) 864-6677 Website: jonathangellar.com

On Nov 12, 2014, at 3:20 PM, Fabian Scheipl notifications@github.com wrote:

Another thing I was thinking of doing is dropping the "l" and "a" for both the historical and concurrent functional terms. This would leave us with just hf() and cf(), with an argument additive that defaults to FALSE.

2.

We could also go one step further and combine hf() and cf() into one function with an argument to differentiate between the two.

3.

Instead of allowing for lf, af, and lf.vd terms, how about we use bf() (for baseline function) to cover all three of them?

I think all 3 are a good idea -- how about something like having

tf(Z) = tf(Z, tv = FALSE, limits="s<=t", additive=FALSE) be the default (i.e. a static historical linear term of a _t_ime-varying _f_unctional covariate), analogously to

bf(Z) = bf(Z, tv = FALSE, limits="all", additive=FALSE ) specifying a static linear term integrated over its whole domain for a _b_aseline _f_unctional covariate? limits="t" or NULL could be used for concurrent effects.

— Reply to this email directly or view it on GitHub.

jgellar commented 10 years ago

In writing these functions, I've realized that I'm still repeating a lot of similar code for the different term types. After thinking about it awhile, I now think it's possible to combine all four major types of terms (scalars, baseline functions, concurrent, and historical) into one function, based on the limits argument. So here is my new plan:

I am going to write a new function, p() (for penalized) - I like this better than sm(). The idea is to be the analogue of mgcv::s(). It will have arguments including tv, additive, and limits. I would like this function to be able to fit ALL term types that we've been talking about, based on the three arguments above. If is.null(limits) (the default) - the terms within p() will be passed directly to s() (or whatever function is indicated by the basistype argument. This will for example be the easiest way to do a smooth of a scalar term (or terms).

I will then write three "convenience" functions: bf(), cf(), and hf(), for baseline, concurrent, and historical functional terms respectively. These functions will just call p() with the appropriate limit argument: limits="all" for baseline, limits="s=t" for concurrent, limits="s<=t" for historical.

What do you guys think? I think this is the way to get the maximum amount of flexibility out of these functions.

fabian-s commented 10 years ago

seems nifty, and very futureproof w.r.t. additional capabilities we might eventually think up if we can get the abstractions right for handing this over to mgcv and writing the converter functions for coxph.penalty objects.

the limit argument simply changes the L-matrix so that the linear functionals sum over the correct parts of the functional covariate's domain, right? so most of the action differentiating bf(), and hf()will be in the function that sets up L? cf() is different, though, because it yields a simpler term that's not a linear functional, but a time-varying covariate -- structurally more like what you'd do for scalar covariates with time-varying effects, right?

having a single wrapper still seems like a good idea.

btw: there are some easy efficiency gains to be had when setting up design matrices for linear functional terms with limits = "s<t" or similar not completely naively, check out the code around the shift_and_shorten()-utility in pffr().

jgellar commented 10 years ago

Yes, the limits argument only affects the L matrix in the cases of bf and hf terms, but affects more than that for the cf terms. Right now, I have it set up so that if it comes across limits=="t" (specifying a concurrent term), it modifies the data (within the tt function) so that the data matrix is converted into a vector. Then I set limits to NULL, so it is treated just like a regular scalar term.

I've kind of gone in circles all day long today on the best way to set this system up. FYI (and as sort of an exercise for myself to get my thoughts straight), here is the pseudocode for what I'm doing:

p():
1. Check if the term has a time-varying aspect.
  - If yes, call create.tt.func()
  - If no, call create.xt.func()
create.tt.func():
1. Process the limits argument
  - For bf and hf terms, turn limits into a function of (s,t)
  - For cf terms, use limits to turn the data into a vector, then set limits <- NULL
2. Check if the term requires any type of smooth/penalty
  - If not, just return the appropriate data
  - If it does, call pcoxTerm()
  - Once pcoxTerm() finishes:
    1. If this is a new term, assign the smooth back to the parent environment with env$smooth[[index]] <- smooth, and return coxph.penalty object
    2. If this is not a new term, return the prediction matrix
create.xt.func():
1. Pre-processing: process limits argument and check if a smooth term is needed
2. If smooth needed, call pcoxTerm(), otherwise just return the data
pcoxTerm():
1. if (!is.null(limits)): functional term
  - Create smat, and L
  - Create evaluation dataset
  - Create appropriate call
2. if (is.null(limits)): nonfunctional term
  - Create evaluation dataset
  - Create appropriate call
3. If no input smooth:
  - Evaluate call to mgcv::s() (or te or t2)
  - Create smooth term via smoothCon()
  - Create coxph.penalty term via pterm()
  - Return the smooth and coxph.penalty objects as a list
4. If there is an input smooth:
  - Call PredictMat() and return prediction matrix
pcox() (when it comes across a special term):
1. Evaluate the p() term in the evaluation environment
2. Check (based on result) if the term is time-varying or not
  - If time-varying:
    1. extract tt function
    2. add env, index, method, and eps to the function environment
    3. add the tt function to the tt function list
    4. mark as tt() in the coxph formula
  - If not time-varying:
    1. Extract and execute the xt function
    2. If the result is a penalized term, save the smooth and assign the coxph.penalty object
    3. If the results is unpenalized, assign the data
    4. Save the xt function

jgellar commented 10 years ago

I'm using the argument additive to indicate that the smooth should be taken over the x variable (possibly in addition to s or t). However, strictly speaking these models are not "additive"... a term $f(x)$ has a multiplicative effect on the hazard. "Additive" in the context of survival models means that the term will add to the hazard (i.e., not on the log scale). I think we should probably use a different argument name. Any suggestions? I don't like anything that I've thought of: smoothx, overx, multiplicative...

fabian-s commented 10 years ago

I kinda like smoothx. what about nonlin? nonlinear? nonlinx ? or basis?x_basis? basisx? or expand?expand_x?

jgellar commented 10 years ago

I think nonlinear works well. Or maybe linear because it's shorter to type? Just have to switch all the !'s, and have it default to TRUE.

fabian-s commented 10 years ago

good idea!

jgellar commented 10 years ago

I'm running into a little issue... there are a few ways around it but I'm not sure which is best.

You might recall that when we create a coxph.penalty object, we need to give it a method to optimize the smoothing parameter (e.g., "AIC", "cAIC", or "EPIC"), as well as eps (convergence level). For some reason you're allowed to do this separately for each term, but I want to make them all the same by including method and eps as arguments of pcox(), not of p().

So method and eps need to be passed from pcox() to pcoxTerm(), which creates the smooth and coxph.penalty objects. Originally I did this by adding them to the environment of the tt function, which works just fine. However, sometimes pcoxTerm() is called within p() as opposed to within a tt function: this is the case when the user wants a smooth that does not involve time at all. So we need to pass method and eps into p() - but I do NOT want the user to be able to enter those, and I don't want them to be included in the documentation or the function declaration. Any ideas on the best way to do that?

FYI, here is what it looks like when one of the p() (or bf() or hf() or cf()) terms is executed:

    for (i in where.pen) {
      # Evaluate term
      trm <- eval(terms[[i]], envir=evalenv, enclos=frmlenv)
      ...

I was hoping that adding method and eps to the evalenv would work, but apparently not (unless I'm doing something else wrong).

fabian-s commented 10 years ago

p() has a ...-arg: maybe you could add method and eps to the call to p() before evaluating it?

i.e., if terms[[i]] is a call-object just do

## check and warn if terms[[i]] already has $eps or $method
terms[[i]]$eps <- eps 
terms[[i]]$method <- method

and then do the eval.

if it's an expressionjust turn it into a call with as.call

EDIT: the nice thing about this is that it would allow overriding the global method and eps for specific terms if necessary..