Need to check NAs in functional/time-varying covariates

jgellar commented 9 years ago

Eventually we need to do this.

Requires:

sind
event times
limits
data

things to consider:

Any time a vector has NA - remove the row
For matrices with NA's, need to see if that value is needed:
- Look at the data required for each event time - this has to occur within the tt function
- Or we could just take a shortcut: e.g., throw an error if problems are suspected

jgellar commented 9 years ago

The issue above was a reminder to me. Recall Fabian that we talked about a strategy for NA-checking. Right now, I'm telling coxph() to use na.pass - this is because the data for the time-varying covariates will have NA's in it (for time points after the event times). We talked about the need to really have two different types of NA's:

NA's due to the variable-domain structure
NA's due to actual missing data

For the first case, we want to pass these observations through, and have the tt function be able to handle everything. For the second case, we want to remove those rows from the dataset.

A common scenario where we would actually induce a lot of missing data is for concurrent time-varying covariates with a time lag. For example, if we use a "5-day lagged" version of the TVC, then we basically have to ignore the first 5 days worth of data. We could probably do this by having that row give an NA for that column of the design matrix, and somehow label it as a "case 2" missing value (hence, needs to be removed).

Perhaps this is the easiest way to handle this issue:

First, somewhere in pcox (before coxph() is called), we look at the time-varying covariate matrix and see where we actually need the data. This will be based on the "limits" argument for the term. If the limits would return FALSE, then that cell of the matrix is not needed.
Any data that is not going to be accessed at all can be replaced by a numeric value (e.g., 0).
Use na.omit or na.exclude instead of na.pass in the call to coxph().

The hard part of step 1 is that we need to do this for every time point that this subject's covariate is involved in (i.e., at all event times that the subject is still in the risk set for), and these times are not available until we get to coxph(). So that might make this impossible, unless we create the risk set in pcox - which would basically negate a lot of the work we've done to do everything through the tt functions.

If we write our own na.action function, this function will need to know what the limits argument is. I don't know of any way to do this. So this might not be a realistic option either. I guess we would have to create the na.action function dynamically within pcox(), and add all the limits arguments for each term individually? Then "unpack" those limits arguments within the na.action function, and do all the checks in there? What a headache.

Any suggestions are welcome. I am not all that familiar with this na.action and model.frame stuff.

fabian-s commented 9 years ago

seems to me like this could be solved by improving the NA handling in coxph() itself. anything else will likely be even messsier, as you point out. you could inject lines into coxph() with trace() to replace never-accessed NAs by 0 before coxpenal.fit() or fitter() are called. you know coxph() much better than me -- maybe inject code replacing never-accessed NAs with 0s in each tt-designmatrix after the for (i in 1:ntrans)-loop?

i used to do something similar to override mgcv's constraint handling before simon wood implemented the kinds of constraint i needed, see here (ctrl-f for "trace"). don't forget to switch off the trace in an on.exit statement if you try to do it like that.

writing a new na.action()-function would be an excellent idea, but by itself won't help you here as the na.action-argument is used by model.frame() long before the data set is expanded to repeat the event times in the risk set or any tt-terms are evaluated -- the relevant line mf <- eval(temp, parent.frame()) in coxph() is very early in the function.

jgellar commented 9 years ago

I addressed this issue in 6ff9b6d4e4e0679446c578b622b97c835872077a.

The one potential problem is that it wouldn't work when I tried to trace survival::coxph, so I changed it to coxph and it works fine. Not sure what the problem is, but my solution does not seem like it is "best practice". Possibly related to namespace issues I discussed in #29.

jgellar / pcox

Need to check NAs in functional/time-varying covariates #11