Open jgellar opened 9 years ago
The issue above was a reminder to me. Recall Fabian that we talked about a strategy for NA
-checking. Right now, I'm telling coxph()
to use na.pass
- this is because the data for the time-varying covariates will have NA
's in it (for time points after the event times). We talked about the need to really have two different types of NA
's:
NA
's due to the variable-domain structureNA
's due to actual missing dataFor the first case, we want to pass these observations through, and have the tt
function be able to handle everything. For the second case, we want to remove those rows from the dataset.
A common scenario where we would actually induce a lot of missing data is for concurrent time-varying covariates with a time lag. For example, if we use a "5-day lagged" version of the TVC, then we basically have to ignore the first 5 days worth of data. We could probably do this by having that row give an NA
for that column of the design matrix, and somehow label it as a "case 2" missing value (hence, needs to be removed).
Perhaps this is the easiest way to handle this issue:
pcox
(before coxph()
is called), we look at the time-varying covariate matrix and see where we actually need the data. This will be based on the "limits" argument for the term. If the limits would return FALSE
, then that cell of the matrix is not needed.na.omit
or na.exclude
instead of na.pass
in the call to coxph()
.The hard part of step 1 is that we need to do this for every time point that this subject's covariate is involved in (i.e., at all event times that the subject is still in the risk set for), and these times are not available until we get to coxph()
. So that might make this impossible, unless we create the risk set in pcox
- which would basically negate a lot of the work we've done to do everything through the tt
functions.
If we write our own na.action
function, this function will need to know what the limits
argument is. I don't know of any way to do this. So this might not be a realistic option either. I guess we would have to create the na.action
function dynamically within pcox()
, and add all the limits
arguments for each term individually? Then "unpack" those limits
arguments within the na.action
function, and do all the checks in there? What a headache.
Any suggestions are welcome. I am not all that familiar with this na.action
and model.frame
stuff.
seems to me like this could be solved by improving the NA handling in coxph()
itself. anything else will likely be even messsier, as you point out.
you could inject lines into coxph()
with trace()
to replace never-accessed NAs
by 0 before coxpenal.fit()
or fitter()
are called. you know coxph()
much better than me -- maybe inject code replacing never-accessed NA
s with 0s in each tt
-designmatrix after the for (i in 1:ntrans)
-loop?
i used to do something similar to override mgcv's constraint handling before simon wood implemented the kinds of constraint i needed, see here (ctrl-f for "trace"). don't forget to switch off the trace in an on.exit
statement if you try to do it like that.
writing a new na.action()
-function would be an excellent idea, but by itself won't help you here as the na.action
-argument is used by model.frame()
long before the data set is expanded to repeat the event times in the risk set or any tt-terms are evaluated -- the relevant line mf <- eval(temp, parent.frame())
in coxph()
is very early in the function.
I addressed this issue in 6ff9b6d4e4e0679446c578b622b97c835872077a.
The one potential problem is that it wouldn't work when I tried to trace survival::coxph
, so I changed it to coxph
and it works fine. Not sure what the problem is, but my solution does not seem like it is "best practice". Possibly related to namespace issues I discussed in #29.
Eventually we need to do this.
Requires:
things to consider:
NA
- remove the rowNA
's, need to see if that value is needed: