jgellar / pcox

Penalized Cox regression models
1 stars 0 forks source link

Specifying terms involving concurrent time-varying covariates #5

Closed jgellar closed 9 years ago

jgellar commented 10 years ago

In the pcox.tex document, I included a table in how I envisioned specifying all of the special terms that are accepted in the model formula. For the section on concurrent time-varying covariates (i.e., "traditional" time-varying covariates), I thought the most natural way (from the user's perspective) was to input the functions as a matrix, and have the syntax be exactly the same as the corresponding terms for scalar predictors. In particular:

First, do you agree that this would be natural from the user perspective, or do you have other suggestions? The way this is done most commonly (in coxph()) is to have a long dataset, with "start" and "stop" times indicated for when a particular observation of the TVC is valid.

Now, in terms of the implementation. Note that since all of these terms involve time, they all need to be created as "tt" terms that are called within 'coxph()'. For example, the first case, the tt function would be something like:

function(x,t,...) {
    sapply(1:length(t), function(i)
              x[i,t[i]])
}

For the first two cases listed above, we just need to call is.matrix() to check if it's a time-varying covariate or scalar term, and based on the result, process the term appropriately. For the third case, do we have to write a pcox::s() function? It would start with an is.matrix() call, which if FALSE just passes everything to mgcv::s(), and if TRUE, creates the appropriate "tt" term. However, I could see this causing problems: what if somebody loads pcox, but then wants to fit a regular gam() model using s()? There would be conflicts, right? Or would the correct s() function be called just because it is called from gam(), so it is within an mgcv namespace?

The 4th case above I am not really going to consider until I get further into writing the tv() function.

Jon

adibender commented 10 years ago

As of earlier versions, there will be a conflict. In fact I was curious about this and got quiet some backlash at stackoverflow: http://stackoverflow.com/questions/20693865/s-in-mgcv-doesnt-work-when-vgam-package-loaded. Of course the user will be warned that s() is masked by pcox etc., but I still think this should be made more robust if possible.

jgellar commented 10 years ago

Wow, that backlash was a bit harsh. I completely agree with you that in an ideal world R would be smart enough to know that in a gam() formula, it should look for mgcv:::s(), while in a vglm() formula, it should look for vgam:::s().

Perhaps the solution is to not define a function pcox::s(), but instead some other function (e.g., pcox::s2()), and "mask" this to the user. So the user would still use s() to specify the smooth, and pcox would look for special terms that use s(). However, instead of executing the call to s(), we change the function being called and instead call s2().

This does cause a bunch of annoying problems down the line though, because all the "labeling", variable names, etc. would then have "s2" instead of "s", and I'd rather the user not see this. Also might cause problems with the methods that we eventually write, but we could probably get around all of these (though it would be annoying).

jgellar commented 10 years ago

If we make a function s() for pcox that is internal, would that resolve the issue?

fabian-s commented 10 years ago

First, do you agree that this would be natural from the user perspective, or do you have other suggestions?

Yes, that seems like a good idea. Possible complication: do you want to also accept terms like s(tv(Z))? If yes, formula parsing will become (much) more tricky. If no, pcox:::s() needs good input checking because the proper order of specical term types will not be obvious or logical to users.

RE:

function(x,t,...) {
    sapply(1:length(t), function(i)
              x[i,t[i]])
}

That would work only if all(t %in% 1:ncol(x)) -- is that a reasonable assumption?

For the third case, do we have to write a pcox::s() function? It would start with an is.matrix() call, which if FALSE just passes everything to mgcv::s(), and if TRUE, creates the appropriate "tt" term.

Sounds good to me.

However, I could see this causing problems: what if somebody loads pcox, but then wants to fit a regular gam() model using s()? There would be conflicts, right? Or would the correct s() function be called just because it is called from gam(), so it is within an mgcv namespace? If we make a function s() for pcox that is internal, would that resolve the issue?

Regardless of whether pcox exports s() or not, it should work if we control the formula evaluation / parsing of model terms carefully. We need to make sure that the term-constructors are evaluated in an environment where "s()" evaluates pcox:::s() and not mgcv's s() unless you explicitly do mgcv::s(). That may require messing with the formula's environment in the worst case, or simply using explicit pcox::s or mgcv::s throughout the term evaluation to avoid mishaps. Or we simply call it sm()and avoid the potential for conflicts and non-standard evaluation hackery, while still having a somewhat mnemonic name & label for the sm ooth terms.

@adibender watagwan! never seen SO contributors so ripley-esque. Poor boy. Also: what are you doing loitering on GitHub during your vacation? Switch off the damn computer, take care of your cold and enjoy some time off. You're gonna burn out one of these days.

jgellar commented 10 years ago

Thanks Fabian for the suggestions. I was NOT planning on allowing for s(tv(Z)), and didn't really consider it. I would assume that anyone who wanted to fit such a model would read the help file first, so I think if we are very clear about the formula "rules" in the help file we will be fine.

Yes, the tt function I proposed will only work if the times are positive integers. In the work I've done so far, I'm lucky that this is in fact the case. There are probably a few places in the code where I "assumed" positive integers for the times, which we will need to find and change.

One thought that I had about this, though, was that I don't think the Cox model itself cares what the actual event/censoring times are, just there order. So maybe we can get away with converting all of the times to positive integers? This may speed up computations, because without it there will be a lot of which() statements and indexing. However, when time is included as a smooth term, the real event/censoring times will be needed. There will be fewer of these cases though, so maybe it would be most efficient to convert the times to integers, but for these terms, feed it the original time points (something like s(x, orig.times[t])) where t is the integer version (the indices). What is your intuition on whether this would be helpful?

For the last part, you're going to have to help me figure out how to do this, as I don't fully understand what would be required. There's no reason that we would need to export pcox::s(), correct?

fabian-s commented 10 years ago

I was NOT planning on allowing for s(tv(Z)), and didn't really consider it. I would assume that anyone who wanted to fit such a model would read the help file first

:laughing: you're so funny.

so I think if we are very clear about the formula "rules" in the help file we will be fine.

sure... </sarcasm>. You can always add that check & warning after you got the first couple of bug reports telling you your package doesn't work because s(tv(...)) wouldn't run :grinning:

One thought that I had about this, though, was that I don't think the Cox model itself cares what the actual event/censoring times are, just there order.

Because that's the only thing that's relevant for the risk-set for any given time-point, right? We should still check that, either by doing math (the horror!) or by writing identical data with time-varying covariates in start-stop notation,once with irregular time intervals and once with simple integer-valued time points that represent their ordering and compare results on both datasets. Still seems weird to me it would behave like that --- suppose you have some biomarker measured after 3 days and after 14 days of follow-up, with an event on day 16. Are you saying that data

patient start stop biomarker event
id1 0 3 x11 0
id1 3 14 x12 0
id1 14 16 x13 1

will yield the same results as

patient start stop biomarker event
id1 1 2 x11 0
id1 2 3 x12 0
id1 3 4 x13 1

?

So maybe we can get away with converting all of the times to positive integers? This may speed up computations, because without it there will be a lot of which() statements and indexing. However, when time is included as a smooth term, the real event/censoring times will be needed. There will be fewer of these cases though, so maybe it would be most efficient to convert the times to integers, but for these terms, feed it the original time points (something like s(x, orig.times[t])) where t is the integer version (the indices). What is your intuition on whether this would be helpful?

I'd expect the time spent on indexing/which() to set up the terms to be negligible compared to the time spent on estimation. I also think moving to an artificial integer-time scale (even if results are identical) will make results harder to interpret and post-processing methods (e.g. visualization of fits) harder to program because they will need to reconstruct the original times. Having the linear and smooth terms use different time scales, in a sense, is a likely source of error and confusion in the code as well. I don't think it would be helpful.

For the last part, you're going to have to help me figure out how to do this,

Sure.

There's no reason that we would need to export pcox::s(), correct?

Probably not. IIUC the issue that Andi ran into is due to the fact that mgcv's formulas get evaluated not in an environment that is a direct child of package:mgcv but in the formula's environment itself, which is typically the global workspace. If VGAM is loaded after mgcv, then package:VGAM is lower on the search path than package:mgcv when you're in the global workspace, so VGAM::s gets found first and is evaluated instead of mgcv::s. As I said earlier, we'll just have to control the environment in which the formula is evaluated carefully and/or specify explicitly which s() we need each time we call it.

jgellar commented 10 years ago

The reason the basic Cox model doesn't care about the precise times is because of the proportional hazards assumption - time basically drops out of the model. The only time "information" that is relevant is who is available in the risk set at each time. In the classical time-varying covariate model, the only value of the covariate that affects the hazard at any time is the current one, so again the precise times don't matter. The baseline hazard is completely unspecified by the Cox model, but you can use the Breslow estimator to find it after fitting the Cox model. This estimator is just a step function though, so again the exact times won't matter except in determining where the steps are (which could easily be transformed after the estimation).

In more complicated models (including many of the ones we are fitting), time finds its way into the model in other ways, as part of the smoothing (for time-varying effects) or as a domain we integrate over (for the historical terms). In these cases the precise times do matter. But it wouldn't be very difficult to refer to the actual times for these terms, but the integer times when we need to do the indexing.

You're probably right that it's wasted effort (for now) and will likely be more confusing to convert everything to integer times. Let's just keep it in the back of our minds as an option if setting up the model is taking longer than expected. This might happen if we are fitting on big datasets with lots of time points or uneven spacing of the times.