Closed jgellar closed 9 years ago
Every modelterm basically carries around the function that creates/created its design matrix, yeah? And if we need the design matrix (for new data) at some point we simply call that function? I think that's an excellent idea if you can pull it off.
Some pitfalls to watch out for, since functions that are returned by other functions have some peculiarities:
sm.in
in your example code above) is the body of the function they were created in (i.e., the "function factory")create.tt.func
sets up any large-sized R objects that subsequently get modified in some way.I'm a little skeptical that this will mean less "if/then logic" -- seems to me this would just change where the conditionals happen (inside the function factory create.tt.func
instead of in the code calling predict
etc) not how many you'd have to write. Or would we have a different create.tt.func
for each type of term?
Thanks for the comments and the things to watch out for. As of right now I don't think create.tt.func
functions carry any large objects - it is mostly little flags and scalar parameters that tell the tt
function what to do. The only exception would be if I add the smooth term (sm.in
) - I just checked the size of one of these, and it is about 12 Mb. You're right that we don't want to save these objects twice, so then it just comes down to how to save them. Do we want them to be saved separately as a list (e.g. object$pcox$smooth
or object$smooth
), or does it make sense to just match it with the term as the saved tt
function? I can't think of a use for this object other than as something bound to the tt
function, so perhaps the second option is better.
In terms of the if/then logic, I think it saves some of this because without this method we would have to do the logic twice: once when we create the term, and then again in all the methods. And it would have to be embedded somehow in all of the methods. With this proposal, we only have to go through the logic once, when the term is created.
I am leaning towards different create.tt.func
's for each type of term, or maybe for each "group" of terms. If there was just one, then there would need to be more flags/logic built directly into the tt
function to determine the type of term and how it is modeled. This means more "overhead" every time it is called, more things to store, etc. (though none of these will likely impact performance because they're all little things). Small differences between terms (e.g., time-varying vs. time-fixed) could be included in the same create.tt.func
, but larger differences (e.g., a simple scalar term vs. a historical time-varying covariate term) would get separate ones. I also thing this would be easier for us to read/understand/debug. Also, most of the logic already exists based on the parsing of the formula: if hlf()
is used, when we parse the function we could call create.tt.hlf()
; if tv(s())
is used, then maybe we call create.tt.s(tv=TRUE)
or something like that.
The only negative about multiple functions is that it means there are more functions - it isn't as clean, there is more to document, etc.
What do you think about the idea of using tt
terms even when they aren't needed (e.g., for a simple linear effect of a scalar term)? I think this simplifies the coding on our end because there are less decisions to make among term types. The overhead of calling a simple "identity" function is negligible, right?
The only exception would be if I add the smooth term (sm.in) - I just checked the size of one of these, and it is about 12 Mb. You're right that we don't want to save these objects twice, so then it just comes down to how to save them. Do we want them to be saved separately as a list (e.g. object$pcox$smooth or object$smooth), or does it make sense to just match it with the term as the saved tt function? I can't think of a use for this object other than as something bound to the tt function, so perhaps the second option is better.
sounds reasonable. i'm not really all that clear on which circumstances would trigger a copy operation -- i think it happens if the same object is references under two different names or from two different environments and then one of the bindings to the object is modified or something like that.
In terms of the if/then logic, [...] we only have to go through the logic once, when the term is created.
yeah, right. excellent!
I am leaning towards different create.tt.func's for each type of term, [...]. I also thing this would be easier for us to read/understand/debug.
yes, definitely. encapsulation is the way to go!
The only negative about multiple functions is that it means there are more functions - it isn't as clean, there is more to document, etc.
won't they be internal anyway (so: don't require extra help files)?
could maybe also solve this via an S3 generic create.tt.func
that dispatches to the different methods for s
, hlf
etc objects and then just export/document the generic and keep the methods invisible - or are we using hlf()
, s()
etc only as formula terms and not as constructors of s
- or hlf
-class objects?
What do you think about the idea of using tt terms even when they aren't needed (e.g., for a simple linear effect of a scalar term)?
Makes sense to me if it results in a cleaner code structure.
Hi guys,
I've been playing around with the historical cox model. I have realized that the code that is used in a lot of the post-estimation steps (extracting the coefficient function, getting predictions, etc.) is very similar to the code that is contained in the
tt
function. In particular, all of these functions require setting up the data in exactly the same way, so that it's interpretable bymgcv
. The main differences are:tt
function callssmoothCon()
, where as the methods require a smooth object to be input and then callPredictMat()
tt
function would require the original data, thecoef
method would require some sort of standardized or user-input dataset indicating the s/t locations, and thepredict
method would either require a user-input dataset or default to the original data (to extract the linear predictors from the model).There are probably some other differences that we'll run into, but there are a lot more commonalities. I has occurred to me that we could save ourselves a lot of if/then logic in the methods if, instead of the methods having to determine the type of term (e.g., if it's a historical Cox term, a scalar linear time-varying term, a concurrent time-varying covariate, etc. etc. etc.), that we just save the
tt
functions themselves. Thett
constructor function can have an argumentsm
which has an optional added smooth term, which defaults toNULL
. Or it can just dosm <- NULL
somewhere in the construction. Ifis.null(sm)
, then thett
function will callsmoothCon()
, and thenpterm()
. If a smooth object is supplied, then thett
function callsPredictMat()
. After we create thett
function and use it to fit the model, we can edit its environment to add the appropriate smooth object.This makes designing the methods MUCH simpler. All you do is find the appropriate
tt
function, and then feed them the correctx.var
andt.var
. We don't have to figure out the type of term it is, because that will already be built into the constructor function. In fact, we could do this same thing for terms that don't even require att
term. For example, we can have a simple scalar term in the model. Thett
constructor would just create a function like:That doesn't even use the
t.var
argument. It's obviously not necessary to do this when fitting the model, but again it would eliminate the need to have the post-estimation methods decide betweentt
and non-tt
terms.Here's an example of the constructor function I'm using for the historical Cox terms:
It's still a bit rough - yeah I am still using
<<-
at the moment to extract the smooth object, but that's because this isn't built intopcox
yet, I am feeding thett
function directly tocoxph
manually. That will eventually be changed. For the version inpcox
, there will also have to be a lot more logic in the constructor function to choose the type of term. But I think this has the basic idea - if we keep track of att
function for each term, then we don't have to worry in the methods about how each term sets up the data - this is all built into the function.What do you think?