Open josherrickson opened 9 years ago
Just noting below some annoying behavior when trying to access extra "columns". Using drop=FALSE
lets dim
return non-NULL values, but doesn't split the cbind
, whereas not using it returns NULL on vectors. What would happen if you had a "column" that was a matrix of dimension 32x1?
> dim(d2)
[1] 32 3
> dim(d2[,1])
[1] 32 2
> dim(d2[,1,drop=FALSE])
[1] 32 1
> dim(d2[,2])
NULL
> dim(d2[,2,drop=FALSE])
[1] 32 1
What a thicket! Two thoughts:
fill.NAs
warn or stop if the user tries to give it a 2-column response, coupled with adding an example to the fill.NAs
help page showing how to handle this sort of scenario, and a pointer to that example in the warning or stop message.fill.NA
. Before recording some notes relevant to (2), let me say that I'm guessing that (1) will be substantially less work, and in that case I recommend it over (2).
Re (2): Our present assumption is that it'll spit out a data frame, outdf
, with the property that (formula(outdf)
, outdf
) are a suitable (formula, data) argument pair to be passed to glm
(bayesglm
, etc). This requires that the first column of outdf
encode the dependent variable and that the remaining columns all encode independent variables. On the other hand, if we were to replace this with an assumption that (formula(terms(outdf))
, outdf
) are a suitable (formula, data) argument pair, then we could use the terms
attribute of outdf
to record what we want the end model to look like, and thus be more flexible about what's passed in outdf
. (#103 calls for moving a little ways in this direction. A fuller version of that task would have pushed this way more definitively, but I held off on asking for that, at least for now. See discussion of "longer wishlist" in comments.)
We could use this mechanism to avoid adding two-column columns to matrices, and in general to touch only those columns of the data frame given to fill.NAs
that we need to touch in order to perform imputation on them. outdf
might just pass all remaining columns back to the user unchanged, or, in the event that the user has fill.NAs
a genuine formula as an x
argument rather than a data frame, outdf
might be whittled down to imputed variables, imputation flag variables and other variables named in that formula. The nature of the dependent variable, including cbind
's, +
es or whatever, would be conveyed entirely through the terms attribute of outdf
, as we never impute LHS variables.
Perhaps the solution to this issue (and some related issues with trying to pass along strata information) is to return a new object of our own creation (in particular a subclass of forumla
).
Looking at the glm
function, it will call model.frame
on it's arguments. If we passed in a custom object into glm
, we could create our own model.frame
method that would produce the filled in version that we wanted. Likewise, glm
uses the model.response
function to get the Y
object, which can be either a vector or matrix. This is a function, not a method, but we could see how it is implemented to return the right thing (via the terms attributes on what we return from model.frame
).
I'm going to prototype this idea to see if it can handle the simple situation, and then can see if it also works for these edge cases.
I'm optimistic about this approach. I've pushed up a branch with an initial attempt. So far, so good.
One potential benefit but also possible issue is the way this interacts with functions like predict
. On the pro side, since our object gets saved, our methods get called when building the new data frame for prediction. On the con side, with the current implementation, if the new data doesn't have any missing data for a column that was missing in the original data, we get a mismatch on the model matrix size.
It might be possible to take care of the above problem with a special terms object that indicates which columns must be imputed. Right now, there's a basically empty terms function to make sure we get called. This could probably be smarter all around.
This approach looks very very promising. Perhaps it will help us knock out a number of the fill.NAs
related issues in our issue queue. In terms of coherence w/ the ordinary modeling workflow, and with R's presumptions about model building and the like, it's a substantial advance on what we have now.
It's also a relatively large change, the kind that could expose coverage gaps in our testing. @josherrickson , could you review changes in the fillNAObject branch with a view to such gaps, coordinating w/ Mark on fixes & additions? @markmfredrickson, depending on what you and Josh identify this might go into the next release or the one after that; happy to have you decide.
Speaking of gaps in testing, in optmatch 0.9-5 the below fails, as discussed in comments to #103:
> glm(fill.NAs(pr~cap+factor(ne), data=nuclearplants), family=binomial)
Error in model.matrix.default(mt, mf, contrasts) :
model frame and formula mismatch in model.matrix()
Noting this here in interests of getting this into testing suite.
Another something to include in testing suite: check that sandwich:::estfun.glm
generates estimating functions corresponding to generated .NA columns as well as columns present in original data frame, and in general that the new fill.NAs
plays nicely with bread and meat utility functions in the sandwich package. (In this regard it might be helpful to set up a new na.action and corresponding naresid
method; see definition of sandwich:::estfun.glm
.)
Checking in with this after a little hiatus. @markmfredrickson would it seem like a little thing or a big thing to take up this thread from here? (Perhaps the answer is in your comments earlier on this thread, but I couldn't clearly see.) If it's a little thing, then I'd support your trying to update the fillNAObjects branch and applying the fix you note to address the problem you named. The call I made two years ago for a survey of testing gaps shouldn't hold us back any more.
If it seems like more then a little thing to address then this issue and #39 can be tagged "no action", potentially also closed without resolution. (At least, without resolution other than noting the relevant limitations in fill.NAs
documentation and/or adding warnings or error messages.) Anyway looking forward to hearing your thoughts, Mark.
This came up in issue #100. Data frames can have
matrix
"columns", as shown above. These objects arise frommodel.frame
calls on an object with acbind
response. However, these do NOT arise when callingmodel.matrix
such aswhich gets called inside
fill.NAs
.Three possible solutions I see: 1) At the very beginning, rework the data frame into the traditional structure. This would be easiest, but might produce unintended results - I think
fill.NAs
should return data the same size as its argument (sans any____.NA
columns). 2) Hijinks with dimensions of the "columns" of thex
arguments tofill.NAs.
Do-able, but seems fragile. 3) Refactorfill.NAs
to be recursive (literally or figuratively), and make it perform only on vectors. Heavy lifting, but seems the most stable to me. Break down items into vectors, and build them back up in a methodical fashion.Also, how does this interact with the
___.NA
columns? One per "column" or one per column?