markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

`cbind` on LHS of `fill.NAs` #104

Open josherrickson opened 9 years ago

josherrickson commented 9 years ago
> m <- lm(cbind(cost, t1) ~ cap + pr, data=nuclearplants)
> d <- model.frame(m, na.action=na.pass)
> d
  cbind(cost, t1).cost cbind(cost, t1).t1  cap pr
H                   NA              14.00  687  0
I               452.99              10.00   NA  0
A               443.22              10.00 1065  1
J               652.32              11.00 1065  0
B               642.23              11.00 1065  1
....
> dim(d)
[1] 32  3
> dim(d[,1])
[1] 32  2
> fill.NAs(d)
Error in `colnames<-`(`*tmp*`, value = ".NA") :
  attempt to set 'colnames' on an object with less than two dimensions
> fill.NAs(d, all.covs=TRUE)
Error in `colnames<-`(`*tmp*`, value = ".NA") :
  attempt to set 'colnames' on an object with less than two dimensions

This came up in issue #100. Data frames can have matrix "columns", as shown above. These objects arise from model.frame calls on an object with a cbind response. However, these do NOT arise when calling model.matrix such as

 > d2 <- model.matrix(~. + 0, model.frame(m, na.action=na.pass))
 > dim(d2)
 [1] 32  4
> d2
   `cbind(cost, t1)`cost `cbind(cost, t1)`t1  cap pr
 H                    NA                  14  687  0
 I                452.99                  10   NA  0
 A                443.22                  10 1065  1
 J                652.32                  11 1065  0
 B                642.23                  11 1065  1
 K                345.39                  13  514  0
.....

which gets called inside fill.NAs.

Three possible solutions I see: 1) At the very beginning, rework the data frame into the traditional structure. This would be easiest, but might produce unintended results - I think fill.NAs should return data the same size as its argument (sans any ____.NA columns). 2) Hijinks with dimensions of the "columns" of the x arguments to fill.NAs. Do-able, but seems fragile. 3) Refactor fill.NAs to be recursive (literally or figuratively), and make it perform only on vectors. Heavy lifting, but seems the most stable to me. Break down items into vectors, and build them back up in a methodical fashion.

Also, how does this interact with the ___.NA columns? One per "column" or one per column?

josherrickson commented 9 years ago

Just noting below some annoying behavior when trying to access extra "columns". Using drop=FALSE lets dim return non-NULL values, but doesn't split the cbind, whereas not using it returns NULL on vectors. What would happen if you had a "column" that was a matrix of dimension 32x1?

 > dim(d2)
 [1] 32  3
 > dim(d2[,1])
 [1] 32  2
 > dim(d2[,1,drop=FALSE])
 [1] 32  1
 > dim(d2[,2])
 NULL
 > dim(d2[,2,drop=FALSE])
 [1] 32  1
benthestatistician commented 9 years ago

What a thicket! Two thoughts:

  1. I'd also be happy with functionality to make fill.NAs warn or stop if the user tries to give it a 2-column response, coupled with adding an example to the fill.NAs help page showing how to handle this sort of scenario, and a pointer to that example in the warning or stop message.
  2. It may be that the simplest way to get this functionality (now or at some time in the future) would entail rejiggering the assumptions attending to the value of fill.NA.

Before recording some notes relevant to (2), let me say that I'm guessing that (1) will be substantially less work, and in that case I recommend it over (2).

Re (2): Our present assumption is that it'll spit out a data frame, outdf, with the property that (formula(outdf), outdf) are a suitable (formula, data) argument pair to be passed to glm (bayesglm, etc). This requires that the first column of outdf encode the dependent variable and that the remaining columns all encode independent variables. On the other hand, if we were to replace this with an assumption that (formula(terms(outdf)), outdf) are a suitable (formula, data) argument pair, then we could use the terms attribute of outdf to record what we want the end model to look like, and thus be more flexible about what's passed in outdf. (#103 calls for moving a little ways in this direction. A fuller version of that task would have pushed this way more definitively, but I held off on asking for that, at least for now. See discussion of "longer wishlist" in comments.)

We could use this mechanism to avoid adding two-column columns to matrices, and in general to touch only those columns of the data frame given to fill.NAs that we need to touch in order to perform imputation on them. outdf might just pass all remaining columns back to the user unchanged, or, in the event that the user has fill.NAs a genuine formula as an x argument rather than a data frame, outdf might be whittled down to imputed variables, imputation flag variables and other variables named in that formula. The nature of the dependent variable, including cbind's, +es or whatever, would be conveyed entirely through the terms attribute of outdf, as we never impute LHS variables.

markmfredrickson commented 9 years ago

Perhaps the solution to this issue (and some related issues with trying to pass along strata information) is to return a new object of our own creation (in particular a subclass of forumla).

Looking at the glm function, it will call model.frame on it's arguments. If we passed in a custom object into glm, we could create our own model.frame method that would produce the filled in version that we wanted. Likewise, glm uses the model.response function to get the Y object, which can be either a vector or matrix. This is a function, not a method, but we could see how it is implemented to return the right thing (via the terms attributes on what we return from model.frame).

I'm going to prototype this idea to see if it can handle the simple situation, and then can see if it also works for these edge cases.

markmfredrickson commented 9 years ago

I'm optimistic about this approach. I've pushed up a branch with an initial attempt. So far, so good.

One potential benefit but also possible issue is the way this interacts with functions like predict. On the pro side, since our object gets saved, our methods get called when building the new data frame for prediction. On the con side, with the current implementation, if the new data doesn't have any missing data for a column that was missing in the original data, we get a mismatch on the model matrix size.

It might be possible to take care of the above problem with a special terms object that indicates which columns must be imputed. Right now, there's a basically empty terms function to make sure we get called. This could probably be smarter all around.

benthestatistician commented 9 years ago

This approach looks very very promising. Perhaps it will help us knock out a number of the fill.NAs related issues in our issue queue. In terms of coherence w/ the ordinary modeling workflow, and with R's presumptions about model building and the like, it's a substantial advance on what we have now.

It's also a relatively large change, the kind that could expose coverage gaps in our testing. @josherrickson , could you review changes in the fillNAObject branch with a view to such gaps, coordinating w/ Mark on fixes & additions? @markmfredrickson, depending on what you and Josh identify this might go into the next release or the one after that; happy to have you decide.

benthestatistician commented 9 years ago

Speaking of gaps in testing, in optmatch 0.9-5 the below fails, as discussed in comments to #103:

> glm(fill.NAs(pr~cap+factor(ne), data=nuclearplants), family=binomial)
Error in model.matrix.default(mt, mf, contrasts) : 
  model frame and formula mismatch in model.matrix()

Noting this here in interests of getting this into testing suite.

benthestatistician commented 9 years ago

Another something to include in testing suite: check that sandwich:::estfun.glm generates estimating functions corresponding to generated .NA columns as well as columns present in original data frame, and in general that the new fill.NAs plays nicely with bread and meat utility functions in the sandwich package. (In this regard it might be helpful to set up a new na.action and corresponding naresid method; see definition of sandwich:::estfun.glm.)

benthestatistician commented 7 years ago

Checking in with this after a little hiatus. @markmfredrickson would it seem like a little thing or a big thing to take up this thread from here? (Perhaps the answer is in your comments earlier on this thread, but I couldn't clearly see.) If it's a little thing, then I'd support your trying to update the fillNAObjects branch and applying the fix you note to address the problem you named. The call I made two years ago for a survey of testing gaps shouldn't hold us back any more.

If it seems like more then a little thing to address then this issue and #39 can be tagged "no action", potentially also closed without resolution. (At least, without resolution other than noting the relevant limitations in fill.NAs documentation and/or adding warnings or error messages.) Anyway looking forward to hearing your thoughts, Mark.