markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

`scores` should check its "object" argument for non-NULL "subset" specifications & overwrite them #73

Closed benthestatistician closed 9 years ago

benthestatistician commented 11 years ago

As of this writing, inst/tests/test.scores.R has

pgscore <- lm(cost~cap + interaction(ct,bw), weights=t1, 
subset=(pr==0), data=nuclearplants)  
# This doesn't work without the newdata argument.
  expect_error(l5 <- lm(pr ~ cap + scores(pgscore), data=nuclearplants))

This is totally OK, but I think the ideally desirable behavior here would be to have scores soldier through without complaint, stripping off the

subset=(pr==0)

part of the call generating pgscore before updating it.

No rush about implementing this. Rather, I'd be interested to hear comments about potential side effects.

benthestatistician commented 10 years ago

@josherrickson, you've been looking at scores recently. Can you think of any unwanted side-effects that would be introduced by automatically ignoring a subset argument that had been used in fitting the model passed to scores? I can't; the whole point of scores is to generate predictions for all and only the observations represented in whatever model frame is being used to evaluate the formula it appears in. If you can't either, then we can switch over to thinking about how to address this issue (when resources permit).

josherrickson commented 10 years ago

So the issue here actually isn't the subsetting specifically. It's that pr in subset and t1 in weights are not included in pgscore$formula, and so are not included in the data.frame created inside scores (which is built off the variables inside the formula). E.g., the following works (albeit foolishly obviously)

> data(nuclearplants)
> nuclearplants[1,1] <- NA
> 
> pgscore <- lm(cost~cap + interaction(ct,bw), weights=t1, subset=(pr==0), data=nuclearplants)
> mod <- lm(pr ~ cap + scores(pgscore), data=nuclearplants)
Error in eval(expr, envir, enclos) : object 't1' not found
> 
> pgscore2 <- lm(cost~cap + interaction(ct,bw), weights=cap, subset=(bw==0), data=nuclearplants)
> mod2 <- lm(pr ~ cap + scores(pgscore2), data=nuclearplants)
Warning message:
In predict.lm(newobj, newdata = alldata, ...) :
  prediction from a rank-deficient fit may be misleading
> length(mod2$fitted.values)
[1] 32

(Note that if there's no missingness, scores passes to predict and both mod and mod2 work.)

I don't think special casing subset and weights is necessarily the best way around this, as there are other arguments that could take variables not in the formula, such as offset, and that doesn't even include more advanced modelling calls.

I'm looking into a way to extract all variables in a model call, but have been unsuccessful so far.

benthestatistician commented 10 years ago

This is helpful, J: now I see what was behind the unit test, which I think I had somewhat misunderstood. Now that I understand it, though, I think the test is concerned with things we don't need to worry about.

weights are used in the fitting of models only; predict methods shouldn't have need for them. In other contexts subset would have a different status, but in this context I think that it too can be regarded as something that pertains only to model fitting. Given that the user has invoked scores, coefficients from model he has invoked it on should be applied to generate a score for each of the observations in data, irrespective of which subsets may have been used in fitting. So we can do without both of these for the purpose of applying scores.

Then there's the issue of offset, which I hadn't been thinking about and which does seem to call for being passed along to predict. (Thanks for pointing this out; it's just the sort of oversight I was hoping you'd alert me to.) The help pages for model.frame and model.offset have promising looking clues about how we might make sure to collate offsets along with the rest of the newdata object we create.

benthestatistician commented 9 years ago

As @josherrickson and I have been discussing related issues, I'd like to propose a simple resolution to this issue. if it looks OK to you, Josh, please implement and then close out the issue.

The resolution: if scores craps out after getting into its branch concerned with refitting after imputing NAs, then we throw an informative error intended to help the user circumvent the limitations of get_all_vars in our context. Specifically, we throw an error message suggesting to the user that she or he take care of the NAs in the data used to fit object before invoking score. Something like "I found NAs in model.frame( deparse substitute blah blah) but was unable to refit the model after imputing them. Try doing this manually, using fill.NAs() or another imputation routine, or restrict fit of deparse substitute blah blah to complete cases." (Here deparse substitute blah blah" is a gesture to code to grab the user's name of the object fed to scores.)

josherrickson commented 9 years ago

Implemented in a4f2463.

Passing to Ben, please close or send back to me with updates to the actual error message produced.

benthestatistician commented 9 years ago

Two comments:

  1. Could we arrange to have the error message reference the object argument in whatever way the user is referencing it? Users will appreciate this. That's what I meant to gesture to with "deparse substitute blah blah". One example of that, somewhat more complex than is seen here, is @nullsatz's function missing_x_msg, in optmatch/R/utliities.R.
  2. For the text in the stop, how about the following, or similar:

"Unable to address missingness in $OBJECT on the fly.\nTry dealing with NAs before the call to scores(), perhaps using fill.NAs()."

Thanks for the help here, @josherrickson.

josherrickson commented 9 years ago

Done.

> pgscore <- lm(cost~cap + interaction(ct,bw), weights=t1, subset=(pr==0), data=nuclearplants)
> lm(pr ~ cap + scores(pgscore), data=nuclearplants)
Error in scores(pgscore) :
  Unable to address missingness in pgscore on the fly.
Try dealing with NAs before the call to scores(), perhaps using fill.NAs().
benthestatistician commented 9 years ago

Terrific, thanks!