@josherrickson, you've been looking at `scores` recently. Can you think of any unwanted side-effects that would be introduced by automatically ignoring a `subset` argument that had been used in fitting the model passed to `scores`? I can't; the whole point of `scores` is to generate predictions for all and only the observations represented in whatever model frame is being used to evaluate the formula it appears in. If you can't either, then we can switch over to thinking about how to address this issue (when resources permit).
So the issue here actually isn't the subsetting specifically. It's that `pr` in `subset` and `t1` in `weights` are not included in `pgscore$formula`, and so are not included in the data.frame created inside `scores` (which is built off the variables inside the formula). E.g., the first call below fails, while the second (albeit obviously foolish, since its `weights` and `subset` use only variables already in the formula) works:
```
> data(nuclearplants)
> nuclearplants[1,1] <- NA
>
> pgscore <- lm(cost~cap + interaction(ct,bw), weights=t1, subset=(pr==0), data=nuclearplants)
> mod <- lm(pr ~ cap + scores(pgscore), data=nuclearplants)
Error in eval(expr, envir, enclos) : object 't1' not found
>
> pgscore2 <- lm(cost~cap + interaction(ct,bw), weights=cap, subset=(bw==0), data=nuclearplants)
> mod2 <- lm(pr ~ cap + scores(pgscore2), data=nuclearplants)
Warning message:
In predict.lm(newobj, newdata = alldata, ...) :
  prediction from a rank-deficient fit may be misleading
> length(mod2$fitted.values)
[1] 32
```
(Note that if there's no missingness, `scores` passes through to `predict` and both `mod` and `mod2` work.)
I don't think special-casing `subset` and `weights` is necessarily the best way around this, as there are other arguments that could take variables not in the formula, such as `offset`, and that doesn't even include more advanced modelling calls. I'm looking into a way to extract all variables in a model call, but have been unsuccessful so far.
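One rough possibility, sketched here as an editorial illustration rather than package code (`vars_in_call` is a hypothetical helper): walk the pieces of the fitted model's call that can reference data variables and collect their names with `all.vars`.

```r
## Hypothetical helper: collect every variable name referenced in the
## formula, weights, subset, or offset pieces of a fitted model's call.
## Caveat: if the formula was passed in as a variable, all.vars() sees
## only that variable's name, not the formula's contents.
vars_in_call <- function(object) {
  parts <- as.list(getCall(object))[c("formula", "weights", "subset", "offset")]
  parts <- parts[!vapply(parts, is.null, logical(1))]  # drop absent pieces
  unique(unlist(lapply(parts, all.vars)))
}
```

With the `pgscore` above, this would return `cost`, `cap`, `ct`, `bw`, `t1`, and `pr`.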
This is helpful, J: now I see what was behind the unit test, which I think I had somewhat misunderstood. Now that I understand it, though, I think the test is concerned with things we don't need to worry about. `weights` are used in the fitting of models only; `predict` methods shouldn't have need for them. In other contexts `subset` would have a different status, but in this context I think that it too can be regarded as something that pertains only to model fitting. Given that the user has invoked `scores`, coefficients from the model he has invoked it on should be applied to generate a score for each of the observations in `data`, irrespective of which subsets may have been used in fitting. So we can do without both of these for the purpose of applying `scores`.
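For instance (an illustration of my own, not from the thread; it assumes the `date` column of `nuclearplants`), point predictions from a fit that used `weights` and `subset` need neither argument:

```r
## The fitted coefficients are applied to every row of newdata; the weights
## and subset used in fitting play no role in generating the predictions.
fit <- lm(cost ~ cap + date, weights = t1, subset = (pr == 0),
          data = nuclearplants)
predict(fit, newdata = nuclearplants)  # one score per plant, all 32 rows
```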
Then there's the issue of `offset`, which I hadn't been thinking about and which does seem to call for being passed along to `predict`. (Thanks for pointing this out; it's just the sort of oversight I was hoping you'd alert me to.) The help pages for `model.frame` and `model.offset` have promising-looking clues about how we might make sure to collate offsets along with the rest of the `newdata` object we create.
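A loose sketch of that idea (how it would eventually slot into `scores` is an open question):

```r
## model.frame() re-evaluates the fitted model's frame; model.offset()
## extracts the "(offset)" column from it, returning NULL when the model
## was fit without one, so the extraction is always safe to attempt.
mf  <- model.frame(pgscore)
off <- model.offset(mf)
## `off` (or its analogue computed on newdata) could then be collated into
## the newdata object and passed along to predict().
```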
As @josherrickson and I have been discussing related issues, I'd like to propose a simple resolution to this issue. If it looks OK to you, Josh, please implement and then close out the issue.

The resolution: if `scores` craps out after getting into its branch concerned with refitting after imputing NAs, then we throw an informative error intended to help the user circumvent the limitations of `get_all_vars` in our context. Specifically, we throw an error message suggesting that the user take care of the NAs in the data used to fit `object` before invoking `scores`. Something like: "I found NAs in model.frame(<deparse substitute blah blah>) but was unable to refit the model after imputing them. Try doing this manually, using fill.NAs() or another imputation routine, or restrict the fit of <deparse substitute blah blah> to complete cases." (Here "<deparse substitute blah blah>" is a gesture to code that grabs the user's name for the object fed to `scores`.)
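The gesture is to the standard `deparse(substitute(...))` idiom; a minimal sketch (hypothetical helper, not the package's actual code):

```r
## Capture the caller's own name for `object` so the error can echo it back.
na_refit_error <- function(object) {
  objname <- deparse(substitute(object))
  stop("I found NAs in model.frame(", objname, ") but was unable to refit ",
       "the model after imputing them. Try doing this manually, using ",
       "fill.NAs() or another imputation routine, or restrict the fit of ",
       objname, " to complete cases.")
}
## e.g. na_refit_error(pgscore) mentions "pgscore" by name.
```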
Implemented in a4f2463.
Passing to Ben: please close, or send back to me with updates to the actual error message produced.
Two comments:

1. Can the error message refer to the `object` argument in whatever way the user is referencing it? Users will appreciate this. That's what I meant to gesture to with "deparse substitute blah blah". One example of that, somewhat more complex than is seen here, is @nullsatz's function `missing_x_msg`, in optmatch/R/utilities.R.
2. For the message passed to `stop`, how about the following, or similar: "Unable to address missingness in $OBJECT on the fly.\nTry dealing with NAs before the call to scores(), perhaps using fill.NAs()."
Thanks for the help here, @josherrickson.
Done.

```
> pgscore <- lm(cost~cap + interaction(ct,bw), weights=t1, subset=(pr==0), data=nuclearplants)
> lm(pr ~ cap + scores(pgscore), data=nuclearplants)
Error in scores(pgscore) :
  Unable to address missingness in pgscore on the fly.
  Try dealing with NAs before the call to scores(), perhaps using fill.NAs().
```
Terrific, thanks!
As of this writing, inst/tests/test.scores.R has:

This is totally OK, but I think the ideally desirable behavior here would be to have `scores` soldier through without complaint, stripping off the part of the call generating `pgscore` before updating it.

No rush about implementing this. Rather, I'd be interested to hear comments about potential side effects.
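For concreteness, the stripping might look something like this sketch (my assumption about the mechanics, not implemented code):

```r
## Remove the call components that reference variables outside the formula,
## then re-evaluate the call to obtain a fit usable with the full data.
cl <- getCall(pgscore)
cl$weights <- NULL   # deleting a call component drops the argument entirely
cl$subset  <- NULL
pgscore_stripped <- eval(cl)
```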