In `match_on.glm` interpret use of `strata` as call for exact matching

benthestatistician commented 9 years ago

In glm formulas one can wrap a nominal independent variable in strata, without changing the result:

> aglm2 <- glm(pr~.-cost-ne+strata(ne), data=nuclearplants, family=binomial)
> aglm <- glm(pr~.-cost, data=nuclearplants, family=binomial)
> all.equal(predict(aglm), predict(aglm2))
[1] TRUE

As I understand it, the purpose of the strata command is purely to leave signals for methods and functions that will operate on the fitted glm. Let's use it as a signal that the user wanted her propensity score matching to incorporate exact matching on the argument of strata.

(Taken together with the feature requested in #86, this would relieve nearly all end users of having to call exactMatch directly, or having to muck around directly with ISMs, in order to subdivide matching problems.)

josherrickson commented 9 years ago

strata isn't a function in base R. I see a version of it in the survival package, is that the one you're referring to?

benthestatistician commented 9 years ago

Yep, that's the one I meant. As the survival package is recommended, I'd think we could assume in general that it's installed, if not necessarily loaded.

josherrickson commented 9 years ago

This has been implemented. See test.fullmatch.R#L409 for an example.

benthestatistician commented 9 years ago

This is great stuff! But if I'm reading the example correctly, it demonstrates calls to strata within match_on's formula method, not the glm. I.e., it solves the next problem I was going to ask someone to work on ;-)

It may be that I'm just missing the test for uses of strata within a PS model, in which case please direct me to it...

josherrickson commented 9 years ago

My mistake, I thought the gesture towards glm above was only in how it implemented strata.

I'm not sure I follow how this should work in this case. If the user doesn't load the survival package, glm won't take a strata argument and I don't think we want to get into overloading glm. If we do add a dependency then the model that is fit is inconsistent.

E.g., if I wanted to exactMatch on ne and my propensity model is pr ~ t1, then the following two lines should be identical

match_on(glm(pr ~ t1, data=nuclearplants), within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants)
match_on(glm(pr ~ t1 + strata(ne), data=nuclearplants), data=nuclearplants)

However, this will NOT be the case, as glm(pr ~ t1 + strata(ne)) is equivalent to glm(pr ~ t1 + ne), and NOT to glm(pr ~ t1).

> mod1 <- glm(pr ~ t1 + strata(ne), data=nuclearplants)
> mod2 <- glm(pr ~ t1 + ne, data=nuclearplants)
> mod3 <- glm(pr ~ t1, data=nuclearplants)
> identical(predict(mod1), predict(mod2))
[1] TRUE
> identical(predict(mod1), predict(mod3))
[1] FALSE

benthestatistician commented 9 years ago

@josherrickson, I think we should be OK with the non-equivalence in your example. It's a best practice, if not necessarily a well-documented one, to add fixed effects to the PS model for whatever variables you're going to exact match on. (You tend to wind up with better balance that way.) The user really ought to be fitting mod1 or mod2 if she intends eventially to stratify on ne.

I also agree about not overloading glm. As the survival package is a part of the base R distribution, I can't think of much downside in adding a dependency to it. (Well, I can see a cost: we ought to add tests into our package that strata is behaving as we expect it to, so that we'll know if the survival people decide to change it.)

For ease of reference, and in case it's helpful, my example of pulling out the strata terms from a glm formula is here.

josherrickson commented 9 years ago

So, in match_on.glm, these two should be equivalent:

match_on(glm(pr ~ t1 + ne, data=nuclearplants), within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants)
match_on(glm(pr ~ t1 + strata(ne), data=nuclearplants), data=nuclearplants)

Should the same hold true in match_on.formula? E.g., for the following:

match_on(pr ~ t1 + strata(ne), data=nuclearplants)

Which of these should it be equivalent to?

mod_form1 <- match_on(pr ~ t1, within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants)
mod_form2 <- match_on(pr ~ t1 + ne, within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants)

I believe it should be mod_form1 since this isn't a PS match.

benthestatistician commented 9 years ago

I agree: in this instance and others like it the mod_form1 interpretation is best.

On Wed, Jun 17, 2015 at 9:36 AM, josherrickson notifications@github.com wrote:

So, in match_on.glm, these two should be equivalent:

match_on(glm(pr ~ t1 + ne, data=nuclearplants), within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants) match_on(glm(pr ~ t1 + strata(ne), data=nuclearplants), data=nuclearplants)

Should the same hold true in match_on.formula? E.g., for the following:

match_on(pr ~ t1 + strata(ne), data=nuclearplants)

Which of these should it be equivalent to?

mod_form1 <- match_on(pr ~ t1, within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants) mod_form2 <- match_on(pr ~ t1 + ne, within=exactMatch(pr ~ ne, data=nuclearplants), data=nuclearplants)

I believe it should be mod_form1 since this isn't a PS match.

— Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/101#issuecomment-112801562 .

josherrickson commented 9 years ago

@benthestatistician Take a look at test.fullmatch.R#L442.

Sorry this took so long; this actual change was fairly trivial given what I had already for match_on.formula, but it exposed a bug in scores (it wasn't properly handling a*b terms) that took a while to overcome - but on the plus-side, I came up with a greatly simplified version of scores.

benthestatistician commented 9 years ago

Nice work! I added a minor change to the error handling when model.frame with na.action=na.pass doesn't work, intended to avoid a spurious warning in the event that model.frame won't go even without the special na.action argument. Also reworded that warning a little. Change appears not to break anything, but feel free to take a look [master a5f20a7].

In the meantime I'll go ahead and close this out; reopen if needed.

markmfredrickson commented 9 years ago

I'm getting an error on the master branch on the tests Josh links to above. Reopening.

markmfredrickson commented 9 years ago

Oops. My mistake. I refactored some code that caused this error. Fixed.

markmfredrickson / optmatch

In `match_on.glm` interpret use of `strata` as call for exact matching #101