markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

update `fill.NAs` semantics re `scores`, NA handling #103

Closed benthestatistician closed 8 years ago

benthestatistician commented 9 years ago

While we're making adjustments to fill.NAs (cf #100), let's also enable some special handling of strata terms in the calling argument, to match functionality and semantics we're committing to elsewhere (e.g. #87).

  1. In contrast to other terms on the RHS of the formula arg, NAs in a strata term should be interpreted as instructions to leave the designated units out of the match, and passed through. (As with NAs on the LHS are.)
  2. Therefore no need to expand strata factors into a dummy variable representation; pass them through as-is. (I.e., no columns of returned data frame labeled "strata"; instead, if the formula contained strata(eyecolor) then it just has an eyecolor column, and if it's strata(taken + passed) then there are columns taken and passed.)
  3. In the formula associated with that data frame, ie its "terms" attribute, do retain strata calls. So that functions applied to the result of fill.NAs will be able to recognize strata terms. In addition, use the "terms" attribute of the returned model frame to flag them as specials, specifically "strata" specials. (Related code, if not necessarily model code, appears in my ppse.glm implementation, at least as of this commit.)
benthestatistician commented 9 years ago

Items listed in the issue statement were a part of a longer wishlist for this function. Below are the segments that didn't make it into the issue. Note that the last of them could be added on its own independently of the other two.

  1. Mean-impute NAs within strata, not across them. (If useful to the EE, add option to mean-impute within a stratum to the treatment group mean; cf GL1605)
  2. This formula/"terms" attribute should in turn specify that strata factor, or the interaction of all strata factors, is to be interacted separately with each NA flag variable.
  3. While we're at it, add a bit of logic to the production of NA flagging variables in order to flag and remove NA flag columns that are duplicates of previous NA flag columns.
benthestatistician commented 9 years ago

Noting here that explanation of subtask 3 discusses the "formula associated with" the return value data frame, but that's oversimplified. If that return value is outdf, what I'm referring to here is formula(terms(outdf)), not formula(outdf): starting with an np.filled as in the fill.NAs examples,

> np.filled <- fill.NAs(np.missing)
> attr(np.filled, "terms") <- terms(pr ~ t1*t2)
> formula(np.filled)
cost ~ pr + t1 + t2 + ne + ct + cum.n + t1.NA + t2.NA + ne.NA + 
    ct.NA + cum.n.NA
> formula(terms(np.filled))
pr ~ t1 * t2

Our present assumption is that fill.NAs will spit out a data frame, outdf, with the property that (formula(outdf), outdf) are a suitable (formula, data) argument pair to be passed to glm (bayesglm, etc). This requires that the first column of outdf encode the dependent variable and that the remaining columns all encode independent variables. I am suggesting that we migrate toward replacing this with an assumption that (formula(terms(outdf)), outdf) are a suitable (formula, data) argument pair, Then we get to use the terms attribute of outdf to record what we want the end model to look like, and thus be more flexible about what's passed in outdf.

Fortunately, internally glm already seems to do something closer to formula(terms()) than formula(). . Continuing the above,

> glm(np.filled)

Call:  glm(formula = np.filled)

Coefficients:
(Intercept)           t1           t2        t1:t2  
  -375.0267      61.1973       8.8602      -0.6514  

as opposed to

> glm(formula(np.filled), data=np.filled)

Call:  glm(formula = formula(np.filled), data = np.filled)

Coefficients:
 (Intercept)            pr            t1            t2            ne  
    -161.367       -44.233        25.727         3.894       132.300  
          ct         cum.n     t1.NATRUE     t2.NATRUE     ne.NATRUE  
      85.651        -1.196       199.761       -84.758        13.292  
   ct.NATRUE  cum.n.NATRUE  
     -69.368       -82.801 
markmfredrickson commented 9 years ago

I've implemented everything but comment 2, items 2 and 3. I've tried attaching a terms attribute the result fill.NAs, but then I get the following error when trying to build a model frame:

  set.seed(20150624)

  data.full <- data.frame(z = c(rep(1, 10), rep(0, 10)),
                          x = rnorm(20),
                          s = sample(c("A", "B", "C"), size = 20, replace = TRUE),
                          t = sample(c("UP", "DOWN"), size = 20, replace = TRUE))
  data.full$x[c(1, 2, 11)] <- NA

  # basic strata handling without NAs
  res1 <- fill.NAs(z ~ x + strata(s), data = data.full) 

# interactive session:
> glm(res1)
Error in model.matrix.default(mt, mf, contrasts) : 
  model frame and formula mismatch in model.matrix()
> model.matrix(res1)
Error in eval(expr, envir, enclos) : object 'z' not found

Terms objects have always been something of a black art to me. Any thoughts?

benthestatistician commented 9 years ago

Re terms problem: I notice that at fill.NAs.R#L22 we're using a formula of format <response> ~ . to define the terms object x. I suspect the formula inside of a terms attribute needs to be explicitly spelled out, rather than with a placeholder -- i.e., no dots as placeholders for variables on the RHS.

On Thu, Jun 25, 2015 at 4:28 PM, Mark Fredrickson notifications@github.com wrote:

I've implemented everything but comment 2, items 2 and 3. I've tried attaching a terms attribute the result fill.NAs, but then I get the following error when trying to build a model frame:

set.seed(20150624)

data.full <- data.frame(z = c(rep(1, 10), rep(0, 10)), x = rnorm(20), s = sample(c("A", "B", "C"), size = 20, replace = TRUE), t = sample(c("UP", "DOWN"), size = 20, replace = TRUE)) data.full$x[c(1, 2, 11)] <- NA

basic strata handling without NAs

res1 <- fill.NAs(z ~ x + strata(s), data = data.full)

interactive session:

glm(res1) Error in model.matrix.default(mt, mf, contrasts) : model frame and formula mismatch in model.matrix() model.matrix(res1) Error in eval(expr, envir, enclos) : object 'z' not found

Terms objects have always been something of a black art to me. Any thoughts?

— Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-115387851 .

markmfredrickson commented 9 years ago

Cross linking to #104, as the solution might be the same: namely, returning a different type of object from fill.NAs.

benthestatistician commented 9 years ago

If I'm correctly interpreting Mark's progress report of a week ago, this is implemented:

Mean-impute NAs within strata, not across them. (If useful to the EE, add option to mean-impute within a stratum to the treatment group mean; cf GL1605)

...while this isn't:

This formula/"terms" attribute should in turn specify that the strata factor, or the interaction of all strata factors, is to be interacted separately with each NA flag variable.

I think these two features should be pushed to optmatch together. If one's ready but the other isn't, we might put the one that's ready in optmatchExperimental and wait to release them both to optmatch.

benthestatistician commented 9 years ago

I'm afraid we've introduced a bug into version 0.9-5 with our hacking on the terms object, namely that a fill.NAs usage we feature in the help page fails with factor variables:

> glm(fill.NAs(pr~cap+factor(ne), data=nuclearplants), family=binomial)
Error in model.matrix.default(mt, mf, contrasts) : 
  model frame and formula mismatch in model.matrix()

the problem being that we simultaneously have

formula( fill.NAs(pr~cap+factor(ne), data=nuclearplants) )
pr ~ cap + factor(ne)

and

colnames( fill.NAs(pr~cap+factor(ne), data=nuclearplants) )
[1] "pr"          "cap"         "factor(ne)1"

In the immediate term, it would fix the problem either to circumvent the factor expansion in the data frame created by fill.NAs or to adjust the terms object being attached to that data frame to match it, ie in this case arranging things so that formula( fill.NAs(...) ) returns pr ~ cap + factor(ne)1. @markmfredrickson will have better instincts about which of these makes more sense, and about the utility of patching this up as opposed to moving straight to the change discussed in comments to #104.

josherrickson commented 8 years ago

Added a test and addressed the bug in f125294.

It introduced a failing test related to the specials attribute of the terms of a fill.NA'd object, but otherwise runs the strata bits fine.

benthestatistician commented 8 years ago

Good to have that bug squashed. I'd like to have the specials attribute fixed up too, but it's not urgent. @josherrickson, does the failing test reflect that you worked on it and couldn't get it to go, or just that this was your stopping point (a reasonable one) for the time being?

josherrickson commented 8 years ago

The latter more than the former. I looked at it for a few minutes, but since it wasn't show-stopping in most cases, I de-prioritized.

benthestatistician commented 8 years ago

Which test is it that was failing for you, @josherrickson ? The only one that I could find is in test.fill.NAs.R (master branch, at f125294) and it passes for me:

>   tt <- terms(res1)
>   expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formula
> attr(tt, "specials")
$strata
[1] 3

(My working copy isn't presently passing make test, but that appears to have to do with something in the utility functions testing suite or just afterwards.) Could you cross-check that your test failure wasn't some sort of passing fluke, @josherrickson ?

josherrickson commented 8 years ago

That's the test that fails for me.

>   expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formula
Error: attr(tt, "specials")$strata not equal to 3
target is NULL, current is numeric
> attr(tt, "specials")
$strata
NULL

This fails in make check, make test or running make interactive and then running the test manually.

I just checked with a clean checkout of optmatch and get the same issue.

What does "something in the utility functions testing suite or just afterwards" mean?

benthestatistician commented 8 years ago

My bad - now I'm seeing a failure at the same point. (Still getting accustomed to the devtools-based workflow here.)

On Sat, Dec 19, 2015 at 5:50 AM, Josh Errickson notifications@github.com wrote:

That's the test that fails for me.

expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formulaError: attr(tt, "specials")$strata not equal to 3target is NULL, current is numeric> attr(tt, "specials")$strataNULL

This fails in make check, make test or running make-interactive and then running the test manually.

I just checked with a clean checkout of optmatch and get the same issue.

What does "something in the utility functions testing suite or just afterwards" mean?

— Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-165976267 .

benthestatistician commented 8 years ago

Removed "bug" label from this issue, as Josh fixed the bug piece w/ f125294. The one remaining open bit, about specials, is more nearly a feature than a bug. (somewhere in between, really -- to be elaborated in a separate comment.)

benthestatistician commented 8 years ago

The test.fill.NAs.R check that's currently failing is actually pointing to another issue, which would create undesirable and unexpected behavior when fill.NAs is combined w/ strata. I'm hoping to hear from @markmfredrickson about it while plotting appropriate remedies.

the present state of affairs in the master branch is that when the user includes a strata call in a formula sent to fill.NAs, it's separated out and processed by findStrata, but never tacked back on to the formula/terms structure that gets associated with the filled out data frame. (It does get pushed through to the filled in data frame, but since it's not a part of the associated formula it'll be left out of PS models that get fit using the fill.NAs result.) I suspect that this is a simple oversight, but the code immediately following the findStrata call, which was authored by Mark back in June, seems almost intentionally to leave that out. Mark, any insight here?

benthestatistician commented 8 years ago

In ad3c67f, I was able to make the error go away and (I thought) wrap things up for this issue. But then I noticed that glm errored out when I gave it the the result of a fill.NAs call. As I observed it, the problem seemed to be specific to situations where we've got NAs in a variables we call strata() on. In 802c76c and fdbe1a4, I add tests for this and then comment them out, as I didn't immediately have a fix.

I suspect the remaining problem isn't too hard to solve, but I ran out of time for working on it. I did narrow it down to the call to model.matrix, and this is reflected in the tests that I added in the aforementioned commits. Once this worked out, and the testing suite elaborated just a little bit to ensure that whatever fill.NAs spits out is compatible w/ passing to glm, this issue should be ready for closing.

benthestatistician commented 8 years ago

@josherrickson , did you get a chance to look at the problem I noted on this thread back on Dec 23? Given the date it was posted, I wonder if it slipped between the cracks.

josherrickson commented 8 years ago

Sorry, it did.

So, to summarize/clarify the issue, there's a mismatch between the data.frame from a call to fill.NAs,

> names(res3)
[1] "z"    "x"    "x.NA" "s"

and the associated terms object:

> attr(terms(res3), "variables")
list(z, x, x.NA, strata(s))

model.matrix assumes that these are the same. I looked at survival's strata a bit for inspiration; they have a version of model.matrix.coxph (for example) that just drops the strata

> c <- coxph(Surv(futime, fustat) ~ age + strata(rx), data=ovarian) 
> attr(terms(c), "variables")
list(Surv(futime, fustat), age, strata(rx))
> head(model.matrix(c), 3)
      age
1 72.3315
2 74.4932
3 66.4658

so there's no help there.

Its pretty easy to adjust one of the two above (e.g. either make fill.NAs name its column "strata(s)", or revert some of the changes in this issue and make the terms not identify strata) but I'm not close enough to this issue to be able to think through all the ramifications of either. I suspect the former rather than the latter, but would appreciate further input.

josherrickson commented 8 years ago

Second comment to not conflate these two thoughts:

I believe @markmfredrickson's work in the branch (discussed in #104) to make fill.NAs a class would address this issue nicely, as even if it doesn't directly solve it, it'd be easy to write a wrapper function model.matrix.fill.NAs [sidenote: will that extra '.' cause issues? Will R look for a version of model.matrix.fill that supports class NAs? Or is it smart enough about this?] that makes one of the above two adjustments before passing it back to model.matrix.default.

@markmfredrickson Any indication on the timing of this? I'm hesitant to spend time on this right now if these changes are forthcoming shortly.

markmfredrickson commented 8 years ago

There's no particular time line on the objects work for fill.NAs, but if it makes this simpler, makes we can expedite it.

As for the the issue about using periods: the use of generic.class functions is just conventional. You still need to indicate which functions are S3 methods in the NAMESPACE file. In other words, feel free to call it what you will.

benthestatistician commented 8 years ago

I'd be interested to hear whether the first of @josherrickson's two easy adjustments (make fill.NAs name its column "strata(s)") breaks existing tests. If not, then I think I'd be happy to with it (while we wait for the work on #104/fillNAObject branch to resume).

Although I think it'd be a little neater on the whole to have model frame columns with names like "s" rather than "strata(s)", but if it doesn't break calls like glm(res4, family=binomial) then I can live with it.

josherrickson commented 8 years ago

While it did address the direct issue, changing the name of the returned column to strata(s) did break the glm.

One approach that does address this is to not rely upon glm to extract the formula from the data.frame (res3 in the above examples).

For example,

> glm(res3, family=binomial)
Error in model.matrix.default(mt, mf, contrasts) :
  model frame and formula mismatch in model.matrix()
> glm(formula(terms(res3)), data=res3, family=binomial)

Call:  glm(formula = formula(terms(res3)), family = binomial, data = res3)

Coefficients:
(Intercept)            x     x.NATRUE   strata(s)B   strata(s)C
    -1.2826      -0.7250       0.9034       0.7290       3.1954

Degrees of Freedom: 16 Total (i.e. Null);  12 Residual
  (3 observations deleted due to missingness)
Null Deviance:      23.51
Residual Deviance: 18.8   AIC: 28.8

Unfortunately, there's no way I can see to force this to occur with the glm(data.frame) call. We'd have to document it. Alternatively again, if fill.NAs was a class, it could be an attribute of this class.

Finally, one way it could be fixed would be to manipulate the result of formula(res3). Looking at the code of stats:::formula.data.frame, I don't see a way (I was hoping something like attr(res3, "formula") <- z ~ x + x.NA + strata(s) would work). If anyone has a suggestion of how to make this happen, it would fix it.

josherrickson commented 8 years ago

Thinking about this a bit further, the method of calling glm that's causing this issue has always seemed a bit odd to me. glm(res3) is passing a data.frame to glm. While I realize this is (generally) possible, I'd be much more likely to write something like

> glm(z ~ x + x.NA + strata(s), data=res3, family=binomial)

which runs just fine (and is equivalent to the working code in my previous comment, which is what started me thinking down this route). In fact, in the example of ?fill.NAs, we demonstrate that the two are identical. Is requiring that the user do this a bad solution? If not generally, at least in situations where clusters are involved?

benthestatistician commented 8 years ago

From my perspective it seems like a fine solution to recommend that users do one of

glm(formula(res3), data=res3, family=binomial)

or

glm(z ~ x + x.NA + strata(s), data=res3, family=binomial)

, while updating documentation appropriately. I have no reservations about that.

@markmfredrickson, do pipe up if you have thoughts on this If memory serves, it was originally your suggestion to use the shorter, two-argument call glm(res3, family=binomial).

josherrickson commented 8 years ago

I've updated the documentation & examples.

markmfredrickson commented 8 years ago

I'm also happy with the change. The reason for the two argument version was to allow direct calls like glm(fill.NAs(z ~ x1 + x2, data = data), family = binomial). With the current issues, this may not be possible any longer, but that might be a small price to pay. If the "fill.NAs as objects" branch moves along, we may be able to recapture the simpler version.

On Thu, Mar 17, 2016 at 10:56 AM, Josh Errickson notifications@github.com wrote:

I've updated the documentation & examples.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-197945562

benthestatistician commented 8 years ago

I think we've all signed off now. @josherrickson, please don't hold off making any remaining changes in order to close this out.

benthestatistician commented 8 years ago

Re @markmfredrickson's comments: Now that (if I'm not mistaken) scores and match_on.glm/match_on.bigglm et al. scan their first argument for NAs, applying fill.NAs() and then refitting in the event that NAs are found, greasing the wheels for users to invoke fill.NAs directly feels a little less pressing.

That said, I'm also eager to see what comes of the the fill.NAs as objects -branch, for this and other reasons also.

josherrickson commented 8 years ago

I believe this issue to be closed then. The call to glm(fill.NA's... may or may not work, but I changed all documentation/examples that I could find to the new version.