Closed benthestatistician closed 8 years ago
Items listed in the issue statement were a part of a longer wishlist for this function. Below are the segments that didn't make it into the issue. Note that the last of them could be added on its own independently of the other two.
strata
factor, or the interaction of all strata
factors, is to be interacted separately with each NA flag variable.Noting here that explanation of subtask 3 discusses the "formula associated with" the return value data frame, but that's oversimplified. If that return value is outdf
, what I'm referring to here is formula(terms(outdf))
, not formula(outdf)
: starting with an np.filled
as in the fill.NAs
examples,
> np.filled <- fill.NAs(np.missing)
> attr(np.filled, "terms") <- terms(pr ~ t1*t2)
> formula(np.filled)
cost ~ pr + t1 + t2 + ne + ct + cum.n + t1.NA + t2.NA + ne.NA +
ct.NA + cum.n.NA
> formula(terms(np.filled))
pr ~ t1 * t2
Our present assumption is that fill.NAs
will spit out a data frame, outdf
, with the property that (formula(outdf)
, outdf
) are a suitable (formula, data) argument pair to be passed to glm
(bayesglm
, etc). This requires that the first column of outdf
encode the dependent variable and that the remaining columns all encode independent variables. I am suggesting that we migrate toward replacing this with an assumption that (formula(terms(outdf))
, outdf
) are a suitable (formula, data) argument pair, Then we get to use the terms
attribute of outdf
to record what we want the end model to look like, and thus be more flexible about what's passed in outdf
.
Fortunately, internally glm
already seems to do something closer to formula(terms())
than formula()
. . Continuing the above,
> glm(np.filled)
Call: glm(formula = np.filled)
Coefficients:
(Intercept) t1 t2 t1:t2
-375.0267 61.1973 8.8602 -0.6514
as opposed to
> glm(formula(np.filled), data=np.filled)
Call: glm(formula = formula(np.filled), data = np.filled)
Coefficients:
(Intercept) pr t1 t2 ne
-161.367 -44.233 25.727 3.894 132.300
ct cum.n t1.NATRUE t2.NATRUE ne.NATRUE
85.651 -1.196 199.761 -84.758 13.292
ct.NATRUE cum.n.NATRUE
-69.368 -82.801
I've implemented everything but comment 2, items 2 and 3. I've tried attaching a terms
attribute the result fill.NAs
, but then I get the following error when trying to build a model frame:
set.seed(20150624)
data.full <- data.frame(z = c(rep(1, 10), rep(0, 10)),
x = rnorm(20),
s = sample(c("A", "B", "C"), size = 20, replace = TRUE),
t = sample(c("UP", "DOWN"), size = 20, replace = TRUE))
data.full$x[c(1, 2, 11)] <- NA
# basic strata handling without NAs
res1 <- fill.NAs(z ~ x + strata(s), data = data.full)
# interactive session:
> glm(res1)
Error in model.matrix.default(mt, mf, contrasts) :
model frame and formula mismatch in model.matrix()
> model.matrix(res1)
Error in eval(expr, envir, enclos) : object 'z' not found
Terms objects have always been something of a black art to me. Any thoughts?
Re terms
problem: I notice that at fill.NAs.R#L22 we're using a formula
of format <response> ~ .
to define the terms object x
. I suspect the
formula inside of a terms
attribute needs to be explicitly spelled out,
rather than with a placeholder -- i.e., no dots as placeholders for
variables on the RHS.
On Thu, Jun 25, 2015 at 4:28 PM, Mark Fredrickson notifications@github.com wrote:
I've implemented everything but comment 2, items 2 and 3. I've tried attaching a terms attribute the result fill.NAs, but then I get the following error when trying to build a model frame:
set.seed(20150624)
data.full <- data.frame(z = c(rep(1, 10), rep(0, 10)), x = rnorm(20), s = sample(c("A", "B", "C"), size = 20, replace = TRUE), t = sample(c("UP", "DOWN"), size = 20, replace = TRUE)) data.full$x[c(1, 2, 11)] <- NA
basic strata handling without NAs
res1 <- fill.NAs(z ~ x + strata(s), data = data.full)
interactive session:
glm(res1) Error in model.matrix.default(mt, mf, contrasts) : model frame and formula mismatch in model.matrix() model.matrix(res1) Error in eval(expr, envir, enclos) : object 'z' not found
Terms objects have always been something of a black art to me. Any thoughts?
— Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-115387851 .
Cross linking to #104, as the solution might be the same: namely, returning a different type of object from fill.NAs
.
If I'm correctly interpreting Mark's progress report of a week ago, this is implemented:
Mean-impute NAs within strata, not across them. (If useful to the EE, add option to mean-impute within a stratum to the treatment group mean; cf GL1605)
...while this isn't:
This formula/"terms" attribute should in turn specify that the strata factor, or the interaction of all strata factors, is to be interacted separately with each NA flag variable.
I think these two features should be pushed to optmatch together. If one's ready but the other isn't, we might put the one that's ready in optmatchExperimental and wait to release them both to optmatch.
I'm afraid we've introduced a bug into version 0.9-5 with our hacking on the terms object, namely that a fill.NAs
usage we feature in the help page fails with factor variables:
> glm(fill.NAs(pr~cap+factor(ne), data=nuclearplants), family=binomial)
Error in model.matrix.default(mt, mf, contrasts) :
model frame and formula mismatch in model.matrix()
the problem being that we simultaneously have
formula( fill.NAs(pr~cap+factor(ne), data=nuclearplants) )
pr ~ cap + factor(ne)
and
colnames( fill.NAs(pr~cap+factor(ne), data=nuclearplants) )
[1] "pr" "cap" "factor(ne)1"
In the immediate term, it would fix the problem either to circumvent the factor expansion in the data frame created by fill.NAs
or to adjust the terms object being attached to that data frame to match it, ie in this case arranging things so that formula( fill.NAs(...) )
returns pr ~ cap + factor(ne)1
. @markmfredrickson will have better instincts about which of these makes more sense, and about the utility of patching this up as opposed to moving straight to the change discussed in comments to #104.
Added a test and addressed the bug in f125294.
It introduced a failing test related to the specials
attribute of the terms of a fill.NA'd object, but otherwise runs the strata bits fine.
Good to have that bug squashed. I'd like to have the specials attribute fixed up too, but it's not urgent. @josherrickson, does the failing test reflect that you worked on it and couldn't get it to go, or just that this was your stopping point (a reasonable one) for the time being?
The latter more than the former. I looked at it for a few minutes, but since it wasn't show-stopping in most cases, I de-prioritized.
Which test is it that was failing for you, @josherrickson ? The only one that I could find is in test.fill.NAs.R
(master branch, at f125294) and it passes for me:
> tt <- terms(res1)
> expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formula
> attr(tt, "specials")
$strata
[1] 3
(My working copy isn't presently passing make test, but that appears to have to do with something in the utility functions testing suite or just afterwards.) Could you cross-check that your test failure wasn't some sort of passing fluke, @josherrickson ?
That's the test that fails for me.
> expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formula
Error: attr(tt, "specials")$strata not equal to 3
target is NULL, current is numeric
> attr(tt, "specials")
$strata
NULL
This fails in make check
, make test
or running make interactive
and then running the test manually.
I just checked with a clean checkout of optmatch and get the same issue.
What does "something in the utility functions testing suite or just afterwards" mean?
My bad - now I'm seeing a failure at the same point. (Still getting accustomed to the devtools-based workflow here.)
On Sat, Dec 19, 2015 at 5:50 AM, Josh Errickson notifications@github.com wrote:
That's the test that fails for me.
expect_equal(attr(tt, "specials")$strata, 3) # variable in 3rd position in formulaError: attr(tt, "specials")$strata not equal to 3target is NULL, current is numeric> attr(tt, "specials")$strataNULL
This fails in make check, make test or running make-interactive and then running the test manually.
I just checked with a clean checkout of optmatch and get the same issue.
What does "something in the utility functions testing suite or just afterwards" mean?
— Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-165976267 .
Removed "bug" label from this issue, as Josh fixed the bug piece w/ f125294. The one remaining open bit, about specials, is more nearly a feature than a bug. (somewhere in between, really -- to be elaborated in a separate comment.)
The test.fill.NAs.R
check that's currently failing is actually pointing to another issue, which would create undesirable and unexpected behavior when fill.NAs
is combined w/ strata
. I'm hoping to hear from @markmfredrickson about it while plotting appropriate remedies.
the present state of affairs in the master branch is that when the user includes a strata
call in a formula sent to fill.NAs
, it's separated out and processed by findStrata
, but never tacked back on to the formula/terms structure that gets associated with the filled out data frame. (It does get pushed through to the filled in data frame, but since it's not a part of the associated formula it'll be left out of PS models that get fit using the fill.NAs
result.) I suspect that this is a simple oversight, but the code immediately following the findStrata
call, which was authored by Mark back in June, seems almost intentionally to leave that out. Mark, any insight here?
In ad3c67f, I was able to make the error go away and (I thought) wrap things up for this issue. But then I noticed that glm
errored out when I gave it the the result of a fill.NAs
call. As I observed it, the problem seemed to be specific to situations where we've got NAs in a variables we call strata()
on. In 802c76c and fdbe1a4, I add tests for this and then comment them out, as I didn't immediately have a fix.
I suspect the remaining problem isn't too hard to solve, but I ran out of time for working on it. I did narrow it down to the call to model.matrix
, and this is reflected in the tests that I added in the aforementioned commits. Once this worked out, and the testing suite elaborated just a little bit to ensure that whatever fill.NAs
spits out is compatible w/ passing to glm
, this issue should be ready for closing.
@josherrickson , did you get a chance to look at the problem I noted on this thread back on Dec 23? Given the date it was posted, I wonder if it slipped between the cracks.
Sorry, it did.
So, to summarize/clarify the issue, there's a mismatch between the data.frame from a call to fill.NAs
,
> names(res3)
[1] "z" "x" "x.NA" "s"
and the associated terms object:
> attr(terms(res3), "variables")
list(z, x, x.NA, strata(s))
model.matrix
assumes that these are the same. I looked at survival's strata
a bit for inspiration; they have a version of model.matrix.coxph
(for example) that just drops the strata
> c <- coxph(Surv(futime, fustat) ~ age + strata(rx), data=ovarian)
> attr(terms(c), "variables")
list(Surv(futime, fustat), age, strata(rx))
> head(model.matrix(c), 3)
age
1 72.3315
2 74.4932
3 66.4658
so there's no help there.
Its pretty easy to adjust one of the two above (e.g. either make fill.NAs
name its column "strata(s)", or revert some of the changes in this issue and make the terms
not identify strata) but I'm not close enough to this issue to be able to think through all the ramifications of either. I suspect the former rather than the latter, but would appreciate further input.
Second comment to not conflate these two thoughts:
I believe @markmfredrickson's work in the branch (discussed in #104) to make fill.NAs
a class would address this issue nicely, as even if it doesn't directly solve it, it'd be easy to write a wrapper function model.matrix.fill.NAs
[sidenote: will that extra '.' cause issues? Will R look for a version of model.matrix.fill
that supports class NAs
? Or is it smart enough about this?] that makes one of the above two adjustments before passing it back to model.matrix.default
.
@markmfredrickson Any indication on the timing of this? I'm hesitant to spend time on this right now if these changes are forthcoming shortly.
There's no particular time line on the objects work for fill.NAs
, but if it makes this simpler, makes we can expedite it.
As for the the issue about using periods: the use of generic.class
functions is just conventional. You still need to indicate which functions are S3 methods in the NAMESPACE file. In other words, feel free to call it what you will.
I'd be interested to hear whether the first of @josherrickson's two easy adjustments (make fill.NAs name its column "strata(s)") breaks existing tests. If not, then I think I'd be happy to with it (while we wait for the work on #104/fillNAObject branch to resume).
Although I think it'd be a little neater on the whole to have model frame columns with names like "s" rather than "strata(s)", but if it doesn't break calls like glm(res4, family=binomial)
then I can live with it.
While it did address the direct issue, changing the name of the returned column to strata(s)
did break the glm.
One approach that does address this is to not rely upon glm
to extract the formula from the data.frame (res3
in the above examples).
For example,
> glm(res3, family=binomial)
Error in model.matrix.default(mt, mf, contrasts) :
model frame and formula mismatch in model.matrix()
> glm(formula(terms(res3)), data=res3, family=binomial)
Call: glm(formula = formula(terms(res3)), family = binomial, data = res3)
Coefficients:
(Intercept) x x.NATRUE strata(s)B strata(s)C
-1.2826 -0.7250 0.9034 0.7290 3.1954
Degrees of Freedom: 16 Total (i.e. Null); 12 Residual
(3 observations deleted due to missingness)
Null Deviance: 23.51
Residual Deviance: 18.8 AIC: 28.8
Unfortunately, there's no way I can see to force this to occur with the glm(data.frame)
call. We'd have to document it. Alternatively again, if fill.NAs was a class, it could be an attribute of this class.
Finally, one way it could be fixed would be to manipulate the result of formula(res3)
. Looking at the code of stats:::formula.data.frame
, I don't see a way (I was hoping something like attr(res3, "formula") <- z ~ x + x.NA + strata(s)
would work). If anyone has a suggestion of how to make this happen, it would fix it.
Thinking about this a bit further, the method of calling glm
that's causing this issue has always seemed a bit odd to me. glm(res3)
is passing a data.frame to glm
. While I realize this is (generally) possible, I'd be much more likely to write something like
> glm(z ~ x + x.NA + strata(s), data=res3, family=binomial)
which runs just fine (and is equivalent to the working code in my previous comment, which is what started me thinking down this route). In fact, in the example of ?fill.NAs
, we demonstrate that the two are identical. Is requiring that the user do this a bad solution? If not generally, at least in situations where clusters are involved?
From my perspective it seems like a fine solution to recommend that users do one of
glm(formula(res3), data=res3, family=binomial)
or
glm(z ~ x + x.NA + strata(s), data=res3, family=binomial)
, while updating documentation appropriately. I have no reservations about that.
@markmfredrickson, do pipe up if you have thoughts on this If memory
serves, it was originally your suggestion to use the shorter, two-argument
call glm(res3, family=binomial)
.
I've updated the documentation & examples.
I'm also happy with the change. The reason for the two argument version was
to allow direct calls like glm(fill.NAs(z ~ x1 + x2, data = data), family = binomial)
. With the current issues, this may not be possible any longer,
but that might be a small price to pay. If the "fill.NAs as objects" branch
moves along, we may be able to recapture the simpler version.
On Thu, Mar 17, 2016 at 10:56 AM, Josh Errickson notifications@github.com wrote:
I've updated the documentation & examples.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/markmfredrickson/optmatch/issues/103#issuecomment-197945562
I think we've all signed off now. @josherrickson, please don't hold off making any remaining changes in order to close this out.
Re @markmfredrickson's comments: Now that (if I'm not mistaken) scores
and match_on.glm
/match_on.bigglm
et al. scan their first argument for NAs, applying fill.NAs()
and then refitting in the event that NAs are found, greasing the wheels for users to invoke fill.NAs
directly feels a little less pressing.
That said, I'm also eager to see what comes of the the fill.NAs as objects -branch, for this and other reasons also.
I believe this issue to be closed then. The call to glm(fill.NA's...
may or may not work, but I changed all documentation/examples that I could find to the new version.
While we're making adjustments to
fill.NAs
(cf #100), let's also enable some special handling ofstrata
terms in the calling argument, to match functionality and semantics we're committing to elsewhere (e.g. #87).NA
s in astrata
term should be interpreted as instructions to leave the designated units out of the match, and passed through. (As with NAs on the LHS are.)strata
factors into a dummy variable representation; pass them through as-is. (I.e., no columns of returned data frame labeled "strata"; instead, if the formula containedstrata(eyecolor)
then it just has aneyecolor
column, and if it'sstrata(taken + passed)
then there are columnstaken
andpassed
.)strata
calls. So that functions applied to the result offill.NAs
will be able to recognize strata terms. In addition, use the "terms" attribute of the returned model frame to flag them as specials, specifically "strata" specials. (Related code, if not necessarily model code, appears in myppse.glm
implementation, at least as of this commit.)