markmfredrickson / RItools

Randomization inference tools for R
GNU General Public License v2.0

store Design info w/ `ipr` calcs #83

Closed benthestatistician closed 2 years ago

benthestatistician commented 7 years ago

The ipr function is intended to be invoked on the fly within a call to a model-fitting function:

lm0 = lm(y ~ z, weights=ipr(z, strata, clus), data=dat)

At present it just returns a vector of weights. Could we arrange things so that the study Design info generated along the way would be saved in the fitted model object, here lm0, and be available for RItools functions to work with as needed, while otherwise staying out of people's way?

benthestatistician commented 7 years ago

A way to do something like what I have in mind would be to attach the Design object to the generated weights, as an attribute. Two concerns about that specific solution, however:

  1. If somebody tried to look at the weights, or a subset of them, I suspect the Design attribute would wind up getting printed to the screen alongside them. So, not quite staying out of people's way.
  2. Maybe the attribute gets lost if somebody does something like the following:
lm1 = lm(y ~ z, weights=ipr(z,strata,clus)*prior_sampling_weights, data=dat)

Both concerns could be addressed by having ipr output not just a vector but an object of some new class of our devising: one that carries the Design info in addition to the unit weights, that we can write print and other methods for, and that looks like a numeric vector of weights to functions external to RItools.
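A rough sketch of such a class, with made-up names (the design slot is typed loosely here rather than tied to the actual Design class):

setClass("iprWeights",
         contains = "numeric",
         slots = list(design = "ANY"))
#' print only the numeric weights, keeping the Design out of people's way
setMethod("show", "iprWeights",
          function(object) print(object@.Data))
#' concern 2 would still need care: arithmetic drops the class, e.g.
#' class(new("iprWeights", c(.5, 2), design = NULL) * 2) is plain "numeric"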

benthestatistician commented 7 years ago

Assigning this to Mark for consideration & comment.

benthestatistician commented 7 years ago

How about using the .Data slot of eem:::Design objects to store inverse probability of assignment weights? It's there for backward compatibility with S3, which strikes me as prima facie favorable to this usage scenario, and to continued support of it.

Example:

#' Concept: pass design objects as weights to `lm` and 
#' friends, relying on class coercions to convert these
#' to numeric vectors of IPW weights when that is 
#' appropriate, otherwise maintain structure
#' `lm` test setup cribbed from `example(lm.wfit)`
n <- 7 ; p <- 2
X <- matrix(rnorm(n * p), n, p) 
y <- rnorm(n)
dat = data.frame(y=y, X)
setClass("AssignmentWeights",
         contains = "numeric",
         slots = list(type = "character", treatment_levels = "factor"))
w2 <- new("AssignmentWeights", 1L:n, type="ETT", 
          treatment_levels=factor(letters[1L:n]))
lm2 = lm(y~X1+X2, data=dat, weights=w2)

with this windup,

> class(lm2$weights)
[1] "integer"
> #' grrr...However,
> class(lm2$model$"(weights)")
[1] "AssignmentWeights"
attr(,"package")
[1] ".GlobalEnv"
> lm2$model$"(weights)"@type
[1] "ETT"
markmfredrickson commented 7 years ago

We haven't had great luck forcing lm to do what we want. What about a "design-based lm" function dblm (or some similar name) that wraps lm and returns an appropriate object? This would also be useful if we wanted to make the computation of standard errors appropriately design-based.
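
For concreteness, one rough sketch of that idea (dblm, its arguments, and the $design field are all made-up names, not an implemented API):

#' fit via lm() with the design's IP weights, then carry the design
#' along in the fit so later summaries can be design-based
dblm <- function(formula, data, design, wts, ...) {
  data$.ipr.wts <- wts  # put the weights in the frame so lm() finds them
  fit <- lm(formula, data = data, weights = .ipr.wts, ...)
  fit$design <- design  # stash the design for downstream methods
  class(fit) <- c("dblm", class(fit))
  fit
}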

As a rule, attributes are pretty fragile. They don't get copied around as one might expect.

That said, if you would prefer to link into the default lm, I have no argument with including the IP weights as the .Data slot. It is otherwise not being used.

benthestatistician commented 7 years ago

I think I do still prefer to link into the default lm, for reasons somewhat orthogonal to @markmfredrickson's reasons not to. But I agree that unexpected loss of attributes is a real concern, and I've been doing various little test cases behind the scenes to try to build a sense of when I will and won't see it. My basic conclusion is that many if not all commonly used model-fitting functions leave attributes as-is when bundling a column into a model frame, and that's essentially what would be needed for the purposes I have in mind.

While I don't have anything against dedicated functions to wrap lm and so forth, what I'd really like is something more along the lines of a dbsummary function, with methods for lm's and other stuff; see the skeleton below. I envision it supporting not only design-based standard errors but also automatic stacking of estimating equations when appropriate, e.g. when the model-fitting call being summarized itself invoked previous model fits via @josherrickson's scores operator.
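
Skeleton of the sort of thing I mean (all names hypothetical; the HC2 call is a placeholder for whatever design-based variance we'd actually compute from the stored Design):

dbsummary <- function(object, ...) UseMethod("dbsummary")
dbsummary.lm <- function(object, ...) {
  ## placeholder vcov; a real method would use the Design carried in the fit
  lmtest::coeftest(object, vcov = sandwich::vcovHC(object, type = "HC2"))
}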

Anyway, it's useful that Mark's instincts about this run somewhat opposite to mine -- perhaps that'll help him quickly identify overly wishful thinking on my part. Mark, is it worth my pasting some of my code experiments into this thread, the ones leading me to think that columns of model frames are a good place to store lots of hidden S4 attributes, so you can point out flaws in my thinking?

benthestatistician commented 7 years ago

Here are the common model-fitting functions I've looked at so far, with tests of preservation of S4 class structure that I've applied.

#' Functions that appear to be compatible
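#' (assuming w4 is an AssignmentWeights object built like w2 above, e.g.)
w4 <- new("AssignmentWeights", 1L:n, type = "ETT",
          treatment_levels = factor(letters[1L:n]))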
#' glm works
glm4 = glm(y~X1+X2, data=dat, weights=w4)
isS4(glm4$model$"(weights)")
is(glm4$model$"(weights)")
glm4$model$"(weights)"@type=="ETT"

#' robustbase::lmrob
#' 
lmrb4 = robustbase::lmrob(y~X1+X2, data=dat, weights=w4)
isS4(lmrb4$model$"(weights)")
is(lmrb4$model$"(weights)")
lmrb4$model$"(weights)"@type=="ETT"

#' robustbase::glmrob
#' Notes: Warning generated in initial fit suggests AssignmentWeights
#' ought to have a names slot
glmrb4 = robustbase::glmrob(y~X1+X2, data=dat, family=gaussian, weights=w4)
isS4(glmrb4$model$"(weights)")
is(glmrb4$model$"(weights)")
glmrb4$model$"(weights)"@type=="ETT"

#' lme4::lmer
#' 
data(Orthodont,package="nlme")
ortho_w4 <- new("AssignmentWeights",
                1 + as.numeric(Orthodont$Sex == "Male"),
                type = "ETT",
                treatment_levels = factor(letters[1L:n]))
anlme4 = lme4::lmer(distance ~ age + (1|Subject), data=Orthodont, weights=ortho_w4)
isS4(model.frame(anlme4)$"(weights)")
is(model.frame(anlme4)$"(weights)")
model.frame(anlme4)$"(weights)"@type=="ETT"

#' Functions that appear incompatible
#' MASS::rlm 
inherits(try(MASS::rlm(y~X1+X2, data=dat, weights=w4)), "try-error")
jwbowers commented 6 years ago

Hi All, I'm trying to get a sense of the workflow here. The idea is for us to have some recommended procedures in RItools for estimating causal effects --- to be used after matching or in the case of an experiment? Ben's insight is that we can boil a lot of the estimation issues down to appropriately representing the design using weights. And that, in contrast to svyglm and svydesign in the survey package, we'd like to just produce weights that other common tools can use, plus another function that produces p-values and/or confidence intervals and/or standard errors reflecting the design (like a dbsummary command following an lm call that used our weights function).

We would then do:

thelm <- lm(y ~ z, weights = designweights)
dbsummary(thelm)

...rather than ...

## assumes dplyr for the grouped calculations, plus lmtest and
## sandwich for the HC2 standard errors
library(dplyr)
library(lmtest)
library(sandwich)

## for trt = 0, 1
dat <- dat %>%
  group_by(block) %>%
  mutate(trtprob = mean(trt),
         nbwt = trt / trtprob + (1 - trt) / (1 - trtprob),
         nb = n(),
         nTb = sum(trt),
         nCb = nb - nTb,
         hbwt = 2 * (nCb * nTb) / (nTb + nCb))

## using the block size weights (nbwt) rather than the harmonic mean weights (hbwt)
thelm <- lm(y ~ trt, data = dat, weights = nbwt)
coeftest(thelm, vcov = vcovHC(thelm, "HC2"))

Am I getting the point of this idea? This last thing is what I have been doing (except with harmonic weights when the proportions treated within blocks vary greatly).

benthestatistician commented 6 years ago

Yep, that's the basic idea I had in mind!

jwbowers commented 6 years ago

Ok. So the methodological issue, beyond software, is that lm + vcovHC can produce confidence intervals with incorrect coverage in certain situations. (We also know that there is a precision trade-off between different ways of defining the target of estimation -- i.e., with unequal probabilities of assignment by block, the harmonic-weight estimand is estimated more precisely than the block-size-weight estimand.)
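(Spelling the two estimands out for reference: the block-size-weight estimand averages block-level effects with weights n_b/N, while the harmonic-weight estimand weights block b by h_b = 2 * nTb * nCb / (nTb + nCb), i.e. in proportion to n_b * p_b * (1 - p_b), so blocks with lopsided assignment splits count for less -- which is where the precision gain comes from.)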

I'm not clear on the bias issues here, although I've seen people say one or another estimator is biased -- when they define the estimand one way and the estimator another.

benthestatistician commented 5 years ago

I've closed #91, but (for ease of reference) here's the comment from there that had referenced this issue.

  1. Instead of a structure for carrying multiple design "options" referring to a common array of subjects, it's better to have a structure for a single design, with varying elaborations on that structure: this will make it easier to link balT()-related calcs and workflows with #83. It also goes with avoiding building and storing a list of (potentially large) CovsAlignedToADesign objects, in favor of completing the calculations on one before moving on to building the next.
  2. Each object encoding a design should encode stratification info in terms of the clusters rather than the elements, when that distinction arises. Separately, it should store the mapping of elements to clusters.
  3. I now have a use case for side-by-side presentation of distinct balanceTest() type calculations that involve different clustering variables. The current premise of xBal and balT is that there will be occasion for comparing different stratifications side-by-side, but that the clustering underlying the alternatives should always be the same. I don't propose to clutter the balT UI to accommodate this, but as #94 progresses we may find a natural way to accommodate it. Dropping the DesignOptions premise that the sequence of designs presented alongside one another should all have the same clustering will make such accommodation easier and more natural.
benthestatistician commented 2 years ago

Flexida project has a revised & improved version of this; closing.