ngreifer / WeightIt

WeightIt: an R package for propensity score weighting
https://ngreifer.github.io/WeightIt/

Generic variance matrix adjusting for weight estimation? #65

Closed by awhug 1 month ago

awhug commented 1 month ago

Hi Noah,

First of all, a huge thank you not just for this package, but for everything you've contributed to making propensity score weighting/matching easy to implement and interpret. I've read countless StackExchange/GitHub issue answers from you, and it's remarkable how much I've learned. I really appreciate it, and I'm sure I'm not alone.

My question concerns finding a robust variance matrix to use after model estimation that accounts for estimation of the weights (e.g., for feeding the model and a vcov object into lmtest or marginaleffects).

It seems this weighting-adjusted matrix is currently only produced via the glm_weightit() function. It's not clear how a user would obtain one for a model fit with a different package that still works within the GLM framework (e.g., a bias-reduced GLM as in brglm2, or a fixed effects model as in fixest::feglm()). Just running sandwich::vcovHC() ignores the impact of the weight estimation and returns standard errors that differ from those produced by glm_weightit() (as the documentation helpfully notes).
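
For concreteness, here is a sketch of the two routes I mean (the object names d, y, treat, x1, and x2 are placeholders):

# Weight-adjusted route: vcov(fit) accounts for estimation of the weights
w.out <- WeightIt::weightit(treat ~ x1 + x2, data = d, method = "glm")
fit <- WeightIt::glm_weightit(y ~ treat, data = d, weightit = w.out)
lmtest::coeftest(fit, vcov. = vcov(fit))

# Generic route: robust to heteroscedasticity, but treats the weights as fixed
fit2 <- glm(y ~ treat, data = d, weights = w.out$weights)
lmtest::coeftest(fit2, vcov. = sandwich::vcovHC(fit2))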

Is there any scope within this package to implement a method that exports a generic vcov object for models fit outside of glm_weightit()?

Thanks again

ngreifer commented 1 month ago

Hi Angus,

Thank you for the kind words! It means a lot.

I would love to figure out a way to make any model compatible with a covariance matrix that adjusts for estimation of the weights. Unfortunately, it's not easy to make this happen. Estimating this covariance matrix requires computing each unit's score function, i.e., its contribution to the partial derivatives of the outcome model log-likelihood with respect to the model parameters. That is not an easy function to write in a generic way. Fortunately, it is fairly simple for GLMs, which are the most common outcome model type, which is why I made them compatible first.

I have been slowly expanding this capability to other models. In the upcoming version of WeightIt, there will be support for multinomial logistic regression, ordinal regression (using a function similar to MASS::polr()), and bias-reduced GLMs from brglm2.

A problem is that I have to add each model type individually, and I have to do this by deriving the score function for each model, which takes a lot of trial and error since my calculus is not very good. There are some models for which this is essentially impossible, such as random effects models, and some for which it is possible but would be very challenging (e.g., ordered beta regression). Some models I have planned, but the statistical theory isn't well developed enough (Cox models), and some I have ready to go but they aren't better than the alternatives (negative binomial regression).

Essentially, all I need for a model is a function of the form:

psi <- function(B, X, Y, W) {}

that takes in a vector of outcome model coefficients B, the outcome model design matrix X, the outcome vector Y, and the weights W. This function needs to return a matrix with a row for each unit and a column for each coefficient, and it needs to have the following property: when and only when the optimal outcome coefficients B are supplied (i.e., those resulting from the weighted model fit), the column sums of the output equal 0.

The roadblock in getting new models added is simply writing this function (and creating a user-friendly interface); if there were a way to write it automatically, the problem would be solved. Until then, I have to write each one manually, and that means I have to derive each one. Maybe we can start a movement to crowdsource the writing of these functions. I have been using ChatGPT to help me, but as you probably know, it is often inaccurate, and I have to do a lot of checking to make sure its output is correct.
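
As an illustration (a sketch, not the code WeightIt uses internally): for a weighted logistic regression with the canonical logit link, each unit's score contribution is W * (Y - mu) * X, so the function would be:

psi_logit <- function(B, X, Y, W) {
  mu <- plogis(drop(X %*% B))  # fitted probabilities at coefficients B
  W * (Y - mu) * X             # n x p matrix of score contributions
}

# Checking the defining property on simulated data: at the weighted
# fit, the column sums are approximately zero
n <- 200
X <- cbind(1, rnorm(n))
Y <- rbinom(n, 1, 0.5)
W <- runif(n)
fit <- glm(Y ~ X - 1, family = quasibinomial, weights = W)
colSums(psi_logit(coef(fit), X, Y, W))  # all approximately zero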

For fixed effects models, you should just be able to include the fixed effect variable as a predictor. The only difference between doing so and using feols(), etc., is computational performance. Of course, with large datasets, the full design matrix can get huge when using fixed effects with many levels. I might look into adding this as a capability.
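
To illustrate the factor-variable approach (a sketch; d, firm_id, and the covariates are placeholder names):

w.out <- weightit(treat ~ x1 + x2, data = d, method = "glm", estimand = "ATT")
fit <- glm_weightit(y ~ treat + x1 + factor(firm_id), data = d, weightit = w.out)
summary(fit)  # standard errors account for estimation of the weights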

I would be happy to hear any more suggestions you have to make the package better.

awhug commented 1 month ago

Thanks for this, Noah

Happy to see this closed, as I feel the original issue is answered, but one further question out of curiosity: what will the design of the new model functionality be? At the moment I'm imagining that, for ease of use and practicality, each will resemble (or even extend?) the existing glm_weightit(), basically acting as a wrapper function. Is that right?

I ask because I wonder if there's any utility in designing a user interface somewhat akin to the sandwich package, where a user simply specifies that they want a variance matrix produced and the function infers the correct method of calculating it (I'd guess from the model class), without needing to refit the model per se. This could plausibly make your life easier as well, because users could then input their own custom psi function as an argument and possibly contribute these to the package.
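
Something like this purely hypothetical interface (vcov_weightit() and my_psi are made-up names; nothing like this exists in WeightIt today):

V <- vcov_weightit(fit,              # any supported model object
                   weightit = w.out, # the weightit object used in fitting
                   psi = my_psi)     # optional user-supplied score function
lmtest::coeftest(fit, vcov. = V)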

ngreifer commented 1 month ago

Hi Angus,

The design will be very similar to glm_weightit(): for multinomial regression, there will be mlogit_weightit(), which has an interface like that of nnet::multinom(); for ordinal regression, there will be polr_weightit(), which has an interface like that of MASS::polr(); and for bias-reduced GLMs, there will be a br argument to glm_weightit(), which, when TRUE, calls brglm2::brglmFit() and uses the bias-corrected score functions.

They are actually not wrapper functions; I coded my own versions of multinomial logistic regression and ordinal regression to keep the dependencies low, and I made a few changes to better support features I think are important. For coxph_weightit(), which is already available, I just call survival::coxph(), since Cox models are a bit foreign to me.
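
Roughly, usage will look like the following (a sketch of the planned interfaces; the data and variable names are placeholders, and details may change):

fit_m <- mlogit_weightit(choice ~ x1 + x2, data = d, weightit = w.out)  # like nnet::multinom()
fit_o <- polr_weightit(rating ~ x1 + x2, data = d, weightit = w.out)    # like MASS::polr()
fit_br <- glm_weightit(y ~ x1 + x2, data = d, family = binomial,
                       weightit = w.out, br = TRUE)                     # bias-reduced GLM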

I like your idea and can see how it could be added in the future. I don't want to go too deep into this, because really I just want to make the most commonly used models available without having to rewrite all of R. What you're describing is close to an API for M-estimation, which already exists in the R package geex. This is a bit different since it's focused on weighting, but I need to strike a balance between being completely general and flexible and retaining control over what goes into my package. For now, I'll be adding a few models myself manually, adding more as they are requested or as I find them interesting. Of course I will take user suggestions and requests for specific models, but I think a fully flexible API is too ambitious for me at this point.
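
For reference, geex asks the user to supply the unit-level estimating function directly. A minimal sketch of my own (estimating a mean as an M-estimator; d is a placeholder data frame with a column y):

library(geex)

# Each unit contributes y_i - theta; the root of the summed estimating
# equations is the sample mean
estFUN <- function(data) {
  function(theta) data$y - theta
}
res <- m_estimate(estFUN = estFUN, data = d,
                  root_control = setup_root_control(start = 0))
roots(res)  # point estimate
vcov(res)   # sandwich variance estimate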

awhug commented 1 month ago

Nice! That makes sense to me (especially keeping dependencies down, given past painful experiences using mlogit).

I have no immediate need for any specific models personally, although I do have a large dataset that I was considering using fixest with. In practice, I suspect a regular GLM will fit my needs, albeit a bit slowly.

Appreciate the thoughtful responses as always