ngreifer / WeightIt

WeightIt: an R package for propensity score weighting
https://ngreifer.github.io/WeightIt/

tutorial with code for using sampling weights #34

Closed kaseyzapatka closed 1 year ago

kaseyzapatka commented 2 years ago

Hi @ngreifer!

Thanks so much for a wonderful package and the seamless implementation of so many weighting methods. I'm running some robustness checks on an analysis using the "ebal" method, and I'm getting slightly different coefficients (and significance levels) when I include sampling weights (using the s.weights argument) during the weighting process versus when I don't.

I would assume that supplying the sampling weights during the weighting process adjusts the entropy weights, so that only the entropy weights are then needed in the estimation of the treatment effect. Is this the correct assumption?

Is there some logic I'm missing, or is there a tutorial/vignette on using entropy balancing with sampling weights that I could check to make sure my implementation of the sampling weights is correct?

Thanks. Best, Kasey

ngreifer commented 2 years ago

Hi Kasey, thank you so much for the kind words.

If you have sampling weights, you must include them in the estimation of the weights and you must include them in the estimation of the treatment effect by multiplying them by the estimated weights. These steps are both necessary, so it's not a robustness check to omit them.

The reason this is so critical is that entropy balancing only produces weights that are representative of the sampled population when the sampling weights are supplied to the weight estimation; that is how the algorithm knows what the weighted population should look like. For example, if the sampling-weighted sample has a mean of 5 for covariate x and the unweighted sample has a mean of 10, then entropy balancing without the sampling weights will yield a weighted sample with a mean of 10 for x, which is not representative of the population you want to generalize the effect to. You must also multiply the entropy balancing weights by the sampling weights in the outcome analysis, because the entropy balancing weights are designed to balance the covariates only after being multiplied by the sampling weights.

I don't have a tutorial up for that but it is something I might add down the line, so thank you for the suggestion. Fortunately, it's really easy. Let's say the sampling weights are stored in the variable sw in the dataset data.

library("WeightIt")
library("cobalt")  #for bal.tab()
library("survey")  #for svydesign() and svyglm()

#Estimate entropy balancing weights, supplying the sampling weights
w.out <- weightit(A ~ X1 + X2, data = data, method = "ebal", s.weights = data$sw)

#Checking balance; bal.tab() automatically incorporates the sampling weights
bal.tab(w.out, un = TRUE)

#Estimate the treatment effect using the product of the sampling and
#entropy balancing weights
des <- svydesign(~1, data = data, weights = data$sw * w.out$weights)
fit <- svyglm(Y ~ A, design = des)
summary(fit)
kaseyzapatka commented 2 years ago

Thanks, @ngreifer! This is really helpful.

To be clear, I was running models without any survey weights and then again with survey weights to make sure I get similar results in both sets of models--in sociology there's some debate as to whether using survey weights in regression is necessary. Usually I get the same results. That said, in this case I was not entering the survey weights correctly in the estimation of the treatment effect--I didn't know you could enter them as data$sw*w.out$weights. So, thanks for that! I figured you needed to multiply them but wasn't sure whether that was done under the hood or whether you needed to do it manually. I'd think other users might run into this same problem, so more documentation might be helpful here, though the package is already well documented.

Also, how do you recommend estimating treatment effects? Simply regress the outcome on the treatment, as in your example above, or add all the model covariates to the estimation of the treatment effect? I've seen it done both ways, and your vignette has the quote below about doubly robust estimates.

If we wanted to produce a “doubly-robust” treatment effect estimate, we could add baseline covariates to the glm() (or svyglm()) model (in both the original effect estimation and the confidence interval estimation).

Weirdly, the treatment effect coefficient differs slightly in magnitude across the nested models (roughly the same, though) and in some of them becomes non-significant. Could it be because I'm using a binomial logit link (my outcome is a fraction) instead of straight OLS? I'm thinking I might just stick with regressing the outcome on the treatment, given that that is what I most commonly see. Here's my code; I'm comparing the treatment effect in fit0 with that in fit1 and fit2.

des <- svydesign(~1, data = data, weights = data$sw * w.out$weights)

#Unadjusted and covariate-adjusted fractional logistic models
fit0 <- svyglm(Y ~ A, design = des, family = binomial(link = "logit"))
fit1 <- svyglm(Y ~ A + X1, design = des, family = binomial(link = "logit"))
fit2 <- svyglm(Y ~ A + X1 + X2, design = des, family = binomial(link = "logit"))
ngreifer commented 2 years ago

Given that you have a fractional outcome, I highly recommend you just use a linear model. The treatment coefficient in fractional logistic regression is uninterpretable; you have to use a marginal effects procedure (e.g., margins in Stata or R) to arrive at an interpretable effect (in this case, the difference in means). Using a marginal effects procedure on a binary or fractional logistic model will give you the same effect estimate as just using a linear model, assuming there are no covariates in the model. The standard errors are in general wrong when you get the outcome distribution wrong, but because you are using robust standard errors (this is what svyglm() does), this doesn't matter, and you will get almost exactly the same standard errors whether you use a linear or (fractional) logistic model.
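
As a minimal sketch (not from the thread) of what such a marginal effects procedure could look like in R for the unadjusted fractional logistic model fit0 above, one option is the marginaleffects package; this is just one possible choice, and the margins package or manual g-computation would work as well:

library("marginaleffects")

#Average difference in predicted outcomes between A = 1 and A = 0
#(i.e., the marginal difference in means), assuming A is coded 0/1.
#Passing the combined weights to wts = is an assumption here, to make
#sure the average is taken over the sampling-and-balancing-weighted sample.
avg_comparisons(fit0, variables = list(A = 0:1), wts = data$sw * w.out$weights)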

When you add covariates into the model, things get more complicated. For logistic regression, adding covariates changes the estimand to the conditional effect of the treatment (i.e., within each level of the covariate), whereas linear regression keeps the treatment effect estimate as a marginal effect. So, the interpretation of the coefficient on treatment in the covariate-adjusted logistic regression is different from its interpretation in the unadjusted logistic regression model, which is part of why the statistical conclusion may change. If you include covariates in the logistic regression, you can arrive at a marginal effect estimate by using a marginal effects procedure. But the coefficient on the treatment in a linear regression is unbiased for the marginal effect of treatment even if the true outcome model is logistic (this is why the linear probability model is popular in economics).
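
Continuing the sketch above, the same kind of call applied to the covariate-adjusted logistic model recovers a marginal effect even though the coefficient on A in that model is conditional:

#Marginal effect of A from the covariate-adjusted fractional logistic model
avg_comparisons(fit2, variables = list(A = 0:1), wts = data$sw * w.out$weights)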

Sorry if this was a lot. My recommendation is to use a covariate-adjusted linear regression model with the sampling and entropy balancing weights included. The effect estimate shouldn't change whether you include the covariates or not because of how entropy balancing exactly balances the covariate means, but the standard error on the treatment coefficient will be smaller. You can interpret this coefficient as the marginal difference in means. If you think your readers will be suspicious of this model, you can also report the unadjusted model. Again, you can use either linear or logistic for this, but if you use logistic, you have to use a marginal effects procedure, and that will leave you with the same effect estimate and standard error as the linear model.
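
A minimal sketch of that recommendation, reusing the des design object and the placeholder covariate and object names from the code above (fit_lin is just an illustrative name):

#Covariate-adjusted linear model; the sampling and entropy balancing weights
#are already combined in the des design object
fit_lin <- svyglm(Y ~ A + X1 + X2, design = des)
summary(fit_lin)  #the coefficient on A is the marginal difference in means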

One last thing (a minor point, really) is that when you have a non-binary outcome (or sampling weights) and you're using logistic regression, you might want to set family = quasibinomial, which is more technically correct and will prevent a warning about "non-integer #successes".
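
For example, refitting the unadjusted model from above:

fit0 <- svyglm(Y ~ A, design = des, family = quasibinomial(link = "logit"))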

kaseyzapatka commented 2 years ago

Hey @ngreifer!

Thanks for taking the time to clarify these questions--definitely not too much. I think your suggestion of using covariate-adjusted linear regression makes the most sense, given that I'd get the same answer anyway, and it explains why the coefficient on the treatment was slightly different in each nested model. I was planning to use a marginal effects procedure for the fractional regression, but that might be overkill since I'm mainly interested in the marginal effect of treatment.

Thanks for the heads up about family = quasibinomial!

Kasey