gphk-metrics / stata-multe

Multiple Treatment Effects

Pull Request for #7: Add weights #13

Closed: mcaceresb closed this 7 months ago

mcaceresb commented 2 years ago

I've implemented aweight (the default) and fweight for multe. aweights are rescaled so that they sum to the number of observations; fweights treat each weight as the number of copies of the corresponding observation.
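Concretely, with $N$ observations the analytic weights are stored as

$$ \tilde{w}_i = \frac{N \, w_i}{\sum_{j=1}^{N} w_j} $$

so that $\sum_i \tilde{w}_i = N$ (this is the same rescaling done manually with _aw in the snippet below).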

Help with Formulas

@peterdhull @kolesarm I've implemented some consistency checks. I'm confident in the fweight implementation, as well as aweight for the ATE and OAAT, since I was able to check them against teffects and regress. However, I'm less sure about the common weights and the decomposition. If you have any time, could you either review the snippets I link below or point me to the correct weighted formulas?

Below I show the consistency checks I've coded, and I also include a note on why fweights and aweights compute different variances (feel free to skip it).

Consistency checks

You can see in test/test_weights.do that I've implemented some consistency checks for weights using the Star data. Up to a df adjustment, the ATE and OAAT SEs match regress and teffects for fweight and aweight (teffects does not allow aweights; I did the rescaling manually and got the same answer). Here's a sample snippet:

set seed 1729
use test/example_star.dta, clear
gen `c(obs_t)' G = _n
gen byte _expand = ceil(runiform() * 10)
qui sum _expand
gen double _aw = _N * _expand / r(sum)

multe score treatment             , control(school)
multe score treatment [aw=_expand], control(school)
multe score treatment [fw=_expand], control(school)

teffects ipw (score) (treatment i.school)
teffects ipw (score) (treatment i.school) [pw=_aw]
teffects ipw (score) (treatment i.school) [fw=_expand]

A note on the variance formulas

Because of the rescaling, quantities like weighted averages are the same under either weight type. However, the two interpretations lead to different formulas for the vcov, beyond just a sample-size adjustment. For analytic weights (the default), for example, the OLS variance is

$$ V((X' W X)^{-1} X' W e) = (X' W X)^{-1} X' W V(e) W X (X' W X)^{-1} $$

For frequency weights, it's

$$ (X' W X)^{-1} X' W V(e) X (X' W X)^{-1} $$

The reason is that, unweighted,

$$ X' V(e) X = \sum_i e_i^2 x_i' x_i $$

With frequency weights, the exact same sum is given by

$$ \sum_j w_j e_j^2 x_j' x_j $$

with $w_j$ the size of each group. However, with analytic weights we have $w_j^2$ instead. Here's a snippet where you can see this explicitly:

sysuse auto, clear
mata {
    Y    = st_data(., "price")
    X    = st_data(., "turn"), J(st_nobs(), 1, 1)
    w    = st_data(., "mpg")
    AW   = diag(st_nobs() * w / sum(w))
    FW   = diag(w)
    b    = invsym(X' * AW * X) * X' * AW * Y
    reldif(b, invsym(X' * FW * X) * X' * FW * Y)
    e    = Y - X * b
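    // aw sandwich: the weight matrix AW appears on both sides of diag(e:^2), so w_j^2 overall
    // fw sandwich: FW appears only once, so w_j overall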
    Vaw  = invsym(X' * AW * X) * X' * AW * diag(e:^2) * AW * X * invsym(X' * AW * X)
    Vfw  = invsym(X' * FW * X) * X' * FW * diag(e:^2) * X * invsym(X' * FW * X)
    qcaw = st_nobs() / (st_nobs() - 2)
    qcfw = sum(w) / (sum(w) - 2)
    seaw = sqrt(diagonal(Vaw) * qcaw)
    sefw = sqrt(diagonal(Vfw) * qcfw)
}

* I can mimic aw and fw separately, and note the formulas change beyond the df adjustment
reg price turn [aw = mpg], r
mata reldif(seaw, st_matrix("r(table)")[2, .]')
reg price turn [fw = mpg], r
mata reldif(sefw, st_matrix("r(table)")[2, .]')
mata reldif(sefw, seaw)

* And if you expand by the weight values, only fw mimics that
expand mpg
reg price turn, r
mata reldif(sefw, st_matrix("r(table)")[2, .]')
mata reldif(seaw, st_matrix("r(table)")[2, .]')
kolesarm commented 2 years ago

You're right: for fweight the variance formula is different, because we are pretending that each observation $i$ actually corresponds to $w_i$ identical observations. This makes $\sum_i w_i$ the sample size, and the OLS variance has, as you say, $\sum_i w_i e_i^2 X_i X_i'$ in the middle of the sandwich, because it is as if we observed the error $e_i$ a total of $w_i$ times.

So a simple consistency check for fweight is to create two datasets: one with duplicated observations, and one with only the unique observations plus a frequency weight counting the duplicates. Applying fweight to the dataset with unique observations should yield numerically the same standard errors as applying no weighting to the dataset with the duplicates (including in cases like the efficient common weights, etc.).
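Something along these lines, say (just a sketch reusing the Star setup from the snippet above; the _fw variable is made up and this isn't part of the tests):

set seed 1729
use test/example_star.dta, clear
gen byte _fw = ceil(runiform() * 10)

* fweight on the unique observations ...
multe score treatment [fw=_fw], control(school)

* ... should match the unweighted run on the physically duplicated data
expand _fw
multe score treatment, control(school)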

I don't actually know of any cases where this is super useful, since datasets rarely come in this form, but we can have the option for completeness.

For aweight, apart from renormalizing the weights to sum to one (which I don't think should actually matter, except perhaps for some finite-sample corrections), the weight $w_i$ gets squared in the formula. In linear regression, it should be equivalent to regressing $\sqrt{w_i} Y_i$ on $\sqrt{w_i} X_i$. This is the more sensible default in most cases. For common weights, there is a question of what the estimation target should be. One guiding principle could be that we are trying to target a treatment effect that weights states equally, even though an unweighted regression would give more weight to states with bigger populations, but I think that is what pweight is supposed to do?
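For what it's worth, a quick sketch of that $\sqrt{w_i}$ equivalence using the auto data (the sq* variable names are made up; the robust SEs should also agree, since the weight enters the sandwich squared either way):

sysuse auto, clear
gen double sqw  = sqrt(mpg)
gen double sqwY = sqw * price
gen double sqwX = sqw * turn

* aweight regression ...
reg price turn [aw=mpg], r

* ... should match regressing sqrt(w)*Y on sqrt(w)*X, with sqrt(w) playing
* the role of the constant
reg sqwY sqwX sqw, nocons r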

mcaceresb commented 2 years ago

> For aweight, apart from renormalizing the weights to sum to one (which I don't think should actually matter, except perhaps for some finite-sample corrections),

@kolesarm This is probably true in theory, but it's not obvious to me how to implement weights throughout the code so that re-scaling never matters. For instance, the weighted average of $x_i$ is the same as the plain average of $w_i \cdot x_i$ when the $w_i$ are re-scaled to have mean one, but not otherwise. Hence, while there are many places where multiplying $w_i$ by a constant cancels out, like in the sandwich formula, this is not always the case, and I'm not sure what the correct weighted formulas are in those other cases.
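For example, a quick sketch of the re-scaling point using the auto data (variable names made up):

sysuse auto, clear
qui sum mpg
gen double w_resc  = _N * mpg / r(sum)
gen double wx_resc = w_resc * price
gen double wx_raw  = mpg * price

* weighted average of price ...
sum price [aw=mpg]
* ... equals the plain average of w*price when w is re-scaled to mean 1
sum wx_resc
* ... but not when the raw weights are used
sum wx_raw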

(By the by, this is actually why I haven't added pweight; the SEs for ATE without re-scaling are wrong.)

kolesarm commented 2 years ago

Of course, you may need to properly define what you mean by the estimand if you don't re-scale the weights: for example, when weighting by inverse variance, it helps with interpretation if we don't do any rescaling, but we then need to take that into account when defining the object of interest. Usually people would define the weighted average as $\sum_i w_i x_i / \sum_i w_i$, since that's the minimizer of the objective $\sum_i w_i (x_i - b)^2$, and similarly for regression. This is how it's done in R, I think.
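(Spelling out that last step: the first-order condition of $\sum_i w_i (x_i - b)^2$ is $-2 \sum_i w_i (x_i - b) = 0$, so

$$ b = \frac{\sum_i w_i x_i}{\sum_i w_i}, $$

which is invariant to re-scaling all the $w_i$ by a constant.)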

For ATE, you need to define the object of interest as a weighted ATE, etc.
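In potential-outcome notation, presumably something like

$$ \tau_w(k) = \frac{E[\, w_i \{ Y_i(k) - Y_i(0) \} \,]}{E[\, w_i \,]} $$

for each treatment arm $k$ (or its sample analogue), so that again the scale of the weights drops out.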

mcaceresb commented 7 months ago

Continued in #14