formula bal.tab for binary variables?

mbloechl05 commented 5 years ago

I was wondering about the specific formula you use to calculate balance diagnostics for binary variables? I have read and understood your explanation in the function documentation (https://www.rdocumentation.org/packages/cobalt/versions/3.7.0/topics/bal.tab). However, when I check the standardised solution of the function, it does not seem to be consistent with the solution by Austin, 2009 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472075/), which is often used.

So are you calculating a different standardised solution? If so, how?

ngreifer commented 5 years ago

By default, bal.tab computes raw differences in proportion for binary variables. This is because it doesn't make sense to standardize binary variables. If you set binary = "std", then bal.tab will compute standardized mean differences for binary variables.

The denominator of the standardized mean difference depends on the estimand you're using. If you're preprocessing for the ATE, then the formula in Austin (2009) will be used. If you're preprocessing for the ATT, then the denominator will be the standard deviation of the covariate in the treated group. This is to remain consistent with MatchIt and twang. Also, the denominator is always computed using the unadjusted (i.e., unweighted, unmatched) variance(s). This is recommended by Stuart (2010) so that changes in mean difference are not conflated with changes in variance.

The formulas I'm using are the following, where x is the binary covariate, t is the treatment, and w are the weights:

For the ATE:
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt((var(x[t==1]) + var(x[t==0]))/2)

For the ATT:
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt(var(x[t==1]))
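The two formulas can be wrapped into a small base-R helper for checking results by hand. This is only a sketch under stated assumptions: the function name `smd_unadjusted_denom` and its arguments are made up for illustration and are not part of cobalt, and the toy data are arbitrary.

```r
# Hypothetical helper, not a cobalt function: standardized mean difference
# with weighted means in the numerator and an unweighted (unadjusted)
# standard deviation in the denominator.
smd_unadjusted_denom <- function(x, t, w = rep(1, length(x)),
                                 estimand = c("ATE", "ATT")) {
  estimand <- match.arg(estimand)
  treated <- t == 1
  control <- t == 0
  num <- weighted.mean(x[treated], w[treated]) -
    weighted.mean(x[control], w[control])
  denom <- switch(estimand,
    ATE = sqrt((var(x[treated]) + var(x[control])) / 2),  # pooled SD
    ATT = sqrt(var(x[treated]))                           # treated-group SD
  )
  num / denom
}

# Toy data chosen only for illustration
x <- c(1, 1, 1, 0, 1, 0, 0, 1)
t <- c(1, 1, 1, 1, 0, 0, 0, 0)
smd_ate <- smd_unadjusted_denom(x, t, estimand = "ATE")
smd_att <- smd_unadjusted_denom(x, t, estimand = "ATT")
```

Because var() is always applied to the unweighted x, changing w only moves the numerator, which is the behaviour described above for the denominator.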

Let me know if this answers your question. If you've taken all this into account and you're still getting discrepancies, please let me know.

mbloechl05 commented 5 years ago

Great, thanks for your quick reply! I have used bal.tab with the argument binary = "std" for an ATE.

And I have double-checked the result by calculating the standardised mean difference by hand in R. I used the formula specifically derived for binary variables, which is given as:
d = (mean(x1)-mean(x2))/sqrt(((mean(x1)*(1-mean(x1)))+(mean(x2)*(1-mean(x2))))/2)

(see formula 2 in Austin, 2009)

This will, however, necessarily differ from the standardised solution in bal.tab, because
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt((var(x[t==1]) + var(x[t==0]))/2)
doesn't take into account that the Bernoulli distribution of the binary variable has one parameter fewer than the distribution of a continuous variable. So you should adjust the degrees of freedom when calculating the variance.
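A quick way to see where the discrepancy comes from: for a 0/1 variable, R's var() divides by n - 1, whereas p * (1 - p) is the variance with divisor n. A minimal check on a made-up vector (the variable names are arbitrary):

```r
# Made-up binary vector, used only to illustrate the identity
x <- c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1)
n <- length(x)
p <- mean(x)

sample_var    <- var(x)       # divisor n - 1
bernoulli_var <- p * (1 - p)  # divisor n, as in the binary-variable formula

# Rescaling the sample variance by (n - 1) / n recovers p * (1 - p)
all.equal(sample_var * (n - 1) / n, bernoulli_var)
```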

I have simulated a minimal example here:

x1 <- (runif(50)<=.75)+0
x2 <- (runif(50)<=.50)+0
n1 <- length(x1)
n2 <- length(x2)

# without df correction (as in bal.tab)
st_diff_1 <- (mean(x1)-mean(x2))/sqrt((var(x1)+var(x2))/2)

# formula derived by Austin 2009 for binary variables
st_diff_2 <- (mean(x1)-mean(x2))/sqrt(((mean(x1)*(1-mean(x1)))+(mean(x2)*(1-mean(x2))))/2)

# with df correction
st_diff_3 <- (mean(x1)-mean(x2))/sqrt(((var(x1)*(n1-1)/n1)+(var(x2)*(n2-1)/n2))/2)

This gives:

> st_diff_1
[1] 0.985884
> st_diff_2
[1] 0.9958932
> st_diff_3
[1] 0.9958932

So the first one (as implemented in bal.tab) gives a different estimate than the other two solutions.
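When both groups have the same size n, the two denominators differ by a constant factor: the sample-variance version is the Bernoulli version multiplied by sqrt(n / (n - 1)), so the resulting standardised difference is smaller by a factor of sqrt((n - 1) / n). A small sketch with made-up data (the proportions and variable names are arbitrary):

```r
# Made-up binary samples of equal size
x1 <- rep(c(1, 0), times = c(38, 12))  # 50 observations, proportion 0.76
x2 <- rep(c(1, 0), times = c(25, 25))  # 50 observations, proportion 0.50
n <- length(x1)

# Denominator from the sample variance (divisor n - 1)
smd_sample_var <- (mean(x1) - mean(x2)) / sqrt((var(x1) + var(x2)) / 2)

# Denominator from p * (1 - p), the binary-variable formula
smd_bernoulli <- (mean(x1) - mean(x2)) /
  sqrt((mean(x1) * (1 - mean(x1)) + mean(x2) * (1 - mean(x2))) / 2)

smd_sample_var / smd_bernoulli  # ratio of the two estimates
sqrt((n - 1) / n)               # the same value when n1 == n2
```

For n = 50 this factor is about 0.99, so the two estimates are close, and the gap shrinks further as n grows.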

Does that make sense?

ngreifer commented 5 years ago

You're absolutely right, and I had totally overlooked that. I had originally written cobalt to be a supplement to MatchIt and wanted the results to be the same in the two packages, and MatchIt uses the sample SD formula for both binary and continuous variables. I'll look into implementing the correct formula.

ngreifer commented 5 years ago

Note: the new twang (and possibly the old twang) correctly uses Austin's formulas.

mbloechl05 commented 5 years ago

Perfect, many thanks for implementing this so quickly! It's indeed super easy to overlook this; I just caught it by chance.

ngreifer / cobalt

formula bal.tab for binary variables? #21