Closed mbloechl05 closed 5 years ago
By default, bal.tab computes raw differences in proportion for binary variables. This is because it doesn't make sense to standardize binary variables. If you set binary = "std", then bal.tab will compute standardized mean differences for binary variables.
The denominator of the standardized mean difference depends on the estimand you're using. If you're preprocessing for the ATE, then the formula in Austin (2009) will be used. If you're preprocessing for the ATT, then the denominator will be the standard deviation of the covariate in the treated group. This is to remain consistent with MatchIt and twang. Also, the denominator is always computed using the unadjusted (i.e., unweighted, unmatched) variance(s). This is recommended by Stuart (2010) so that changes in mean difference are not conflated with changes in variance.
The formulas I'm using are the following, where x is the binary covariate, t is the treatment, and w are the weights:
For the ATE,
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt((var(x[t==1]) + var(x[t==0]))/2)
For the ATT,
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt(var(x[t==1]))
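As a minimal sketch of the two formulas above, the following helper wraps them in one function. The name smd_binary, its argument names, and the toy data are assumptions made for illustration and are not part of cobalt. It mirrors the formulas as posted: a weighted numerator and a denominator built from the unweighted sample variances.

```r
# Hypothetical helper, not part of cobalt: standardized mean difference with
# a weighted numerator and an unweighted (unadjusted) denominator.
smd_binary <- function(x, t, w = rep(1, length(x)), estimand = c("ATE", "ATT")) {
  estimand <- match.arg(estimand)
  # Weighted difference in means between treated (t == 1) and control (t == 0)
  num <- weighted.mean(x[t == 1], w[t == 1]) - weighted.mean(x[t == 0], w[t == 0])
  # Denominator from the unweighted sample variances
  den <- if (estimand == "ATE") {
    sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)
  } else {
    sqrt(var(x[t == 1]))
  }
  num / den
}

# Toy data, made up for illustration
x <- c(1, 1, 1, 0, 1, 0, 0, 1)
t <- c(1, 1, 1, 1, 0, 0, 0, 0)
smd_binary(x, t, estimand = "ATE")
smd_binary(x, t, estimand = "ATT")
```

With unit weights, as here, the numerator reduces to the plain difference in group means.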
Let me know if this answers your question. If you've taken all this into account and you're still getting discrepancies, please let me know.
Great, thanks for your quick reply! I have used bal.tab with the argument binary = "std" for an ATE.
And I have double checked the result by calculating the standardised mean difference by hand in R. I used the formula specifically derived for binary variables, which is given as:
d = (mean(x1)-mean(x2))/sqrt(((mean(x1)*(1-mean(x1)))+(mean(x2)*(1-mean(x2))))/2)
(see formula 2 in Austin, 2009)
This will, however, necessarily differ from the standardised solution in bal.tab because
(weighted.mean(x[t==1], w[t==1]) - weighted.mean(x[t==0], w[t==0])) / sqrt((var(x[t==1]) + var(x[t==0]))/2)
doesn't take into account that the Bernoulli distribution of a binary variable has one parameter fewer than the distribution of a continuous variable: var() divides by n - 1, whereas the Bernoulli variance p(1 - p) corresponds to dividing by n. So you should adjust the degrees of freedom when calculating the variance.
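The degrees-of-freedom point can be checked directly. For a 0/1 vector, the sample variance from var() equals the Bernoulli variance p(1 - p) only after rescaling by (n - 1)/n. A small sketch with a made-up vector:

```r
x <- c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1)  # made-up binary covariate
n <- length(x)
p <- mean(x)

var(x)                 # sample variance, divides by n - 1
p * (1 - p)            # Bernoulli variance, corresponds to dividing by n
var(x) * (n - 1) / n   # rescaled sample variance, matches p * (1 - p)
all.equal(var(x) * (n - 1) / n, p * (1 - p))
```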
I have simulated a minimal example here:
x1 <- (runif(50)<=.75)+0
x2 <- (runif(50)<=.50)+0
n1 <- length(x1)
n2 <- length(x2)
# use without df correction (as in bal.tab)
st_diff_1 <- mean(x1)-mean(x2)/sqrt((var(x1)+var(x2))/2)
# formula derived by Austin 2009 for binary variables
st_diff_2 <- mean(x1)-mean(x2)/sqrt(((mean(x1)*(1-mean(x1)))+(mean(x2)*(1-mean(x2))))/2)
# use with df correction
st_diff_3 <- mean(x1)-mean(x2)/sqrt(((var(x1)*(n1-1)/n1)+(var(x2)*(n2-1)/n2))/2)
This gives:
> st_diff_1
[1] -0.006632344
> st_diff_2
[1] -0.01482171
> st_diff_3
[1] -0.01482171
So the first one (as implemented in bal.tab) gives a different estimate than the other two solutions.
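One caveat when reproducing this: as written, the st_diff_* expressions have no parentheses around the numerator, so R divides only mean(x2) by the denominator before subtracting. The conclusion about the denominators is unaffected (the second and third versions still agree with each other and differ from the first), but the magnitudes are not standardized differences. Below is a sketch of the same comparison with the numerator parenthesized, plus a seed (an arbitrary choice) so it can be rerun:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x1 <- (runif(50) <= .75) + 0
x2 <- (runif(50) <= .50) + 0
n1 <- length(x1)
n2 <- length(x2)
diff_means <- mean(x1) - mean(x2)

# sample variances, no df correction
st_diff_1 <- diff_means / sqrt((var(x1) + var(x2)) / 2)
# Austin (2009) formula for binary variables
st_diff_2 <- diff_means / sqrt((mean(x1) * (1 - mean(x1)) + mean(x2) * (1 - mean(x2))) / 2)
# sample variances with df correction
st_diff_3 <- diff_means / sqrt((var(x1) * (n1 - 1) / n1 + var(x2) * (n2 - 1) / n2) / 2)

all.equal(st_diff_2, st_diff_3)                        # the two corrected versions agree
all.equal(st_diff_3, st_diff_1 * sqrt(n1 / (n1 - 1)))  # holds here because n1 == n2
```

With equal group sizes the uncorrected and corrected versions differ only by the factor sqrt(n / (n - 1)), so the gap shrinks as the sample grows.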
Does that make sense?
You're absolutely right, and I had totally overlooked that. I had originally written cobalt to be a supplement to MatchIt and wanted the results to be the same in the two packages, and MatchIt uses the sample SD formula for both binary and continuous variables. I'll look into implementing the correct formula.
Note: the new twang (and possibly the old twang) correctly use Austin's formulas.
Perfect, many thanks for implementing this so quickly! It's indeed super easy to overlook this; I just caught it by chance.
I was wondering about the specific formula you use to calculate balance diagnostics for binary variables? I have read and understood your explanation in the function documentation (https://www.rdocumentation.org/packages/cobalt/versions/3.7.0/topics/bal.tab). However, when I check the standardised solution of the function, it does not seem to be consistent with the solution by Austin, 2009 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472075/), which is often used.
So are you calculating a different standardised solution? If so, how?