naoki-egami / DIDdesign

R package DIDdesign: Analyzing Difference-in-Differences Design
GNU General Public License v2.0
35 stars 7 forks source link

Issue with reporting confidence intervals when calling summary() #41

Closed maxikagan closed 3 months ago

maxikagan commented 6 months ago

When calling summary() on a DIDdesign (std), the confidence intervals appear flipped between DID and sDID.

The Double-DID confidence intervals appear correct, but for each lead level of DID and sDID these two appear flipped. If I manually calculate the CIs based upon the estimate and std.error generated by the summary() call then it appears that these are mismatched.

sou412 commented 3 months ago

Thank you for reporting the issue.

Reproducing the issue

I was able to reproduce the issue. As we can see in the code chunk below, the CI for DID is not appropriate (LB > UB):

> set.seed(1234)
> fit_panel <- did(
+   formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
+     ami_pc + asian_pc + black_pc + hisp_pc,
+   data = anzia2012,
+   id_unit = "district",
+   id_time = "year",
+   design = "did",
+   is_panel = TRUE,
+   option = list(n_boot = 20, parallel = TRUE, lead = 0, se_boot = FALSE)
+ )
> summary(fit_panel)
── ATT Estimates ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   estimator lead estimate std.error statistic p_value   ci.low ci.high
1 Double-DID    0  -0.0065    0.0028      -2.3   0.019 -0.01192 -0.0011
2        DID    0  -0.0062    0.0028      -2.2   0.027 -0.01169 -0.0124
3       sDID    0  -0.0044    0.0041      -1.1   0.282 -0.00071  0.0036

Identifying causes

There are two major bugs associated with the incorrect reporting of confidence intervals.

  1. The CI estimates for DID and sDID are organized in the column-wise format: https://github.com/naoki-egami/DIDdesign/blob/b09fa0fc5df32a4f43977fca9eb3f06705bde23d/R/did-std.R#L230-L231

    However, the output is parsed row-wise: https://github.com/naoki-egami/DIDdesign/blob/b09fa0fc5df32a4f43977fca9eb3f06705bde23d/R/did-std.R#L240-L241

    This results in the inccorect CI reporting. Double-DID estimates are not affected by this bug.

  2. The bootstrap option se_boot = TRUE was not functioning. This is due to the change introduced in dac836a where se_option_gmm (introduced in 25c5f20) was replaced with se_option, while the argument passed to the estimation function remained unchanged: https://github.com/naoki-egami/DIDdesign/blob/b09fa0fc5df32a4f43977fca9eb3f06705bde23d/R/did-std.R#L60.

    Since the output from the bootstrap is organized row-wise, it created an inconsistency resulted in the issue 1.

Fix

Before the fix

Before the fix, flipping se_boot from FALSE to TRUE does not affect the reported CIs. (Also CIs for DID and sDID were incorrect).

> set.seed(1234)
> fit_panel <- did(
+   formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
+     ami_pc + asian_pc + black_pc + hisp_pc,
+   data = anzia2012,
+   id_unit = "district",
+   id_time = "year",
+   design = "did",
+   is_panel = TRUE,
+   option = list(n_boot = 20, parallel = TRUE, lead = 0, se_boot = TRUE)
+ )
> summary(fit_panel)
── ATT Estimates ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   estimator lead estimate std.error statistic p_value   ci.low ci.high
1 Double-DID    0  -0.0065    0.0028      -2.3   0.019 -0.01192 -0.0011
2        DID    0  -0.0062    0.0028      -2.2   0.027 -0.01169 -0.0124
3       sDID    0  -0.0044    0.0041      -1.1   0.282 -0.00071  0.0036

After the fix

After the fix, CIs are different based on se_boot option and the CIs are correctly reported.

> fit_panel <- did(
+   formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
+     ami_pc + asian_pc + black_pc + hisp_pc,
+   data = anzia2012,
+   id_unit = "district",
+   id_time = "year",
+   design = "did",
+   is_panel = TRUE,
+   option = list(n_boot = 20, parallel = TRUE, lead = 0, se_boot = FALSE)
+ )
> summary(fit_panel)
── ATT Estimates ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   estimator lead estimate std.error statistic p_value ci.low  ci.high
1 Double-DID    0  -0.0065    0.0028      -2.3   0.019 -0.012 -0.00108
2        DID    0  -0.0062    0.0028      -2.2   0.027 -0.012 -0.00071
3       sDID    0  -0.0044    0.0041      -1.1   0.282 -0.012  0.00361
> set.seed(1234)
> fit_panel <- did(
+   formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
+     ami_pc + asian_pc + black_pc + hisp_pc,
+   data = anzia2012,
+   id_unit = "district",
+   id_time = "year",
+   design = "did",
+   is_panel = TRUE,
+   option = list(n_boot = 20, parallel = TRUE, lead = 0, se_boot = TRUE)
+ )
> summary(fit_panel)
── ATT Estimates ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   estimator lead estimate std.error statistic p_value ci.low  ci.high
1 Double-DID    0  -0.0065    0.0028      -2.3   0.019 -0.010 -0.00108
2        DID    0  -0.0062    0.0028      -2.2   0.027 -0.010 -0.00055
3       sDID    0  -0.0044    0.0041      -1.1   0.282 -0.011  0.00323
sou412 commented 3 months ago

Tested the updated code by building the readme file. The output is in 52c0c40.