matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html

Issue #333: Chapter 25

Closed: swlee88 closed this issue 1 year ago

swlee88 commented 1 year ago

Thank you so much for such an amazing book. I really appreciate the unbelievable depth and clarity of the explanations. While I was reading Chapter 25 (Synthetic Difference-in-Differences), I found three issues.

1. join_weights function

When joining the weights in SDID, the chapter reads: "This joining process will leave null for the unit weights in the treated group and for the time weights in the post-treatment period. Fortunately, because we use uniform weighting in both cases, it is pretty easy to fill out those nulls. For the time weights, we fill with the average of the post-treatment dummy, which will be 1/Tpost. For the unit weights, we fill with the average of the treated dummy, which will be 1/Ntr. Finally, we multiply both weights together."

The problem is that the average of the post-treatment dummy is not the same as 1/Tpost: over all rows of a balanced panel it equals Tpost/T. Likewise, the average of the treated dummy equals Ntr/N, not 1/Ntr.

Therefore, I propose changing the code to fill the time weights in the post-treatment period with 1/Tpost and the unit weights in the treated group with 1/Ntr. Please see the current code and the proposed code below.

# This is the current code
def join_weights(data, unit_w, time_w, year_col, state_col, treat_col, post_col):
    return (
        data
        .set_index([year_col, state_col])
        .join(time_w)
        .join(unit_w)
        .reset_index()
        .fillna({time_w.name: data[post_col].mean(),   # incorrect line here
                 unit_w.name: data[treat_col].mean()})  # incorrect line here
        .assign(**{"weights": lambda d: (d[time_w.name] * d[unit_w.name]).round(10)})
        .astype({treat_col: int, post_col: int}))

# I propose this code instead
import pandas as pd  # needed for pd.unique below

def join_weights(data, unit_w, time_w, year_col, state_col, treat_col, post_col):
    return (
        data
        .set_index([year_col, state_col])
        .join(time_w)
        .join(unit_w)
        .reset_index()
        .fillna({time_w.name: 1 / len(pd.unique(data.query(f"{post_col}")[year_col])),    # new line here
                 unit_w.name: 1 / len(pd.unique(data.query(f"{treat_col}")[state_col]))})    # new line here    
        .assign(**{"weights": lambda d: (d[time_w.name] * d[unit_w.name]).round(10)})
        .astype({treat_col: int, post_col: int}))
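
To see the discrepancy concretely, here is a minimal check on a made-up balanced panel (4 states, 5 years, one treated state, two post-treatment years; all names and numbers are hypothetical):

# Minimal check of the fill values on a toy panel
import pandas as pd

toy = pd.DataFrame([(s, y) for y in range(2000, 2005) for s in "abcd"],
                   columns=["state", "year"])
toy["treated"] = toy["state"] == "d"          # 1 treated state out of 4
toy["after_treatment"] = toy["year"] >= 2003  # 2 post years out of 5

# The current fills are the dummy means over all rows:
print(toy["after_treatment"].mean())  # 0.4  = Tpost / T
print(toy["treated"].mean())          # 0.25 = Ntr / N

# The uniform weights the chapter text actually describes:
print(1 / toy.query("after_treatment")["year"].nunique())  # 0.5 = 1 / Tpost
print(1 / toy.query("treated")["state"].nunique())         # 1.0 = 1 / Ntr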

2. Visualization of ATT in SDID graph

[Figure: SDID plot from Chapter 25 (25-Synthetic-Diff-in-Diff_41_0)]

When interpreting the above figure, the chapter currently reads: "The difference between the two solid purple lines is the estimated ATT."

If I understood the chapter correctly, this is incorrect: the estimated ATT is the difference between the solid purple line at the bottom and the dashed purple line.
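
For reference, the plot is only an illustration; if I read the chapter correctly, the point estimate itself comes from a weighted DiD regression on the output of join_weights. A minimal sketch, assuming the chapter's cigsale outcome and the 0/1 dummies that join_weights produces:

# Sketch: SDID point estimate as the interaction coefficient of a weighted DiD
import statsmodels.formula.api as smf

def sdid_att(did_data, outcome="cigsale"):
    fit = smf.wls(f"{outcome} ~ after_treatment * treated",
                  data=did_data,
                  weights=did_data["weights"] + 1e-10).fit()  # tiny constant avoids all-zero weights
    return fit.params["after_treatment:treated"]

As I read the figure, this coefficient is the post-treatment gap between the bottom solid line and the dashed counterfactual, not the gap between the two solid lines.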

3. Compatibility issue when simulating 3 additional treated states

When creating the new data with 3 additional treated states, the current code assigns np.nan to the non-treated states. However, np.nan is not compatible with the code that comes after. Therefore, I propose using False instead of np.nan.

# This is the current code
new_data = (
    pd.concat([data, tr_state])
    .assign(**{"after_treatment":
               lambda d: np.where(d["treated"], d["after_treatment"], np.nan)}))

# I propose this code instead
new_data = (
    pd.concat([data, tr_state])
    .assign(**{"after_treatment":
               lambda d: np.where(d["treated"], d["after_treatment"], False)}))
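
Here is a minimal demonstration (on a made-up two-row frame) of why np.nan breaks the downstream steps: mixing booleans with np.nan upcasts the column to float, and the NaNs then make the later cast to int fail:

# Why np.nan breaks the later steps
import numpy as np
import pandas as pd

df = pd.DataFrame({"treated": [True, False],
                   "after_treatment": [True, False]})

df["after_treatment"] = np.where(df["treated"], df["after_treatment"], np.nan)
print(df["after_treatment"].dtype)   # float64, with NaN for the untreated row

df.astype({"after_treatment": int})  # raises: cannot convert non-finite values to integer

With False instead of np.nan, the column stays boolean and the astype step inside join_weights goes through.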