matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html

Issue #333: Chapter 25

Closed: swlee88 closed this issue 1 year ago

swlee88 commented 1 year ago

Thank you so much for such an amazing book. I really appreciate the unbelievable depth and clarity of the explanations. While I was reading Chapter 25 (Synthetic Difference-in-Differences), I found three issues.

1. join_weights function

When joining the weights in SDID, the chapter reads: "This joining process will leave null for the unit weights in the treated group and for the time weights in the post-treatment period. Fortunately, because we use uniform weighting in both cases, it is pretty easy to fill out those nulls. For the time weights, we fill with the average of the post-treatment dummy, which will be 1/Tpost. For the unit weights, we fill with the average of the treated dummy, which will be 1/Ntr. Finally, we multiply both weights together."

The problem is that the average of the post-treatment dummy is not the same as 1/Tpost: over all rows of a balanced panel it equals Tpost/T. Likewise, the average of the treated dummy equals Ntr/N, not 1/Ntr.

Therefore, I propose changing the code to fill the time weights in the post-treatment period with 1/Tpost and the unit weights in the treated group with 1/Ntr. Please see the current code and the proposed code below.

# This is the current code
def join_weights(data, unit_w, time_w, year_col, state_col, treat_col, post_col):
    return (
        data
        .set_index([year_col, state_col])
        .join(time_w)
        .join(unit_w)
        .reset_index()
        .fillna({time_w.name: data[post_col].mean(),   # incorrect line here
                 unit_w.name: data[treat_col].mean()})  # incorrect line here
        .assign(**{"weights": lambda d: (d[time_w.name] * d[unit_w.name]).round(10)})
        .astype({treat_col: int, post_col: int}))

# I propose this code instead
import pandas as pd  # needed for pd.unique below

def join_weights(data, unit_w, time_w, year_col, state_col, treat_col, post_col):
    return (
        data
        .set_index([year_col, state_col])
        .join(time_w)
        .join(unit_w)
        .reset_index()
        .fillna({time_w.name: 1 / len(pd.unique(data.query(f"{post_col}")[year_col])),    # new line here
                 unit_w.name: 1 / len(pd.unique(data.query(f"{treat_col}")[state_col]))})    # new line here    
        .assign(**{"weights": lambda d: (d[time_w.name] * d[unit_w.name]).round(10)})
        .astype({treat_col: int, post_col: int}))
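
To see the discrepancy concretely, here is a minimal check on a made-up balanced panel (4 states, 5 years, one treated state, two post-treatment years; all names and numbers are hypothetical):

# Minimal check of the fill values on a toy panel
import pandas as pd

toy = pd.DataFrame([(s, y) for y in range(2000, 2005) for s in "abcd"],
                   columns=["state", "year"])
toy["treated"] = toy["state"] == "d"          # 1 treated state out of 4
toy["after_treatment"] = toy["year"] >= 2003  # 2 post years out of 5

# The current fills are the dummy means over all rows:
print(toy["after_treatment"].mean())  # 0.4  = Tpost / T
print(toy["treated"].mean())          # 0.25 = Ntr / N

# The uniform weights the chapter text actually describes:
print(1 / toy.query("after_treatment")["year"].nunique())  # 0.5 = 1 / Tpost
print(1 / toy.query("treated")["state"].nunique())         # 1.0 = 1 / Ntr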

2. Visualization of ATT in SDID graph

[Figure: SDID plot from Chapter 25 (25-Synthetic-Diff-in-Diff_41_0)]

When interpreting the above figure, the chapter currently reads: "The difference between the two solid purple lines is the estimated ATT."

If I understood the chapter correctly, this is incorrect: the estimated ATT is the difference between the solid purple line at the bottom and the dashed purple line.
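
For reference, the plot is only an illustration; if I read the chapter correctly, the point estimate itself comes from a weighted DiD regression on the output of join_weights. A minimal sketch, assuming the chapter's cigsale outcome and the 0/1 dummies that join_weights produces:

# Sketch: SDID point estimate as the interaction coefficient of a weighted DiD
import statsmodels.formula.api as smf

def sdid_att(did_data, outcome="cigsale"):
    fit = smf.wls(f"{outcome} ~ after_treatment * treated",
                  data=did_data,
                  weights=did_data["weights"] + 1e-10).fit()  # tiny constant avoids all-zero weights
    return fit.params["after_treatment:treated"]

As I read the figure, this coefficient is the post-treatment gap between the bottom solid line and the dashed counterfactual, not the gap between the two solid lines.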

3. Compatibility issue when simulating 3 additional treated states

When creating the new data with 3 additional treated states, the current code assigns np.nan to the non-treated states. However, np.nan is not compatible with the code that comes after. Therefore, I propose using False instead of np.nan.

# This is the current code
new_data = (
    pd.concat([data, tr_state])
    .assign(**{"after_treatment":
               lambda d: np.where(d["treated"], d["after_treatment"], np.nan)}))

# I propose this code instead
new_data = (
    pd.concat([data, tr_state])
    .assign(**{"after_treatment":
               lambda d: np.where(d["treated"], d["after_treatment"], False)}))
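
Here is a minimal demonstration (on a made-up two-row frame) of why np.nan breaks the downstream steps: mixing booleans with np.nan upcasts the column to float, and the NaNs then make the later cast to int fail:

# Why np.nan breaks the later steps
import numpy as np
import pandas as pd

df = pd.DataFrame({"treated": [True, False],
                   "after_treatment": [True, False]})

df["after_treatment"] = np.where(df["treated"], df["after_treatment"], np.nan)
print(df["after_treatment"].dtype)   # float64, with NaN for the untreated row

df.astype({"after_treatment": int})  # raises: cannot convert non-finite values to integer

With False instead of np.nan, the column stays boolean and the astype step inside join_weights goes through.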