What's the synth.summary() about?

cuusa commented 1 year ago

I'm a little confused about the output of summary(). As shown below, why is the sample mean of infrate much larger than the values for the treated and synthetic units? Also, infrate for the treated unit is 2.595; how should I interpret this 2.595? Is it the mean of infrate over the pre-intervention period, the post-intervention period, or the whole period?

sdfordham commented 1 year ago

The sample mean is a unit constructed as the mean of all the control units, i.e. it is the same as a synthetic control with equal weights. The infrate value for the treated unit is the aggregated value of infrate for West Germany over the selected period: if we look at the Dataprep object, we have time_predictors_prior=range(1981, 1991) and predictors_op="mean", so it is the mean of infrate over the period 1981-1990.
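To make the two quantities concrete, here is a minimal sketch using plain pandas on made-up toy data. The column names, unit labels and numbers are assumptions for illustration only, and this is not the package's own implementation:

```python
import pandas as pd

# Toy panel data (invented numbers, not the West Germany dataset).
df = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [1988, 1989, 1990] * 3,
    "infrate": [2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 1.0, 2.0, 3.0],
})

treated_unit = "A"
controls = ["B", "C"]
pre_period = list(range(1988, 1991))  # 1988..1990; the end of a range is exclusive

pre = df[df["year"].isin(pre_period)]

# "treated": the predictor averaged over the selected period for the treated unit.
treated_value = pre.loc[pre["country"] == treated_unit, "infrate"].mean()

# "sample mean": average each control over the period, then weight all controls equally.
per_control = pre[pre["country"].isin(controls)].groupby("country")["infrate"].mean()
sample_mean = per_control.mean()

print(treated_value, sample_mean)
```

In this toy example the treated value is the mean of 2, 3 and 4, while the sample mean is the equal-weight average of the two control means (8 and 2).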

It's not surprising that the sample mean is very different because, as I said, this is the "naïve synthetic control" where we just average over all the control units. We would hope (if the synthetic control method has been successful) that the predictor values for the synthetic and treated units are close, since this means that the synthetic control does in fact look like the treated unit in the pre-intervention period for our selected predictors.

The purpose of this summary function is to try to give the user some kind of answer to the following question: what does the treated unit look like vs. the synthetic control unit vs. a naïve average over all control units, in the pre-intervention period, from a predictor point of view? If we see that some predictor is badly matched in the synthetic control, but the sample mean is also way off, then this can give us an idea about e.g. the uniqueness of the treated unit vis-à-vis this predictor. It helps with analysing whether we think we have a good synthetic control.
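One way to read such a table programmatically is sketched below. The table is a hand-written stand-in for the three columns discussed above; the predictor names other than infrate, all of the numbers, and the 10% tolerance are arbitrary assumptions rather than anything the package produces or recommends:

```python
import pandas as pd

# Hypothetical table with the three columns discussed above (invented values).
summary = pd.DataFrame(
    {
        "treated": [2.6, 27.0, 55.0],
        "synthetic": [2.7, 27.1, 40.0],
        "sample mean": [7.6, 25.9, 41.0],
    },
    index=["infrate", "predictor_b", "predictor_c"],
)

# Relative distance of each comparison unit from the treated unit, per predictor.
scale = summary["treated"].abs()
gap = pd.DataFrame({
    "synthetic gap": (summary["synthetic"] - summary["treated"]).abs() / scale,
    "sample mean gap": (summary["sample mean"] - summary["treated"]).abs() / scale,
})

tolerance = 0.10  # arbitrary threshold chosen for this illustration

# Predictors the synthetic control fails to match.
badly_matched = gap.index[gap["synthetic gap"] > tolerance].tolist()

# Predictors that neither the synthetic control nor the naive average matches,
# which may hint that the treated unit is unusual on that predictor.
hard_to_match = gap.index[
    (gap["synthetic gap"] > tolerance) & (gap["sample mean gap"] > tolerance)
].tolist()

print(gap.round(3))
print(badly_matched, hard_to_match)
```

Here infrate is well matched by the synthetic control even though the naïve average is far off, while predictor_c is missed by both.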

(Note also that this function is identical to the corresponding function in the R package Synth.)

cuusa commented 1 year ago

Thanks for your reply! What should I do if I want to compare the mean of the dependent variable between the treated unit and the synthetic control in the post-intervention period? Are there any functions in the Synth class that do this?

sdfordham commented 1 year ago

The package doesn't provide this right now, but you can use the code below. For example, in the West Germany notebook, insert a new code block after the last one and run it (the code should work in any case, provided you pick a sensible time range and a predictor that exists in the dataprep data).

def summary_post(synth, predictor, time_period):
    dataprep = synth.dataprep

    # Values of the chosen variable for the treated unit, indexed by time
    treated = dataprep.foo[
        dataprep.foo[dataprep.unit_variable] == dataprep.treatment_identifier
    ].set_index(dataprep.time_variable)[predictor]

    # Same variable for the control units: one row per time period, one column per unit
    controls = dataprep.foo[dataprep.foo[dataprep.unit_variable].isin(dataprep.controls_identifier)]
    controls_piv = controls.pivot(
        index=dataprep.time_variable,
        columns=dataprep.unit_variable,
        values=predictor
    )

    # X0 is only needed here for the order of its control-unit columns
    X0, _ = dataprep.make_covariate_mats()

    # Reorder the control columns to match X0 before applying the weights synth.W
    controls_piv = controls_piv[X0.columns]
    synthetic_ser = (controls_piv * synth.W).sum(axis=1)

    # Average both series over the requested time period
    return {
        "treated unit": treated.loc[time_period].mean(),
        "synthetic control": synthetic_ser.loc[time_period].mean()
    }

summary_post(synth, "gdp", range(1990, 2001))
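As a side note on the weighted sum used above, the following self-contained toy sketch shows why the column order of the pivoted controls has to agree with the order of the weight vector, which is presumably the reason for the reordering step. The unit labels, weights and numbers are all invented for illustration and do not come from the package:

```python
import numpy as np
import pandas as pd

# Toy outcome values for two control units over three years (invented numbers).
controls_piv = pd.DataFrame(
    {"B": [10.0, 20.0, 30.0], "C": [100.0, 200.0, 300.0]},
    index=[1990, 1991, 1992],
)

# Hypothetical weights, given in the order ["C", "B"] rather than the pivot's column order.
weight_order = ["C", "B"]
W = np.array([0.25, 0.75])

# Multiplying a DataFrame by a 1-D array pairs columns with weights by position,
# so the columns are reordered to the weight order first.
aligned = controls_piv[weight_order]
synthetic = (aligned * W).sum(axis=1)

# Without the reordering, each weight is applied to the wrong unit.
misaligned = (controls_piv * W).sum(axis=1)

print(synthetic.loc[[1990, 1991]].mean(), misaligned.loc[[1990, 1991]].mean())
```

With the columns aligned, the 1990 value is 0.25 * 100 + 0.75 * 10; without alignment the weights are silently swapped and the result differs.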

cuusa commented 1 year ago

I tried your code and it works well. I am truly grateful for your support. The pysyncon package has made a tremendous impact on my project, and your willingness to lend a helping hand showcases your remarkable skills and kindness. Once again, thank you from the bottom of my heart for your invaluable assistance. I truly appreciate your time, effort, and expertise.

sdfordham commented 1 year ago

Thank you for your kind words. I'm glad to help.

sdfordham / pysyncon

What's the synth.summary() about? #23