py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Causal Forest DML has very wide confidence interval #893

Open silulyu opened 5 days ago

silulyu commented 5 days ago

Below is my code to estimate treatment effects. The Causal Forest DML model gives a much wider confidence interval for the ATT (roughly [-200k, 900k]) than the linear DML model does ([200k, 400k]). Are there ways to make the Causal Forest DML CI narrower, and ideally statistically significant?

# rf_reg, xgb_class, X, Y, T, df, and treatment_variable are defined earlier in my script
from econml.dml import LinearDML, CausalForestDML
from sklearn.model_selection import StratifiedKFold
import pandas as pd

# Linear DML for the ATE
dml = LinearDML(
    model_y=rf_reg,
    model_t=xgb_class,
    discrete_treatment=True,
    random_state=0,
    cv=StratifiedKFold(5))

print('Fitting linear DML...')
dml.fit(Y=Y, T=T, W=X)
ate = dml.intercept_
ate_lb, ate_ub = dml.intercept__interval()

print('DML ATE:', round(ate, 2), 'CI [', round(ate_lb, 2), ',', round(ate_ub, 2), ']', '$')

# Causal Forest for the ITE
cf = CausalForestDML(
    model_y=rf_reg,
    model_t=xgb_class,
    discrete_treatment=True,
    cv=StratifiedKFold(5),
    random_state=0,
    n_estimators=300,
)

print('Fitting causal forest ...')
cf.fit(Y=Y, T=T, X=X, cache_values=True)

# ITE estimates with lower and upper interval bounds
ite_estimates = cf.effect(X)
lb_estimates, ub_estimates = cf.effect_interval(X)

# DataFrame with individual ITE estimates and their bounds
all_individual_effects_df = pd.DataFrame({
    'ITE': ite_estimates,
    'ITE_lb': lb_estimates,
    'ITE_ub': ub_estimates
}, index=df.index)

# Concatenate with other relevant data
all_ITEs = pd.concat([df[['sfdc_customer_id']], T, Y, all_individual_effects_df], axis=1)

# Average over the treated group to get the ATT
treated = all_ITEs[all_ITEs[treatment_variable] == 1]
att = treated['ITE'].mean()
att_lb = treated['ITE_lb'].mean()
att_ub = treated['ITE_ub'].mean()

print('CF ATT:', round(att, 2), 'CI [', round(att_lb, 2), ',', round(att_ub, 2), ']', '$')
kbattocchi commented 3 days ago

One thing you can try is to use the tune method on the forest before fitting, which should help you set appropriate hyperparameters. Also, you might want to do some model selection for your first stage models to ensure that you're getting the best possible first-stage fits.
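A minimal sketch of that first-stage selection, using scikit-learn's cross-validation to compare candidate outcome models before handing the winner to the DML estimator. The candidate models, synthetic data, and scoring choice here are illustrative assumptions, not part of the original thread:

```python
# Hypothetical sketch: pick the first-stage outcome model (model_y) by
# cross-validated R^2 before passing it to LinearDML/CausalForestDML.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice use your own X and Y.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

candidates = {
    'rf': RandomForestRegressor(n_estimators=100, random_state=0),
    'lasso': LassoCV(cv=3),
}
# Mean CV score for each candidate first-stage model.
scores = {name: cross_val_score(est, X, y, cv=3, scoring='r2').mean()
          for name, est in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]
print('first-stage model_y choice:', best_name, scores)

# The chosen estimator would then be passed as model_y (with an analogous
# classifier search for model_t). Before fitting the forest, one could also
# call cf.tune(Y=Y, T=T, X=X) to search forest hyperparameters, as suggested.
```

An analogous search over classifiers (scored with, e.g., log loss) would pick model_t for the discrete treatment.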

However, in general you should expect the confidence intervals for forest-based methods to be wider than those for linear regression: the linear model is much more restrictive and therefore easier to estimate. But keep in mind that the confidence intervals assume the model's assumptions are met, so if the true data-generating process is not linear, those tighter bounds are not necessarily correct!