py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Weird result with CausalForestDML #584

Tommaso11397 opened this issue 2 years ago

Tommaso11397 commented 2 years ago

Hi, I am experimenting with the great EconML package, but I am getting some weird results with the CausalForestDML estimator. I am using the same code from this notebook with data from Duflo, Dupas and Kremer (2011), "Peer Effects, Teacher Incentives, and the Impact of Tracking", a clustered RCT in Kenya.

I wanted to compare the ATE estimates from CausalForestDML and ForestDRLearner. While the latter gives a plausible estimate of the ATE (0.13, against 0.11 for the Neyman estimator), the former gives very weird estimates (from 500 standard deviations to 0.35). I was able to figure out that the issue seems to lie in the interaction between CausalForestDML and the propensity score model chosen by GridSearchCVList (in my case GradientBoostingClassifier). If I use GradientBoostingClassifier as the propensity score model I recover the weird ATE of 0.35; if I instead use RandomForestClassifier I get a better estimate of 0.15; and if I use plain LogisticRegression I get an ATE of 0.12.

I guess my question is: why does this happen? Why does the "best" model according to GridSearchCVList perform so poorly when used with CausalForestDML, while the "worst" performs best? And why does this weird phenomenon not happen with ForestDRLearner?

I attach the data and the code I am using; thanks a lot for any feedback. This package is awesome and will be of great help to many.

Merged_nopercentile_nomissing.csv

import pandas as pd
import numpy as np
np.set_printoptions(threshold=np.inf)
np.random.seed(11397)

#%% Import data
df = pd.read_csv('./Merged_nopercentile_nomissing.csv')

X = df.loc[:,'girl':'district2']
X.drop('percentile', inplace=True, axis=1)
T = df.loc[:, 'tracking']
Y = df.loc[:, 'totalscore_std']

from econml.sklearn_extensions.model_selection import GridSearchCVList
from sklearn.linear_model import LassoCV, LogisticRegressionCV, Lasso, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.base import clone

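# First-stage outcome model selection: choose the best regressor for E[Y|X] by 5-fold CV.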
def first_stage_reg():
    return GridSearchCVList([Lasso(),
                             RandomForestRegressor(n_estimators=1000, random_state=123),
                             GradientBoostingRegressor(n_estimators=1000, random_state=123)],
                             param_grid_list=[{'alpha': [.001, .01, .1, 1, 10]},
                                               {'max_depth': [3, None],
                                               'min_samples_leaf': [10, 50]},
                                              {'n_estimators': [50, 100],
                                               'max_depth': [3],
                                               'min_samples_leaf': [10, 30]}],
                             cv=5,
                             scoring='neg_mean_squared_error')

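# First-stage propensity model selection: choose the best classifier for E[T|X].
# Note the scoring: MSE is computed on hard class predictions, not on predicted
# probabilities (log-loss or Brier score would score the probabilities directly).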
def first_stage_clf():
    return GridSearchCVList([LogisticRegression(),
                             RandomForestClassifier(random_state=123),
                             GradientBoostingClassifier(random_state=123)],
                             param_grid_list=[{'C': [0.01, .1, 1, 10, 100]},
                                              {'max_depth': [3, 5],
                                               'min_samples_leaf': [10, 50]},
                                              {'max_depth': [3],
                                               'min_samples_leaf': [10, 30]}],
                             cv=5,
                             scoring='neg_mean_squared_error')

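# Keep the winning hyperparameters but clone unfitted copies, so EconML can
# refit them on each cross-fitting fold.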
model_y = clone(first_stage_reg().fit(X, Y).best_estimator_)
model_t = clone(first_stage_clf().fit(X, T).best_estimator_)

from econml.dml import CausalForestDML
est = CausalForestDML(model_t=model_t,  # LogisticRegression(),
                      model_y=model_y,  # LassoCV(),
                      discrete_treatment=True,
                      cv=5,
                      n_estimators=2000,
                      random_state=123)
est.tune(Y, T, X=X).fit(Y, T, X=X, cache_values=True)
#est.summary()
ate = est.ate(X)

from econml.dr import ForestDRLearner
est2 = ForestDRLearner(model_propensity=model_t,  # GradientBoostingClassifier(),
                       model_regression=model_y,  # LassoCV(),
                       cv=5,
                       n_estimators=2000,
                       random_state=123).fit(Y, T, X=X, cache_values=True)
ate2 = est2.ate(X)
kbattocchi commented 2 years ago

Thank you for the very clear bug report. I can reproduce the issue you're seeing but don't have an immediate explanation; I'll spend some more time digging into it to see what's going on.

Tommaso11397 commented 2 years ago

Thank you, let me know if you find something.

vsyrgkanis commented 2 years ago

Try min_var_leaf_on_val=True and min_var_fraction_leaf=0.05 (or something like that).

CausalForestDML involves inverting a local matrix defined by the forest neighborhood (unlike ForestDRLearner). The default parameters do not constrain this local matrix to have strictly positive eigenvalues, and those eigenvalues relate to the local variance of the treatment (check out the docstrings of these parameters).

These parameters enforce such a lower bound; without it, the estimate can blow up when the forest creates local neighborhoods where the treatment is (nearly) deterministic.
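To see the failure mode concretely, here is a rough numeric sketch (this uses the generic residual-on-residual DML moment for a binary treatment, not EconML's exact internals): locally theta(x) is roughly E[Y_res * T_res | x] / E[T_res^2 | x], and E[T_res^2 | x] = p(x) * (1 - p(x)), so propensities near 0 or 1 shrink the denominator toward zero.

# Rough illustration: the local moment denominator Var(T|x) = p(1-p)
# collapses as p(x) approaches 0 or 1, inflating the local estimate.
for p in [0.5, 0.9, 0.99, 0.999]:
    var_t = p * (1 - p)
    print(f"p(x) = {p:5.3f}   Var(T|x) = {var_t:.6f}   1/Var(T|x) = {1 / var_t:10.1f}")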

Tommaso11397 commented 2 years ago

Thanks for the suggestion @vsyrgkanis; I re-ran the previous code incorporating it:

est = CausalForestDML(model_t=model_t,  # LogisticRegression(),
                      model_y=model_y,  # LassoCV(),
                      discrete_treatment=True,
                      cv=5,
                      n_estimators=2000,
                      random_state=123,
                      min_var_leaf_on_val=True,
                      min_var_fraction_leaf=0.05)
est.tune(Y, T, X=X).fit(Y, T, X=X, cache_values=True)

Unfortunately, the result does not change. Do you have any other suggestions or ideas?

Tommaso11397 commented 2 years ago

Hi everyone, I managed to solve the issue, and I think the fix might be useful for future users. I am quite sure the problem was caused by the GradientBoostingClassifier, which clearly overfitted the data and ended up predicting propensity scores very close to 0 or 1 by simply picking up random correlations in the data (since randomization is quite good in the original data). I was able to solve the problem by reducing the number of estimators and setting the max depth of the weak learners to 1 instead of 3:

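# Revised propensity model search: the gradient boosting grid now uses
# depth-1 weak learners and an explicit, smaller n_estimators grid.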
def first_stage_clf():
    return GridSearchCVList([LogisticRegression(),
                             RandomForestClassifier(random_state=123),
                             GradientBoostingClassifier(random_state=123)],
                             param_grid_list=[{'C': [0.01, .1, 1, 10, 100]},
                                              {'max_depth': [3, 5],
                                               'min_samples_leaf': [10, 50]},
                                              {'n_estimators': [50, 100],
                                               'max_depth': [1],
                                               'min_samples_leaf': [10, 30]}],
                             cv=5,
                             scoring='neg_mean_squared_error')

I feel that overfitting is quite crucial here, especially for propensity score estimation, because you want to distinguish genuinely extreme propensity scores, which are symptoms of selection into treatment, from artifacts of the ML model picking up random noise. The point matters because, especially with RCTs, the sample size is usually not very big, and in general one would like to estimate heterogeneity on the whole sample. Interestingly, in this case the overfitting arose even with cross-validation.
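A quick diagnostic along these lines (a sketch, not part of the original analysis) is to inspect the out-of-fold propensity predictions of the selected classifier: in a well-randomized experiment they should cluster around the assignment rate, while a pile-up near 0 or 1 signals an overfit first stage.

from sklearn.model_selection import cross_val_predict

# Out-of-fold propensity predictions from the selected classifier (model_t above).
p_hat = cross_val_predict(model_t, X, T, cv=5, method='predict_proba')[:, 1]
print(f"propensity range: [{p_hat.min():.4f}, {p_hat.max():.4f}]")
print(f"share outside [0.05, 0.95]: {((p_hat < 0.05) | (p_hat > 0.95)).mean():.3f}")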

stoddardg commented 2 years ago

@Tommaso11397 I found this example/discussion very useful, so thanks for posting. I'm not sure how possible this is with the current code, but did you try splitting by cluster when doing the cross-validation? In other words, you would want a cross-validation scheme where all data from a single site is either in the train set or in the test set, but never both. In my experience, having data from the same cluster in both train and test can cause the hyperparameter search (GridSearch in this case) to favor very deep, complex models, because they learn to fit idiosyncratic features of each site. When you cross-validate by cluster, the search ends up choosing much simpler models that don't overfit nearly as much. A sketch of what that could look like is below.
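Assuming the data carries a cluster identifier (the column name 'schoolid' below is hypothetical; substitute the actual cluster id in the dataset), the idea could be wired up like this:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

groups = df['schoolid']  # hypothetical column: the clustered randomization unit

# First stage: keep every cluster entirely within one CV fold during model selection.
clf_search = GridSearchCV(GradientBoostingClassifier(random_state=123),
                          param_grid={'max_depth': [1, 3], 'min_samples_leaf': [10, 30]},
                          cv=GroupKFold(n_splits=5))
clf_search.fit(X, T, groups=groups)

# Second stage: EconML's fit accepts a `groups` argument, so the internal
# cross-fitting also splits by cluster.
est = CausalForestDML(model_t=clf_search.best_estimator_, model_y=model_y,
                      discrete_treatment=True, n_estimators=2000, random_state=123)
est.fit(Y, T, X=X, groups=groups)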