py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Endogeneity issues #287

Open federiconuta opened 3 years ago

federiconuta commented 3 years ago

Hi all and thanks for the nice work you are doing.

I am working on a panel dataset and would like to know how the algorithm deals with endogeneity in a panel data structure. In particular, with your help I have applied the algorithm using the SparseCATE routine with debiased inference, also accounting for time, group and individual fixed effects. My question is whether the method accounts for a possible relationship between the features/treatment and unobservables.

Checking the online documentation, I saw that there is a Z called the instrument. How is it used in the algorithm? Do I need to use it, or have I already accounted for possible endogeneity by using the routines mentioned above?

Thank you in advance,

Federico

federiconuta commented 3 years ago

To give you a quick example of some of the difficulties I am running into, I copy-pasted the example from the code section for DML and made a few modifications to adapt it to the DMLIV estimator:

# Main imports
from econml.dml import DMLCateEstimator, LinearDMLCateEstimator, SparseLinearDMLCateEstimator
# IV estimators used below (from the econml.ortho_iv package)
from econml.ortho_iv import DMLATEIV, DMLIV

# Helper imports
import numpy as np
from itertools import product
from sklearn.linear_model import (Lasso, LassoCV, LogisticRegression, LogisticRegressionCV,
                                  LinearRegression, ElasticNetCV,
                                  MultiTaskElasticNet, MultiTaskElasticNetCV)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split

%matplotlib inline

# Treatment effect function
def exp_te(x):
    return np.exp(2*x[0])

# DGP constants
np.random.seed(123)
n = 6000
n_w = 30
support_size = 5
n_x = 5
# Outcome support
support_Y = np.random.choice(np.arange(n_w), size=support_size, replace=False)
coefs_Y = np.random.uniform(0, 1, size=support_size)
epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Treatment support
support_T = support_Y
coefs_T = np.random.uniform(0, 1, size=support_size)
eta_sample = lambda n: np.random.uniform(-1, 1, size=n)

# Generate controls, covariates, treatments and outcomes
W = np.random.normal(0, 1, size=(n, n_w))
X = np.random.uniform(0, 1, size=(n, n_x))
# Heterogeneous treatment effects
TE1 = np.array([x_i[0] for x_i in X])
TE2 = np.array([x_i[0]**2 for x_i in X]).flatten()
T = np.dot(W[:, support_T], coefs_T) + eta_sample(n)
Y = TE1 * T + TE2 * T**2 + np.dot(W[:, support_Y], coefs_Y) + epsilon_sample(n)
# Generate test data
X_test = np.random.uniform(0, 1, size=(100, n_x))
X_test[:, 0] = np.linspace(0, 1, 100)

At this point I constructed a simple DMLIV estimator together with a DMLATEIV and the one provided by the example:

T = T.reshape(-1,1)
est = DMLATEIV(model_Y_X = ElasticNetCV(),
                model_T_X = MultiTaskElasticNetCV(),
                model_Z_X = MultiTaskElasticNetCV(),
                discrete_treatment=False, discrete_instrument=False, n_splits = 2)

#est = LinearDMLCateEstimator(model_y=GradientBoostingRegressor(n_estimators=100, max_depth=3, min_samples_leaf=20),
#                             model_t=MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100,
#                                                                                    max_depth=3,
#                                                                                    min_samples_leaf=20)),
#                             featurizer=PolynomialFeatures(degree=2, include_bias=False),
#                             linear_first_stages=False,
#                             n_splits=5)

#est = DMLIV(model_Y_X = ElasticNetCV(cv=[(fold00, fold11), (fold11, fold00)]),
#                model_T_X = MultiTaskElasticNetCV(cv=[(fold00, fold11), (fold11, fold00)]),
#                model_T_XZ = MultiTaskElasticNetCV(cv=[(fold00, fold11), (fold11, fold00)]),
#                model_final = ElasticNetCV(),
#                discrete_treatment=False, discrete_instrument=False, n_splits = [(fold0, fold1), (fold1, fold0)], random_state=None)

#est.fit(Y, np.concatenate((T, T**2), axis=1), X, W, inference='statsmodels')
est.fit(Y, np.concatenate((T, T**2), axis=1), X, W)

Now, while all three estimators seem to work up to this point, the problem arises with the IV estimators when I try to call const_marginal_effect:

est.const_marginal_effect(X_test).shape

the error being:

Dimension mis-match of X with fitted X

It seems to work with DMLIV when I type

est.const_marginal_effect(W).shape

which outputs a (6000, 2) array.

kbattocchi commented 3 years ago

To answer your initial question, some of our methods assume that there are no unobserved confounders; these include all of the estimators in the econml.dml package.

Other estimators allow unobserved confounders, but require the use of an instrument (these include the estimators in the econml.ortho_iv package). Just to be clear, there are strong requirements on what it means to be a valid instrument: it needs to be an observed variable that directly affects the treatment assignment T but does not directly affect the outcome Y (but does implicitly affect it via the effect on T). This is the extra structure that allows us to estimate the treatment effect even when there may be other unobserved confounders that may affect both T and Y.
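As a minimal sketch of the point above (plain NumPy on simulated data, not EconML code): when an unobserved confounder U drives both T and Y, a naive regression of Y on T is biased, while a valid instrument Z recovers the effect. The data-generating process, the coefficients and the helper name `ols_slope` are all assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 2.0  # assumed constant treatment effect for this toy example

U = rng.normal(size=n)                      # unobserved confounder: affects both T and Y
Z = rng.normal(size=n)                      # instrument: affects T, no direct path to Y
T = 1.0 * Z + 1.0 * U + rng.normal(size=n)  # treatment depends on Z and on U
Y = true_effect * T + 3.0 * U + rng.normal(size=n)

def ols_slope(a, b):
    """Slope from a simple least-squares regression of b on a (hypothetical helper)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

# Naive regression ignores U; its population value under this DGP is 3.0, not 2.0.
naive_estimate = ols_slope(T, Y)

# IV (Wald) ratio: uses only the variation in T that is induced by Z.
iv_estimate = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]

print(f"naive: {naive_estimate:.2f}, IV: {iv_estimate:.2f}, truth: {true_effect}")
```

The IV ratio only works here because Z is both relevant (it moves T) and excluded (it enters Y only through T); if either condition fails, the estimate is no longer trustworthy.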

I'll take a closer look at the error you're seeing in your follow-up.

kbattocchi commented 3 years ago

Here's at least one thing that's going wrong: our documentation for DMLATEIV's fit method is not quite right. Because it computes an average treatment effect (rather than a heterogeneous effect), it takes controls W but not features X, and therefore no X should be passed to the effect methods. Note that the documentation correctly shows that the third argument to fit should be Z, but you're passing X there instead. I'd recommend passing arguments by name to make sure that things are being interpreted as you expect (e.g. fit(Y=Y, T=T, Z=Z, W=W)).
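To illustrate why passing by name helps, here is a small sketch using a hypothetical stand-in function. It is not the real DMLATEIV.fit; it only mimics the argument order (Y, T, Z, W) described above, and the string placeholders stand in for arrays.

```python
def fit(Y, T, Z, W=None):
    """Hypothetical stand-in that just records which value landed in which slot."""
    return {"Y": Y, "T": T, "Z": Z, "W": W}

# Positional call in the style of the snippet earlier in the thread:
# the third value is silently bound to Z, even though the caller meant it as X.
positional = fit("Y_data", "T_data", "X_data", "W_data")
print(positional["Z"])  # the features ended up in the instrument slot

# Keyword call: every value is bound to the name the caller intended.
by_name = fit(Y="Y_data", T="T_data", Z="Z_data", W="W_data")

# Passing a name the function does not accept fails loudly instead of silently.
try:
    fit(Y="Y_data", T="T_data", X="X_data", W="W_data")
except TypeError as err:
    keyword_error = str(err)
    print(keyword_error)
```

With keyword arguments, a mismatch between what the caller intends and what the signature accepts surfaces as a TypeError at the call site rather than as a confusing dimension error later on.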

federiconuta commented 3 years ago

Thanks @kbattocchi for the prompt reply. That is indeed the case, and I modified the code to use the following W_test:

W_test = np.random.uniform(0, 1, size=(100, 30))
W_test[:, 0] = np.linspace(0, 1, 100)

But what I now get when running

m_eff = est.const_marginal_effect(W_test)

is an array with the right dimensions (100, 2) but containing all zeros. I don't know whether this is because I am assuming a random instrument or because something is actually wrong in the code. Could you please help me figure this out?
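One cheap way to probe the "random instrument" hypothesis is to check how much of T the instrument explains in a first-stage regression. The sketch below uses plain NumPy on simulated data; the variable names and the helper `first_stage_r2` are assumptions for illustration, it is not an EconML diagnostic, and it does not by itself explain the all-zero output.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

Z_relevant = rng.normal(size=n)            # instrument that actually moves T
Z_random = rng.normal(size=n)              # pure noise, independent of T by construction
T = 0.8 * Z_relevant + rng.normal(size=n)  # toy treatment for this illustration

def first_stage_r2(z, t):
    """R^2 of a simple regression of t on z, i.e. the squared correlation."""
    return np.corrcoef(z, t)[0, 1] ** 2

r2_relevant = first_stage_r2(Z_relevant, T)
r2_random = first_stage_r2(Z_random, T)

print(f"first-stage R^2, relevant instrument: {r2_relevant:.3f}")
print(f"first-stage R^2, random instrument:   {r2_random:.3f}")
```

A first-stage R^2 near zero means the instrument carries essentially no information about T, so any IV estimate built on it will be unstable or uninformative.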

federiconuta commented 3 years ago

@kbattocchi One last question before closing: does DMLIV estimate heterogeneous effects, or the ATE like DMLATEIV?

Thank you