py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Multiple treatments when using grf.CausalForest #514

Open WEICHENGIT opened 3 years ago

WEICHENGIT commented 3 years ago

Hi,

This is more of a question about the grf module than an issue report. We tried to use grf.CausalForest to estimate heterogeneous causal effects with multiple treatments. In our case, the treatments are coupons of different amounts sent to customers. Should the parameter T be a one-hot encoded matrix, or just an array of coupon amounts?

Thx!

liukanglucky commented 3 years ago

Same question here. In grf.CausalForest, T should be a matrix of size (n_samples, n_treatments). If we have multiple discrete treatments, how should the parameter T be set? And what about multiple continuous treatments?

heimengqi commented 3 years ago

@WEICHENGIT If I understand your question correctly, you want to estimate the treatment effect of the coupon amount. In that case you can treat it as a continuous treatment and simply pass an array of coupon amounts.

In addition, when you want to learn heterogeneous treatment effects (HTE), we recommend using this CausalForest as the final stage of CATE estimation combined with the Double Machine Learning framework: it first residualizes the outcome and the treatment, then fits the CausalForest on the residuals. You could use CausalForestDML as below:

from econml.dml import CausalForestDML

est = CausalForestDML(cv=2,
                      criterion='mse', n_estimators=400,
                      min_var_fraction_leaf=0.1,
                      min_var_leaf_on_val=True,
                      verbose=0, discrete_treatment=False,
                      n_jobs=-1, random_state=123)

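The residualization step described above can be sketched with plain scikit-learn. This is a toy illustration of the two-stage DML idea for a single continuous treatment with a constant true effect of 2.0, not EconML's actual internals:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Toy data: treatment is confounded by X, true treatment effect is 2.0
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
T = X[:, 0] + rng.normal(size=n)                 # treatment depends on X
Y = 2.0 * T + X[:, 0] ** 2 + rng.normal(size=n)  # outcome depends on T and X

# Stage 1: residualize outcome and treatment with out-of-fold predictions
# (cross-fitting), so the nuisance models don't overfit the same samples
y_res = Y - cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=2)
t_res = T - cross_val_predict(RandomForestRegressor(random_state=0), X, T, cv=2)

# Stage 2: regress outcome residuals on treatment residuals;
# the slope recovers the treatment effect despite the confounding
theta = (t_res @ y_res) / (t_res @ t_res)
print(theta)  # should land close to the true effect of 2.0
```

The final stage here is a single OLS slope for simplicity; CausalForestDML instead fits a forest on the residuals so the effect can vary with X.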
heimengqi commented 3 years ago

@liukanglucky Similar to the answer above: when you want to learn HTE, we recommend using this CausalForest as the final stage of CATE estimation combined with the Double Machine Learning framework; it will residualize the outcome and treatment first and fit the CausalForest on the residuals.

When you have multiple discrete treatments, just pass the raw T and set discrete_treatment=True when initializing the estimator; internally we will one-hot encode it for you. When you have multiple continuous treatments, pass a matrix of size (n_samples, n_treatments).

Here is the sample code:

from econml.dml import CausalForestDML

# discrete treatment
est = CausalForestDML(discrete_treatment=True)
est.fit(Y, T, X=X, W=W)  # T is the array of discrete treatments

# continuous treatment
est = CausalForestDML(discrete_treatment=False)
est.fit(Y, T, X=X, W=W)  # T is a matrix of size (n_samples, n_treatments)
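The internal one-hot encoding mentioned above can be illustrated with scikit-learn's OneHotEncoder. This is a sketch of the general idea, not EconML's actual preprocessing code:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Raw discrete treatment: e.g. coupon tiers 0, 5, and 10
T = np.array([0, 5, 10, 5, 0]).reshape(-1, 1)

# drop='first' keeps a baseline (control) category out of the encoding,
# so effects are measured relative to the lowest tier
enc = OneHotEncoder(drop='first')
T_onehot = enc.fit_transform(T).toarray()
print(T_onehot.shape)  # (5, 2): one indicator column per non-baseline tier
```

Each non-baseline tier gets its own indicator column, which is why a K-level discrete treatment yields K-1 treatment effects.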
liukanglucky commented 3 years ago


@heimengqi Thank you for your reply! I have tried this method, but it overfits severely (evaluated with Qini curves: the training set looks much better than the test set). My task is to identify users' sensitivity to coupons, using data with randomly assigned coupon amounts and real feedback. I tried adjusting parameters (min_samples_leaf, min_samples_split, max_depth, n_estimators, ...), but it didn't help. I also tried an S-Learner (just using the coupon amount as one feature), and it seems to do better than CausalForestDML.
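For reference, the Qini curve used for evaluation here can be computed with a few lines of numpy for a binary treatment and binary outcome. The function below is an illustrative sketch (not an EconML API): rank units by predicted uplift, then track treated responders minus control responders rescaled to the treated count:

```python
import numpy as np

def qini_curve(uplift_pred, y, t):
    """Cumulative Qini values after targeting the top-k units by predicted uplift."""
    order = np.argsort(-uplift_pred)   # best predicted uplift first
    y, t = y[order], t[order]
    n_t = np.cumsum(t)                 # treated units seen so far
    n_c = np.cumsum(1 - t)             # control units seen so far
    y_t = np.cumsum(y * t)             # treated responders so far
    y_c = np.cumsum(y * (1 - t))       # control responders so far
    # rescale control responders to the treated count (0 until a control appears)
    scale = np.divide(n_t, n_c, out=np.zeros(len(t)), where=n_c > 0)
    return y_t - y_c * scale
```

A large gap between the training-set and test-set curves is exactly the overfitting symptom described above.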

vsyrgkanis commented 3 years ago

Try setting a threshold with min_var_fraction_leaf=0.1 and min_var_leaf_on_val=True

liukanglucky commented 3 years ago

Try setting a threshold with min_var_fraction_leaf=0.1 and min_var_leaf_on_val=True

@vsyrgkanis I've tried that, but it doesn't work.

WEICHENGIT commented 3 years ago

Try setting a threshold with min_var_fraction_leaf=0.1 and min_var_leaf_on_val=True

Hi,

We (@liukanglucky and I) have tried tuning min_var_fraction_leaf and other parameters to avoid overfitting, but sadly the CausalForestDML model still overfits heavily on our data.

We wonder whether this is a common situation with causal models in general, or whether our data was poorly collected. We noticed that in the use case of Causal Forest and Orthogonal Random Forest Examples.ipynb, real data is used. Has anyone checked whether the model overfits on that dataset?

Thx.

jbel1026 commented 3 years ago

@WEICHENGIT I am also seeing significant overfitting when using multiple binary treatments.

superpig99 commented 2 years ago

In my case, once the max_depth of CausalForest exceeds 13, there is obvious overfitting. Maybe limiting the depth will work. 🤔

vferraz commented 1 year ago

Is there still no solution for modeling multiple discrete treatments?