py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Understanding on Discrete Treatment (p>2) Inference with CausalForestDML #496

Open AllardJM opened 3 years ago

AllardJM commented 3 years ago

Hello!

I didn't see any examples with a discrete treatment taking more than two values (>2) together with a binary outcome. I am hopeful someone can confirm my understanding.

This data set is from a marketing campaign where customers received one of three treatments (https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html):

The outcome I chose here is whether or not the customer visited after the campaign.

Let's say the research question is whether the treatment effect depends on the customer's prior purchase categories (the mens and womens binary indicators in the data).

Here I am encoding the treatment as numeric (1, 2, 3) for the three categories and using a regression wrapper so that a classifier can be used as the outcome model, since econml expects model_y to make regression-style (continuous) predictions.

import econml
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
import xgboost

import warnings
warnings.filterwarnings("ignore")

from sklearn.base import BaseEstimator, clone

class RegressionWrapper(BaseEstimator):
    """Wrap a binary classifier so that predict() returns the
    positive-class probability, letting it serve as a regression model."""

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y, **kwargs):
        # clone so the original estimator passed in is never mutated
        self.clf_ = clone(self.clf)
        self.clf_.fit(X, y, **kwargs)
        return self

    def predict(self, X):
        # return P(y = 1 | X) rather than hard class labels
        return self.clf_.predict_proba(X)[:, 1]
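
As a quick smoke test (my own addition, not part of the original workflow), the wrapper can be checked to return class-1 probabilities rather than hard labels:

from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

wrapped = RegressionWrapper(LogisticRegression()).fit(X_demo, y_demo)
preds = wrapped.predict(X_demo)
assert (preds >= 0).all() and (preds <= 1).all()  # probabilities, not labels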

# read data and create indicator variables    
dat = pd.read_csv('http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

dat['phone'] = np.where(dat.channel == 'Phone',1,0)
dat['web'] = np.where(dat.channel == 'Web',1,0)
dat['multi'] = np.where(dat.channel == 'Multichannel',1,0)

dat['suburban'] = np.where(dat.zip_code == 'Suburban',1,0)
dat['rural'] = np.where(dat.zip_code == 'Rural',1,0)
dat['urban'] = np.where(dat.zip_code == 'Urban',1,0)

# treatment 
dat['test_numeric'] = 3  # womens
dat['test_numeric'] = np.where(dat.segment == 'No E-Mail',1,dat['test_numeric'].values) # control
dat['test_numeric'] = np.where(dat.segment == 'Mens E-Mail',2,dat['test_numeric'].values) # mens

# train / test split
X_train, X_test, y_train, y_test = train_test_split(dat.drop('visit',axis=1), dat[['visit']], test_size=0.50, random_state=42)

# treatment, confounders / nuisance variables, and the two variables of interest
T = X_train['test_numeric']
W = X_train[['phone','web','multi','history','recency']]
X = X_train[['mens','womens']]

# outcome
Y = y_train

# model for the treatment (multi-class)
xgb_model_mc = xgboost.XGBClassifier(objective="multi:softmax", num_class=3, random_state=42)
# model for the outcome
xgb_model = xgboost.XGBClassifier(objective="binary:logistic", random_state=42)

causal_forest = CausalForestDML(criterion='het', 
                                n_estimators=5000,       
                                min_samples_leaf=10, 
                                max_depth=5, 
                                max_samples=0.5,
                                discrete_treatment=True,  # discrete treatments
                                honest=True,
                                inference=True,
                                cv=10,
                                model_t=xgb_model_mc, # model to use for treatments
                                model_y=RegressionWrapper(xgb_model), # model for y
                                )

# fit train data to causal forest model 
causal_forest.fit(Y=Y.values.ravel(), T=T.values, X=X.values, W=W.values)

The inference for the treatment effect of Womens email versus no email is below (a sketch of the analogous Mens call follows the result):

#treatment effect (womens email - no email) when the customers purchased......
# 1) only from womens and not mens 
# 2) both womens and mens 
# 3) only mens
# 4) neither 

X = np.array([[0,1],[1,1],[1,0],[0,0]])
infer_result = causal_forest.effect_inference(X=X, T0=1, T1=3)
result_pd = infer_result.summary_frame()
result_pd.index=['Only Womens', 'Both Mens and Womens', 'Only Mens', 'Neither']
result_pd

and the result:

[images: summary_frame() output showing point estimates, standard errors, p-values, and confidence intervals for the four X profiles]
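
The Mens comparison is the analogous call with T1=2. A minimal sketch following the same pattern (my addition, not in the original post):

# treatment effect (mens email - no email) for the same four X profiles
infer_result_mens = causal_forest.effect_inference(X=X, T0=1, T1=2)
infer_result_mens.summary_frame()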

Is this the proper way to conduct this analysis using Causal Forest?

jaydeepchakraborty commented 1 year ago

@AllardJM , I am also confused about how to interpret the heterogeneous treatment effect point estimates.

In your example, the treatment is categorical: 'No E-Mail' = 1, 'Mens E-Mail' = 2, 'Womens E-Mail' = 3.

In your inference: infer_result = causal_forest.effect_inference(X=X, T0=1, T1=3)

From this discussion (issue-676): the treatment effect is the estimated average effect on Y of moving from T=1 to T=3, given X.
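
In potential-outcomes notation (my reading of that thread), the quantity reported for a row x is the conditional average treatment effect

theta(x) = E[ Y(T=3) - Y(T=1) | X = x ],

i.e. the expected change in the outcome from switching a customer with features x from 'No E-Mail' to 'Womens E-Mail', with W adjusted for as confounders.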

Let's consider the first test sample, X[[0]] (the 'Only Womens' profile): the point estimate is 0.074.

If we want to describe the effect on this first test sample: if the treatment is changed from T=1 ('No E-Mail') to T=3 ('Womens E-Mail'), then the probability of Y ("customer visit") is estimated to increase by 0.074.
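
As a rough cross-check (my own sketch, assuming the dat frame and the test_numeric coding from the original post), one can compare the CATE against the unadjusted difference in visit rates for that subgroup; the two will not match exactly because the forest also controls for W:

# naive (unadjusted) visit-rate difference for the "only womens" subgroup
sub = dat[(dat.womens == 1) & (dat.mens == 0)]
raw_diff = (sub.loc[sub.test_numeric == 3, 'visit'].mean()
            - sub.loc[sub.test_numeric == 1, 'visit'].mean())
print(raw_diff)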

Is this the correct understanding?