py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Categorical Covariates in Forest models #471

Open maerory opened 2 years ago

maerory commented 2 years ago

I am using DROrthoForest to analyze CATE for different populations. Since DROrthoForest does not support string categorical variables, I am encoding them as integers to use as categorical variables.

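# DR OrthoForest
import itertools

import numpy as np
from econml.orf import DROrthoForest
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# df is the input DataFrame; sex and age_group are already integer-coded
Y = np.ravel(df[["target_y"]])   # outcome
T = np.ravel(df[["treatment"]])  # treatment
W = df[["income","month"]]       # controls
X = df[["sex", "age_group"]]     # features the CATE is conditioned on

est = DROrthoForest(n_trees=100, max_depth=5, subsample_ratio=1,
                   propensity_model=GradientBoostingClassifier(),
                   model_Y=GradientBoostingRegressor())
est.fit(Y,T,X=X,W=W)

# Every (sex, age_group) pair with sex in {0, 1} and age_group in 0..9
X_test = np.array(list(itertools.product([0,1], range(10))))
X_test.shape
infer = est.effect_inference(X=X_test)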

I want to find the CATE for each sex-age_group combination, say that age group is 10. So I am testing with [male(0), 10s(1)], [male(0), 20s(2)], ..., [female(1), 50s(4)]. However, I noticed that inference on combinations outside the input data (e.g. [0, 6], [1, 10]) also worked, although the results were not very statistically significant. If X was set when fitting, shouldn't inference only be available within the scope of the input combinations? Or am I doing something wrong?
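One possible reading, offered as an assumption rather than a confirmed description of the library's behavior: integer-coded columns may simply be treated as numeric features, in which case the estimator would return an estimate for any numeric row, including combinations it never saw. A minimal sketch of restricting the test grid to observed combinations, using hypothetical toy values in place of the real (sex, age_group) columns:

```python
import itertools

# Toy stand-in for the integer-coded training features (sex, age_group).
# These values are hypothetical; in practice they would come from the
# same columns that were passed as X to fit().
X_train = [(0, 1), (0, 2), (1, 1), (1, 3), (0, 2), (1, 4)]

# Combinations that actually occur in the training data.
observed = set(X_train)

# Full candidate grid: 2 sex codes x 10 age_group codes.
candidates = list(itertools.product([0, 1], range(10)))

# Keep only the grid points that are backed by training data.
X_test = [combo for combo in candidates if combo in observed]

# Grid points without training support; any estimate there is extrapolation.
unsupported = [combo for combo in candidates if combo not in observed]

print(X_test)
print(len(unsupported))
```

With a DataFrame, `df[["sex", "age_group"]].drop_duplicates().to_numpy()` should give the same set of observed rows, which could then be passed as the test X.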

maerory commented 2 years ago

Any updates on this issue? Or is DROrthoForest not used as much as other models?