Continuous Response Variable: P-values

uber / causalml

Uplift modeling and causal inference with machine learning algorithms

Continuous Response Variable: P-values #244

Closed soodimilanlouei closed 3 years ago

soodimilanlouei commented 3 years ago

I'm training a random forest model, where the response variable is continuous. When I look at one tree from the forest, the p-values are always NaN. Why is that?

paullo0106 commented 3 years ago

Thanks for reaching out. Are you using UpliftRandomForestClassifier? Currently, the uplift tree only supports classification; covering the regression use case is on the roadmap.

soodimilanlouei commented 3 years ago

Thanks for the response. Yes, I'm using UpliftRandomForestClassifier. When I use UpliftTreeClassifier with the continuous response variable, it raises an error (the tree is empty). Interestingly, however, UpliftRandomForestClassifier does train a model, and I can plot the gain and lift graphs as well. I can also visualize one of the trees in the forest, but the p-values are always NaN. So I assume I should not trust these results, right?

paullo0106 commented 3 years ago

Correct. These results do not come from a proper regression tree implementation.
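Since the forest trains without complaint on a continuous response, one way to catch this early is to check the outcome yourself before fitting. Below is a minimal sketch of such a guard. The helper names `is_binary_outcome` and `check_outcome_for_uplift_classifier` are hypothetical and not part of causalml, and the check assumes the outcome is encoded as 0/1.

```python
def is_binary_outcome(y):
    """Return True if every value in y equals 0 or 1 (e.g. 0, 1, 1.0, True)."""
    return all(v in (0, 1) for v in y)


def check_outcome_for_uplift_classifier(y):
    """Hypothetical pre-fit guard: raise if the outcome is not 0/1 encoded."""
    if not is_binary_outcome(y):
        raise ValueError(
            "Outcome is not binary (0/1); the uplift tree classifiers "
            "discussed in this thread only support classification."
        )
    return True


# Example usage with made-up outcomes
binary_y = [0, 1, 1, 0, 1.0]
continuous_y = [0.2, 3.5, 1.0, 7.1]

check_outcome_for_uplift_classifier(binary_y)      # passes silently
try:
    check_outcome_for_uplift_classifier(continuous_y)
except ValueError as err:
    print(f"Rejected: {err}")
```

Calling the guard right before `fit` turns the silent NaN p-values into an explicit error.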

soodimilanlouei commented 3 years ago

In this example, shouldn't the base learners for R-Learner be RandomForestClassifier instead of RandomForestRegressor since the response variable is binary (Conversion)?

paullo0106 commented 3 years ago

I think you're right, thanks for flagging that! Ideally, the feature_selection.ipynb example notebook should use synthetic_data() to generate a dataset with a continuous target variable, rather than a binary one, for the regressors.

soodimilanlouei commented 3 years ago

Another question I have concerns the arguments of BaseXClassifier. In this link, it says that:

"outcome_learner (optional): a model to estimate outcomes in both the control and treatment groups. Should be a regressor."

"effect_learner (optional): a model to estimate treatment effects in both the control and treatment groups. Should be a classifier."

Shouldn't this be the other way around, considering that we are dealing with a classification problem? That is, outcome_learner should be a classifier and effect_learner should be a regressor. In lines 670 and 672, predict_proba is called, which is defined for classifiers and not for regressors. And since the probability of belonging to class 1 is subtracted from the actual Y, the new response variables (d_c and d_t) are continuous and need a regressor for fitting.
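To make the argument concrete, the imputed responses can be written out as below. This is only a sketch following the standard X-learner formulation: the hat notation and the sign convention are assumptions, not a transcription of the causalml source, and the numbers in the worked example are made up.

```latex
% \hat{\mu}_c, \hat{\mu}_t: outcome models fit on control / treatment units,
% each returning a predicted P(Y = 1 | X) via predict_proba.
\begin{aligned}
d_t &= Y_t - \hat{\mu}_c(X_t), \\
d_c &= \hat{\mu}_t(X_c) - Y_c.
\end{aligned}

% Worked example for one treated unit with Y_t = 1 and \hat{\mu}_c(X_t) = 0.3:
% d_t = 1 - 0.3 = 0.7, which is neither 0 nor 1.
```

Because Y is 0 or 1 and the predicted probability lies in [0, 1], both d_t and d_c fall in [-1, 1] and are generally not binary, which is why the second-stage model has to be a regressor.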

paullo0106 commented 3 years ago

Your points make sense to me: the docstring for the BaseXClassifier parameters needs correcting, since predict_proba() is only available on classifiers. @ppstacy @jeongyoonlee, or others, can you confirm that for an X-Learner classifier, the outcome learners (M1 and M2) are classifiers and the effect learners (M3 and M4) are regressors?

jeongyoonlee commented 3 years ago

Thanks @soodimilanlouei and @paullo0106. Yes, in BaseXClassifier, outcome_learner should be a classifier while effect_learner should be a regressor. Please feel free to submit a PR. Thanks again!
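For anyone landing here later, the split between the two stages can be sketched with plain scikit-learn. This is an illustrative X-learner written from scratch, not the BaseXClassifier implementation: the synthetic data, the random forest choices, and the constant propensity of 0.5 are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up data: 3 features, random treatment assignment, binary outcome.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
w = rng.integers(0, 2, size=n)                     # 1 = treated, 0 = control
p = 1.0 / (1.0 + np.exp(-(X[:, 0] + 0.8 * w)))     # true P(Y = 1)
y = rng.binomial(1, p)

# Stage 1: outcome learners are classifiers (they must expose predict_proba).
mu_c = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[w == 0], y[w == 0])
mu_t = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[w == 1], y[w == 1])

# Imputed treatment effects: binary outcome minus a probability is continuous.
d_t = y[w == 1] - mu_c.predict_proba(X[w == 1])[:, 1]
d_c = mu_t.predict_proba(X[w == 0])[:, 1] - y[w == 0]

# Stage 2: effect learners are regressors, fit on the continuous targets.
tau_t = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[w == 1], d_t)
tau_c = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[w == 0], d_c)

# Combine both estimates, weighting by an assumed constant propensity score.
e = 0.5
cate = e * tau_c.predict(X) + (1 - e) * tau_t.predict(X)
print(cate[:5])
```

Swapping the stage 2 regressors for classifiers would fail or give meaningless results, because d_t and d_c are not class labels.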

paullo0106 commented 3 years ago

The argument documentation was fixed in PR #251.