maks-sh / scikit-uplift

Uplift modeling in scikit-learn style in Python
https://www.uplift-modeling.com
MIT License

Superfluous treatment feature in a notebook with an example #87

Closed: Kxvptev closed this issue 2 years ago

Kxvptev commented 3 years ago

Hello!

When you create the training dataset, you save the treatment flags in a separate pd.Series, which is passed to the model during training along with X_train. However, you do not drop the treatment column from X_train, so the model is trained on two treatment features at once.

When the model is applied, we do not know whether the client was contacted; we need to estimate the uplift and use it to decide whether contacting the client is worthwhile. At that stage, treatment is set to 0 and then to 1, and the difference between the model's predictions for the two values is calculated. If treatment is not dropped from X_train during training, then applying the model to data with the same features but without treatment will, at a minimum, raise an error about a missing feature.

The main problem, though, is the mismatch with the logic of the approach, as described in the theory section of the same scikit-uplift article. Please correct this small but serious error in your example. In my case, at least, it became a bottleneck in my work.
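The train/apply logic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the column names (`age`, `spend`, `treatment`) and the helper `design` are hypothetical, and a plain least-squares fit stands in for whatever estimator the notebook actually uses. It is not the scikit-uplift implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: two features, a random treatment flag, and an
# outcome whose treatment effect is +1.0 by construction.
df = pd.DataFrame({
    "age": rng.normal(size=n),
    "spend": rng.normal(size=n),
    "treatment": rng.integers(0, 2, size=n),
})
y = 0.5 * df["age"] + 1.0 * df["treatment"] + rng.normal(scale=0.1, size=n)

# Keep the treatment flag in its own Series and drop it from the features,
# so the model sees exactly one treatment column.
treatment = df["treatment"]
X = df.drop(columns="treatment")


def design(features: pd.DataFrame, treat) -> np.ndarray:
    """Stack an intercept, the features, and a single treatment column."""
    n_rows = len(features)
    treat = np.broadcast_to(np.asarray(treat, dtype=float), (n_rows,))
    return np.column_stack([np.ones(n_rows), features.to_numpy(), treat])


# Stand-in estimator (an assumption): ordinary least squares.
coef, *_ = np.linalg.lstsq(design(X, treatment), y.to_numpy(), rcond=None)

# At application time the treatment is unknown, so score every row twice:
# once as if treated (1) and once as if not (0), then take the difference.
uplift = design(X, 1) @ coef - design(X, 0) @ coef
```

Because the stand-in model is linear, the estimated uplift here is the same for every row; a nonlinear estimator would let it vary by client. The point is only that the treatment flag enters the model once, through the separate Series.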

Sincerely, Gleb Koptev!

maks-sh commented 3 years ago

Hi!

Thank you for your feedback!

Wow, this is a major mistake.

Could you tell me which examples do this?

Kxvptev commented 3 years ago

RetailHero.ipynb and RetailHero_EN.ipynb. In the cells numbered 3, there are no lines of code that drop the treatment column.

maks-sh commented 3 years ago

Sorry for the late reply.

X_train is created from df_features, which doesn't have a treatment column:

[image]

Could you please provide the code that reproduces the error you described?
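For anyone checking their own copy of the notebook, a check along these lines might help narrow things down. It is a minimal sketch with made-up data: `df_features` and `X_train` are the names used in this thread, while `df_train`, `treat_train`, the column names, and the `client_id` index are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's data; column names are assumed.
df_features = pd.DataFrame(
    {"age": [25, 40, 31], "spend": [10.0, 3.5, 7.2]},
    index=pd.Index([101, 102, 103], name="client_id"),
)
df_train = pd.DataFrame(
    {"treatment_flg": [1, 0, 1], "target": [1, 0, 0]},
    index=pd.Index([101, 102, 103], name="client_id"),
)

# Features come only from df_features; treatment and target stay separate.
X_train = df_features.loc[df_train.index]
treat_train = df_train["treatment_flg"]

# Any treatment-like column that leaked into the features would show up here.
leaked = [col for col in X_train.columns if "treat" in col.lower()]
print(leaked)
```

An empty list means the features carry no treatment column, so the model receives the flag only through the separate Series.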