maks-sh / scikit-uplift

:exclamation: uplift modeling in scikit-learn style in python :snake:
https://www.uplift-modeling.com
MIT License

Request for contribution: Including Interaction terms (X*T) for SoloModel #12

Closed. AdiVarma27 closed this issue 4 years ago

AdiVarma27 commented 4 years ago

Hey! I was wondering if I could contribute to scikit-uplift by adding a parameter to the SoloModel class.

According to the paper (Lo, Victor. 2002. The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing. SIGKDD Explorations. 4. 78-86.), equation (6), which gives the general form, includes interaction terms in the model.

With the change, usage would look like this:

sm = SoloModel(CatBoostClassifier(verbose=100, random_state=777), treatment_interaction=False)

Kindly let me know your thoughts.

maks-sh commented 4 years ago

Hi! I would really appreciate your contribution to the library 👍

Just a few suggestions.

First, I think it would be better not to define a parameter treatment_interaction, but to define a parameter method, similar to TwoModels. The approach that is implemented now would be called dummy (SoloModel(CatBoostClassifier(verbose=100, random_state=777), method='dummy')), and the new one would be called treatment_interaction (SoloModel(CatBoostClassifier(verbose=100, random_state=777), method='treatment_interaction')). This would be better for future extensibility. Also, a very similar approach is called SDR (shared data representation) and is described in: Artem Betlei, Criteo Research; Eustache Diemert, Criteo Research; Massih-Reza Amini, Univ. Grenoble Alpes. Dependent and Shared Data Representations improve Uplift Prediction in Imbalanced Treatment Conditions. FAIM'18 Workshop on CausalML.

Second, are there any suggestions about using not only a logistic regression estimator (as in the papers) but also a tree-based estimator? I feel this is something of a research question.
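To make the two proposed method values concrete, here is a minimal sketch of the feature layouts they could correspond to. This is an assumption about how the idea might be implemented, not the library's actual code: build_solo_features is a hypothetical helper, and the column order is arbitrary.

```python
import numpy as np


def build_solo_features(X, treatment, method="dummy"):
    """Hypothetical helper: build the single-model design matrix.

    'dummy' appends the treatment flag T as one extra column.
    'treatment_interaction' also appends the element-wise products X * T.
    """
    X = np.asarray(X, dtype=float)
    # Column vector so that X * t broadcasts row by row.
    t = np.asarray(treatment, dtype=float).reshape(-1, 1)

    if method == "dummy":
        return np.hstack([X, t])
    if method == "treatment_interaction":
        return np.hstack([X, X * t, t])
    raise ValueError(f"unknown method: {method!r}")


X = np.array([[1.0, 2.0], [3.0, 4.0]])
treatment = np.array([1, 0])

dummy = build_solo_features(X, treatment, method="dummy")
inter = build_solo_features(X, treatment, method="treatment_interaction")
```

With a linear estimator, the interaction columns let the effect of each feature differ between the treated and control groups, which a lone treatment flag cannot express.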

AdiVarma27 commented 4 years ago

Hey! Thanks for your input.

I shall include a method parameter, similar to TwoModels, which makes future extension easier.

1) I shall look into SDR (shared data representation) from the reference you provided and include it once I have fully tested it on my end.

2) Regarding the single-model approach only, according to (Lo, Victor. 2002. The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing. SIGKDD Explorations. 4. 78-86.):

In fact, equation (1) is a general supervised learning model form as f(.) may be nonlinear or other complicated functions such as step-functions (e.g. decision trees such as CART and CHAID, see [6;28]), splines ([36;38]), composite functions (e.g. multi-layer perception in neural networks [5;12]), other neural network models (e.g. [23;35]), mixture models (e.g. [37;12]), Bayesian models (e.g. [13;20]), or hybrids (e.g. [11;14;26]).

Hence, we could technically pass in any supervised model, as long as it provides a propensity estimate, i.e. a predict_proba() method.
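As a rough illustration of that point, the sketch below computes single-model uplift using only fit() and predict_proba(), so any estimator exposing those two methods could be swapped in. The names fit_predict_uplift and GroupRateClassifier are hypothetical; the toy classifier exists only to keep the example self-contained and is not a realistic model.

```python
import numpy as np


class GroupRateClassifier:
    """Hypothetical toy estimator: predicts the observed response rate
    of the group given by the last column (the treatment flag)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        flag = X[:, -1]
        self.rate_ = {v: y[flag == v].mean() for v in (0.0, 1.0)}
        return self

    def predict_proba(self, X):
        flag = np.asarray(X, dtype=float)[:, -1]
        p1 = np.where(flag == 1.0, self.rate_[1.0], self.rate_[0.0])
        # Two columns, P(y=0) and P(y=1), as scikit-learn classifiers return.
        return np.column_stack([1.0 - p1, p1])


def fit_predict_uplift(estimator, X, treatment, y):
    """Fit on [X, T], then score every row with T=1 and with T=0."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(treatment, dtype=float).reshape(-1, 1)
    estimator.fit(np.hstack([X, t]), y)

    as_treated = np.hstack([X, np.ones_like(t)])
    as_control = np.hstack([X, np.zeros_like(t)])
    # Uplift = P(y=1 | x, T=1) - P(y=1 | x, T=0)
    return (estimator.predict_proba(as_treated)[:, 1]
            - estimator.predict_proba(as_control)[:, 1])


X = np.array([[0.1], [0.2], [0.3], [0.4]])
treatment = np.array([1, 1, 0, 0])
y = np.array([1, 1, 1, 0])

uplift = fit_predict_uplift(GroupRateClassifier(), X, treatment, y)
```

Whether a tree-based model actually makes good use of the treatment column is a separate question; the sketch only shows that the interface requirement is small.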