uber / causalml

Uplift modeling and causal inference with machine learning algorithms

Question: Counterfactual value estimation vs. Regression with meta learners #318

Closed · jurgispods closed this issue 2 years ago

jurgispods commented 3 years ago

Hi,

We are in the process of creating personalized promotion campaigns, i.e. displaying vouchers to certain customers visiting our online shop.

I stumbled upon the Counterfactual value estimation notebook, which seems to solve a very similar case (having promotion costs associated with each conversion, e.g. 10% of the transaction value).

However, I am not quite sure I understand the difference between the CounterfactualValueEstimator and, say, a BaseXRegressor. Couldn't I use a BaseXRegressor with a continuous outcome variable that already incorporates the promotion costs as a "net value" outcome?

For example, if I have two customers A and B in my treatment group with a transaction value of 10 dollars each, and only A used the 10% promotion code, my outcome values would be y_A = 9 and y_B = 10.

I understand that the CounterfactualValueEstimator also enables us to consider promotion costs that are incurred independently of conversion, but I could easily subtract promotion costs from every treated customer's outcome value as well.
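Concretely, what I have in mind is something like this (a rough sketch; the column names, feature list and the 10% voucher cost are made up for illustration):

```python
from causalml.inference.meta import BaseXRegressor
from xgboost import XGBRegressor

# df is assumed to be a DataFrame with customer features, a treatment label,
# the transaction value, and a 0/1 flag for whether the voucher was redeemed.
feature_cols = ["f1", "f2", "f3"]              # placeholder feature names
X = df[feature_cols].values
treatment = df["treatment"].values             # e.g. "control" / "voucher"

# "Net value" outcome: transaction value minus the 10% promotion cost,
# charged only when the voucher was actually redeemed.
y_net = df["transaction_value"] - 0.10 * df["transaction_value"] * df["voucher_redeemed"]

learner = BaseXRegressor(learner=XGBRegressor())
cate_hat = learner.fit_predict(X=X, treatment=treatment, y=y_net.values)
```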

Is there something I am missing here or could I just as well use the base regressor classes for this use case?

t-tte commented 3 years ago

Hello

That’s a great question. The benefit of the counterfactual approach is that it considers what would happen under alternative treatment conditions. For example, whether or not it’s worth targeting someone with a given predicted lift under the treatment condition depends not only on the predicted lift and the associated costs but also on what the conversion probability of that individual would be under the control condition.

For a thorough formal treatment, I recommend this paper by Li and Pearl.

jurgispods commented 3 years ago

Thank you for your explanation and the paper reference. (I am quite new to causal inference theory and find the counterfactual logic notation hard to digest, but that will hopefully change after reading Pearl's "Causal Inference in Statistics" primer.)

To illustrate my current understanding with a simple conversion (classification) example: Suppose we know the true conversion probabilities of a customer A under each treatment (T=0 control, T=1,2 treatments):

P(Y=1 | T=0) = 0.95
P(Y=1 | T=1) = 0.96
P(Y=1 | T=2) = 0.97

As I understand it, a pure uplift modeling approach gives you an estimate of the uplift with respect to the control group, e.g. the expected difference in conversion probability, while the counterfactual value estimation approach also considers absolute conversion probabilities for each treatment. A perfect uplift model (using a large randomized dataset with predictive features) would therefore give us the following treatment effects:

CATE(T=1) = 0.01
CATE(T=2) = 0.02

and we would choose treatment 2, as it has the highest uplift.

However, the counterfactual value estimator would also consider the absolute conversion probabilities in each treatment/control setting and use them within the objective function. Depending on the cost of treatments 1 and 2 and on the conversion value, it might not be reasonable to apply treatment 2 to this customer, as they already convert with very high probability under control, so assigning no treatment at all might be the most reasonable choice. Did I understand the conceptual difference correctly? If so, I have a follow-up question.
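To make sure I'm reading the objective correctly, here is a toy expected-net-value comparison of what I mean (the conversion value and per-treatment costs are made up):

```python
# Toy numbers: conversion probabilities from above, a fixed conversion value,
# and per-treatment costs that are only charged when the customer converts.
conversion_value = 10.0
conversion_prob = {"control": 0.95, "treatment1": 0.96, "treatment2": 0.97}
conversion_cost = {"control": 0.0, "treatment1": 0.5, "treatment2": 1.5}

expected_net = {
    t: p * (conversion_value - conversion_cost[t])
    for t, p in conversion_prob.items()
}
# control:    0.95 * 10.0         = 9.50
# treatment1: 0.96 * (10.0 - 0.5) = 9.12
# treatment2: 0.97 * (10.0 - 1.5) = 8.245
best = max(expected_net, key=expected_net.get)
# -> "control": despite having the largest uplift, treatment 2 is the worst
#    choice here once costs and the high baseline conversion are accounted for.
```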

Uplift modeling with cutoff value vs. counterfactual value estimation

In practice, I would not "blindly" assign the treatment with the highest predicted uplift to every customer. Instead, I would look at the gain curve for my test data and set a cutoff value c (a lower bound on the estimated CATE) that yields the highest cumulative gain.
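A rough sketch of what I mean by picking the cutoff (hand-rolled rather than using causalml's own gain-curve utilities; column names are assumed and edge cases at the very top of the ranking are ignored):

```python
import numpy as np
import pandas as pd

def pick_cutoff(cate_hat, y_net, treated):
    """Return the predicted-CATE value at which the cumulative gain peaks."""
    df = (pd.DataFrame({"cate": cate_hat, "y": y_net, "t": treated})
            .sort_values("cate", ascending=False)
            .reset_index(drop=True))
    n_t = df["t"].cumsum().clip(lower=1)
    n_c = (1 - df["t"]).cumsum().clip(lower=1)
    mean_t = (df["y"] * df["t"]).cumsum() / n_t
    mean_c = (df["y"] * (1 - df["t"])).cumsum() / n_c
    # Cumulative gain: estimated lift among the top-k customers, scaled by k.
    gain = (mean_t - mean_c) * np.arange(1, len(df) + 1)
    return df["cate"].iloc[gain.idxmax()]

# cutoff = pick_cutoff(cate_hat, y_net, treated)
# -> treat only customers with predicted CATE >= cutoff
```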

The Li and Pearl paper mentions the problem that the gain P(Y=1 | T=1) - P(Y=1 | T=0) maximized by an uplift model does not strictly distinguish between "compliers" and "defiers", since the first term includes both "compliers" and "always-takers" while the second term includes "always-takers" and "defiers". However, by choosing a suitable cutoff value to select the customers to treat, my hope would be to include all the compliers and exclude the always-takers, never-takers and defiers, since the compliers have the highest CATE, again assuming good data with predictive features.

I would expect this idea to extend to the regression case, where Y does not only take the values 0 or 1 but is a continuous outcome, e.g. 0 for non-converting customers and otherwise the transaction value (minus the treatment's conversion cost, if applicable).

So my question is: Given good data and careful model validation to arrive at a robust cutoff value for the estimated CATE, would that make the uplift modeling and the counterfactual value estimator approach equivalent in terms of performance?

Side note

And one more thing regarding the example notebook: It seems a little unfair that the uplift model is only allowed to choose between treatment1 and treatment2, while the counterfactual value estimator is allowed to choose control as well. In particular, when the estimated uplift for both treatments is negative, in practice one would probably choose no treatment at all. When I modify the code to allow for that case, the tm_value goes up from 8.79 to 9.22. That is still significantly lower than the value achieved by the counterfactual value estimator, though.
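For reference, the modification I made is essentially this (variable names only approximate the notebook's code):

```python
import numpy as np

# cate_hat: predicted CATE per customer and treatment, shape (n_samples, n_treatments)
best_idx = cate_hat.argmax(axis=1)
best_cate = cate_hat.max(axis=1)

# 0 = control, 1..k = treatments; fall back to control whenever even the
# best predicted uplift is negative.
tm_best = np.where(best_cate > 0, best_idx + 1, 0)
```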

jurgispods commented 3 years ago

Sorry for the text wall, but here comes yet another question: Suppose I want to operationalize the counterfactual value estimator. For that, I would train an uplift model (for CATE) and a classification model (for conversion probabilities) on offline data.

However, at prediction time on unseen data, the estimator also needs the treatment, the conversion values/costs and the impression costs. I could assume constant conversion values (in the context of online promotions, that might be the average transaction value) and constant costs (e.g. the average promotion value). A more sophisticated approach would be to train yet another model that predicts the conversion value for a customer given features X.

Since I don't have a treatment (yet), as that is the variable I'd like to predict, do I simply use the control group as the treatment input?
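In code, the flow I have in mind looks roughly like this (my model variable names are made up, the conversion value is assumed constant, and the CounterfactualValueEstimator arguments and array shapes are copied from memory of the notebook, so please correct me if they differ):

```python
import numpy as np
from causalml.optimize import CounterfactualValueEstimator

# Models trained offline on the randomized campaign data (names are mine):
#   uplift_model   - multi-treatment CATE model
#   conversion_clf - classifier for P(conversion | X)
cate_new = uplift_model.predict(X_new)                  # assumed shape (n, n_treatments)
proba_new = conversion_clf.predict_proba(X_new)[:, 1]   # conversion probability

n = X_new.shape[0]
conversion_value = np.full(n, 25.0)   # assumed average transaction value
conversion_cost = np.zeros((n, 3))    # per-group cost charged only on conversion
conversion_cost[:, 1:] = 2.5          # e.g. 10% voucher on the average value
impression_cost = np.zeros((n, 3))    # cost charged on treatment regardless of conversion

cve = CounterfactualValueEstimator(
    treatment=np.repeat("control", n),   # no treatment assigned yet
    control_name="control",
    treatment_names=["treatment1", "treatment2"],
    y_proba=proba_new,
    cate=cate_new,
    value=conversion_value,
    conversion_cost=conversion_cost,
    impression_cost=impression_cost,
)
best_group = cve.predict_best()   # value-maximizing group per customer
```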

t-tte commented 3 years ago

Thanks for the interesting comments. You're making a very good point about the uplift curve. In principle, if you have two treatments and perform a very careful analysis of the uplift curve, you might be able to distinguish the compliers from the rest. In practice, and especially if you have multiple treatment groups, it might be difficult to do this. As regards prediction on unseen data, I agree with what you're saying about the more sophisticated approach, and as far as I remember it should be fine to use the control group as the treatment input. However, if you run into any issues, let us know. We also need to eventually start updating the documentation around the value optimisation methods, as it's pretty limited at the moment.