Continuous outcome documentation

shaddyab commented 5 years ago

I have a few questions about working with a continuous target variable. Please let me know if this is not the right place for them, since they don’t relate directly to the actual code.

Based on the Athey, S., & Imbens, G. W. (2015) paper, the transformed outcome (Y^) equals, in expectation, the Conditional Average Treatment Effect (CATE), aka the uplift, for a given Xi. Therefore, if the original outcome column, Y, is continuous (e.g., net sales, earnings, etc.), then after transforming it to Y^, fitting a regression model (e.g., linear regression, boosted decision tree regression, etc.), and predicting for a given Xi, the model output can be interpreted as the uplift for that Xi in the original scale (e.g., uplift in net sales, uplift in earnings, etc.). Is this correct?
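As a rough check of the claim in this question, the sketch below simulates a randomized experiment with a continuous outcome and compares the average transformed outcome with the uplift that was built into the data. The helper name `transformed_outcome`, the treatment probability `p = 0.5`, and the constant effect of 2.0 are all assumptions made for illustration; this is not pylift's implementation.

```python
import numpy as np


def transformed_outcome(y, w, p):
    """Transformed outcome Y^ = Y * (W - p) / (p * (1 - p)), where p = P(W = 1)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return y * (w - p) / (p * (1.0 - p))


rng = np.random.default_rng(0)
n = 200_000
p = 0.5            # assumed treatment probability
true_uplift = 2.0  # assumed constant effect, in the units of Y

w = rng.binomial(1, p, size=n)                              # random treatment assignment
y = 10.0 + true_uplift * w + rng.normal(0.0, 1.0, size=n)   # continuous outcome

y_star = transformed_outcome(y, w, p)
estimated_uplift = y_star.mean()  # close to true_uplift, up to sampling noise
print(f"true uplift: {true_uplift}, mean of Y^: {estimated_uplift:.2f}")
```

Individual values of Y^ are very noisy (here they sit near +24 or -20), so only averages or model predictions of Y^ are interpretable as uplift in the original scale.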

Now, if we decide to apply a data transformation to the original continuous outcome column (only for the samples that correspond to a response, because the continuous values for non-response are equal to zero), fit a regression model, and then predict for a given Xi, does that mean we have to apply the inverse transformation to the model prediction for it to represent the uplift in the original scale?

In general, can positive predictions be interpreted as positive uplift and negative predictions as negative uplift? Or should the model prediction be treated only as a score/ranking index?

I would recommend improving the documentation regarding continuous outcomes and adding an example.

rsyi commented 5 years ago

Short answer: yes, in general, as long as your transformation is monotonic and the resulting transformed values remain positive, you're correct and the scale should be correct - i.e. positive values should correspond to positive uplift, negative should correspond to negative uplift.

However, regarding the inversion... yes, you would need to apply the inverse transformation to get the uplift, but there's no guarantee that the inverse transformation would even give you uplift. Whether this works or not will entirely depend on what kind of transformation you're doing on the continuous variable.

For example, let f be your transformation. If avg(f(y)) = f(avg(y)), then you can reverse the transformation and have it represent lift. But often this is not the case, and so when XGBoost averages the values of the transformed outcome in your leaf nodes, you'll have to be careful in reversing it. E.g. with a log transform, averaging in log space and then exponentiating gives a geometric mean, so it no longer precisely represents the arithmetic average uplift.
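The averaging condition above can be checked numerically. The following is a minimal sketch with made-up values: a linear transform commutes with the mean, while a log transform does not, so inverting a mean taken in log space gives the geometric mean rather than the arithmetic one.

```python
import numpy as np

y = np.array([1.0, 10.0, 100.0])  # made-up outcome values

# Linear transform: avg(f(y)) == f(avg(y)), so inverting the average is safe.
def linear(v):
    return 2.0 * v + 3.0

linear_commutes = np.isclose(linear(y).mean(), linear(y.mean()))

# Log transform: averaging in log space and exponentiating gives the
# geometric mean, which is smaller than the arithmetic mean here.
arithmetic_mean = y.mean()
back_transformed = np.exp(np.log(y).mean())

print(linear_commutes, arithmetic_mean, back_transformed)
```

The same gap appears inside each tree leaf: the leaf value is an average of transformed targets, so exponentiating it does not recover the average of the untransformed targets.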

rsyi commented 5 years ago

Adding a little bit of documentation about the continuous outcome in this PR: https://github.com/wayfair/pylift/pull/29

Will add more later to the theory section, to discuss best practices.

shaddyab commented 5 years ago

@rsyi Thank you for taking the time to update the documents and answer my earlier questions. I agree that with transformations such as log and Box-Cox the output can’t be easily inverted. However, as long as the transformation is monotonic (e.g., a log or Box-Cox transformation), we should be able to differentiate between positive output and negative output.

Let’s assume that we are given an imbalanced continuous outcome variable (e.g., net sales) such that there is no response for 90% of the samples. After balancing the data, 50% of the continuous variable samples will have a value equal to zero (i.e., no response), while the other 50% have a long-tailed/skewed distribution. In this case, we may decide to apply a log or Box-Cox transformation to the continuous values corresponding to the 50% of samples with a response (a mix of positive and negative values), while excluding from the transformation the other 50% of samples with no response (i.e., continuous variable equal to zero). We will then have a zero-inflated continuous target distribution with a significant mode at 0 (the 50% of samples with no response) and a normal distribution of positive values (the 50% of samples with a response).

At this point my approach was to apply the Transformed Outcome (Athey 2016) method to the entire log-transformed continuous variable (the zero-inflated distribution) and use DT-based approaches to model the continuous variable as a function of all the other independent variables. I also made sure to exclude the binary response variable from the independent variables due to its correlation with the continuous target variable.

Is my approach above correct, or am I missing something? I feel that the zero-inflated distribution is negatively affecting the tree-based model and that the loss function needs to be modified accordingly. Should the binary response variable be included as an independent interaction term?

rsyi commented 5 years ago

Short answer: I'd use log(1+x) instead of log, or if you need to use Box-Cox, just remove the -1 I suppose. Personally, I haven't seen many power law distributions in my outcome labels, but I'd be interested to know what data you're looking at!

Long answer: I think, in general, the transformed outcome method doesn't really care about negative values - the proof that it indeed yields lift in expectation doesn't rely on this. However, the problem with using a transformation that produces both positive and negative values is that zero no longer means zero. You now have negative values that were originally positive, and so should yield positive incremental values, but now they're averaging as negative incremental values. If you have only positive values, however, the ranking should remain the same, even with the weird averaging you're doing. So I'd suggest just using a slightly uglier transform, like log1p, to make sure you still have only positive values.
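To make the sign issue concrete, here is a small sketch with made-up sales values. A plain log maps positive values below 1 to negative numbers and is undefined at zero, whereas `np.log1p` keeps zero at zero and every positive value positive.

```python
import numpy as np

sales = np.array([0.0, 0.2, 0.5, 3.0, 40.0])  # made-up values; 0.0 means no response

# Plain log is only defined for the positive entries, and values below 1
# come out negative even though the original sale was positive.
positive_sales = sales[sales > 0]
log_values = np.log(positive_sales)

# log1p(x) = log(1 + x): zero stays zero, every positive sale stays positive.
log1p_values = np.log1p(sales)

print(log_values)
print(log1p_values)
```

Both transforms are monotonic, so ordering within the positive values is preserved either way; the difference is only in whether zero keeps its meaning as "no sale".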

shaddyab commented 5 years ago

The issue as I see it is not with the log, log1p, or Box-Cox transformation, but rather with the zero-inflated data and how to balance it. I am assuming that: 1) I shifted all my values to make them positive prior to applying the power/log transformation, so the outputs of the Box-Cox/log1p/log transformation are positive values; 2) I am OK with the predicted values not being interpretable as long as they can be ranked; 3) positive and negative model outputs can be interpreted as positive and negative lift, respectively.

For example, assuming I want to build a model for uplift in sale and the overall response rate (Control + Treatment) is:

90% No Response
10% Yes Response

Then 90% of the values of the continuous target variable in my training data, which will be used to build the regression model (e.g., linear, XGBoost, etc.), are equal to zero (90% no response ==> 90% no sale ==> 90% of the continuous variable is equal to zero). The transformation proposed by Athey, S., & Imbens, G. W. (2015) will not modify these values and will keep them equal to zero. The other 10% of the training data has nonzero values with a positively skewed distribution, and only these values will be modified by Athey’s transformation. In this case, the data need to be adjusted using methods such as SMOTE for regression (for which I couldn’t find a Python implementation).

In summary: the main issue is having a zero-inflated target variable and how to balance it, not the power transformation, which was only used to reduce the skewness of the positive values corresponding to 10% of the overall training data (in fact, this step can be skipped for tree-based models). I assume that the data need to be balanced to ensure a 1:1 response/no-response ratio (i.e., 50% of the continuous target variable equal to zero and the other 50% nonzero). Should the binary response variable be included as an independent interaction term in a tree-based model? Should the loss function be adjusted? What other solutions can be implemented to address this issue?

rsyi commented 5 years ago

Having zero inflation is not a problem, so there's no need to modify the objective function. I don't see any reason why you would want to apply SMOTE, either. The beauty of the transformed outcome objective is that minimizing MSE with the transformed outcome is equivalent to maximizing total uplift (see Hitsch, Misra "Heterogeneous Treatment Effects and Optimal Targeting Policy Evaluation" for a proof, around p. 18).
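The following simulation is a rough sketch of why zero inflation by itself does not bias the transformed outcome. All quantities are assumptions chosen for illustration: a hypothetical binary feature `segment`, a 10% baseline response rate, a treatment that raises the response rate by 5 points in segment 1 only, and lognormal spend among responders. It checks only that per-segment averages of Y^ (what a tree leaf would compute) recover the true uplift, not that any particular model will find the right splits.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400_000
p = 0.5  # assumed treatment probability

segment = rng.integers(0, 2, size=n)  # hypothetical binary feature
w = rng.binomial(1, p, size=n)        # random treatment assignment

# Roughly 89% of outcomes are zero: 10% baseline response,
# plus 5 points of lift for treated customers in segment 1.
response_prob = 0.10 + 0.05 * w * segment
responded = rng.random(n) < response_prob
spend = rng.lognormal(mean=3.0, sigma=0.5, size=n)  # skewed positive spend
y = np.where(responded, spend, 0.0)

# Transformed outcome: zeros stay zero, nonzero values become Y/p or -Y/(1-p).
y_star = y * (w - p) / (p * (1.0 - p))

mean_spend = np.exp(3.0 + 0.5 ** 2 / 2.0)  # mean of the lognormal above
true_uplift = {0: 0.0, 1: 0.05 * mean_spend}
leaf_estimate = {s: y_star[segment == s].mean() for s in (0, 1)}

zero_fraction = (y == 0).mean()
print(f"zero fraction: {zero_fraction:.3f}")
for s in (0, 1):
    print(f"segment {s}: true {true_uplift[s]:.2f}, estimate {leaf_estimate[s]:.2f}")
```

The zeros contribute variance rather than bias, so large samples (or strong regularization) matter more than rebalancing in a setup like this one.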

shaddyab commented 5 years ago

@rsyi Thank you for taking the time to address my questions. Let me double check that I understand your answer correctly. Based on your post above, I conclude the following. Suppose I have an imbalanced response rate where the binary or continuous target variable is mostly 0. The transformed outcome, Y^ = Y(W - p) / (p(1 - p)), will then also be mostly zero (Y^ = 0 * (W - p) / (p(1 - p)) = 0), with a minority of negative values when W = 0 (for control: Y^ = Y(0 - p) / (p(1 - p)) = -Y / (1 - p)) and a minority of positive values when W = 1 (for treatment: Y^ = Y(1 - p) / (p(1 - p)) = Y / p). Even so, you don’t think it will be an issue for the regression model (e.g., linear or gradient boosting, etc.) to handle such a zero-inflated distribution of the target variable. Based on my research, approaches such as Tweedie gradient boosting are used to handle unbalanced zero-inflated data. If this is not the case then I will close this issue.
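The three cases in this comment can be verified directly. This is a minimal sketch with made-up numbers (p = 0.4 and an outcome of 50), using the same formula Y^ = Y(W - p) / (p(1 - p)).

```python
import numpy as np

p = 0.4                              # assumed treatment probability P(W = 1)
y = np.array([0.0, 0.0, 50.0, 50.0]) # two non-responders, two responders
w = np.array([0, 1, 0, 1])           # control / treatment flags

y_star = y * (w - p) / (p * (1.0 - p))

# Expected by case:
#   Y = 0          -> 0, regardless of W
#   Y = 50, W = 0  -> -Y / (1 - p)
#   Y = 50, W = 1  ->  Y / p
expected = np.array([0.0, 0.0, -50.0 / (1.0 - p), 50.0 / p])
print(y_star)
```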

rsyi commented 5 years ago

Yes, that is correct, as far as I know. While there may be merits to adjusting the objective function, I'd worry because it's not clear if you would still make the splits that maximize lift. SMOTE would be fine to try, if you want, but in my experience it generally doesn't help (IIRC, you can just create a binary variable and use imblearn's implementation).

shaddyab commented 5 years ago

Great! I will close this issue and recommend enhancing the continuous outcome documentation.

wayfair / pylift

Continuous outcome documentation #31