wayfair / pylift

Uplift modeling package.
http://pylift.readthedocs.io
BSD 2-Clause "Simplified" License

Input data with imbalanced outcome class? #28

Closed sherrywang15 closed 5 years ago

sherrywang15 commented 5 years ago

New to the transformed outcome method, but I found the package super interesting!

How does the method work with imbalanced outcome data though? In many marketing use cases, the proportion of customers who buy tends to be very low, so the transformed y^{*} will be 0 most of the time. How does that affect the model?
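For context, a minimal sketch of the transformed outcome as described in pylift's docs (following Athey & Imbens): y* = y(w − p) / (p(1 − p)), where w is the treatment indicator and p the treatment policy. The function name and toy data below are illustrative, not pylift API.

```python
import numpy as np

def transformed_outcome(y, w, p):
    """Transformed outcome y* = y * (w - p) / (p * (1 - p)).

    y: observed outcome (0/1), w: treatment indicator (0/1),
    p: treatment policy (probability a row was treated).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return y * (w - p) / (p * (1 - p))

# With a 50/50 treatment split: responders map to +2 (treated) or -2
# (control), and every non-responder (y == 0) maps to exactly 0 --
# which is why a rare outcome makes y* mostly zero.
y_star = transformed_outcome(y=[1, 1, 0, 0], w=[1, 0, 1, 0], p=0.5)
```

Since y* is 0 for all non-responders regardless of treatment, the regression target under a rare outcome is a spike at zero with a few large positive/negative values.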

Can we possibly take a resampling approach for the training data? However, if we sample, say, a 1:1 balanced outcome dataset, how should we estimate the treatment policy p?

Would love to hear your thoughts on whether pylift can handle such datasets.

rsyi commented 5 years ago

Hi @sherrywang15, glad you've enjoyed the package so far!

The treatment policy is not about the outcome ratio, but about the treatment ratio, and should be independent of how imbalanced your dataset is. p is generally just a constant: the percentage of rows that were given the treatment. It's also only really mandatory to specify it manually if it's variable -- e.g. if you only gave the treatment to 10% of population A, but 50% of population B.
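A quick sketch of the two cases described above, constant vs. variable p. The column names `treatment` and `segment` are assumptions for illustration, not anything pylift requires.

```python
import pandas as pd

# Toy data: segment A treated 10% of the time, segment B 50%.
df = pd.DataFrame({
    "segment":   ["A"] * 10 + ["B"] * 10,
    "treatment": [1] + [0] * 9 + [1] * 5 + [0] * 5,
})

# Constant policy: the overall fraction of rows that were treated.
p_constant = df["treatment"].mean()

# Variable policy: when treatment probability differs by segment,
# p should be a per-row value rather than a single scalar.
p_by_segment = df.groupby("segment")["treatment"].transform("mean")
```

Note that p here describes the assignment mechanism, not the outcome, so outcome imbalance doesn't enter into it at all.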

That said, you can absolutely subsample the negative cases (I often do the same thing). The only thing that will change is the scale of your qini curves. And I'd also guess that your hyperparameters will change if you change the outcome ratio...
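One way to subsample the negative cases, as suggested above, keeping all responders and downsampling non-responders. This is a generic sketch (the `outcome` column name and helper are mine, not pylift's); as noted, the Qini scale and good hyperparameters will shift under the new outcome ratio.

```python
import pandas as pd

def subsample_negatives(df, outcome_col="outcome", keep_frac=0.1, seed=0):
    """Keep all positive-outcome rows; keep a random keep_frac of negatives.

    If the negative rate differs between treatment and control, recheck
    the treatment fraction (p) on the subsampled frame afterwards.
    """
    pos = df[df[outcome_col] == 1]
    neg = df[df[outcome_col] == 0].sample(frac=keep_frac, random_state=seed)
    # Shuffle so positives aren't all at the top.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

# Example: 2 responders, 20 non-responders, keep half the negatives.
df = pd.DataFrame({"outcome": [1, 1] + [0] * 20, "treatment": [1, 0] * 11})
small = subsample_negatives(df, keep_frac=0.5)
```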

shaddyab commented 5 years ago

I have a follow-up to the previous question. Assume I have a training dataset with the following Control vs. Treatment group distribution and response rates:

[tables not preserved in this archive]

Do I need to balance both the Control/Treatment group and the response rate distributions prior to model building, or is it enough to balance the response rate to 50/50 while keeping the Control/Treatment distribution as is? Also, does the distribution of the Control/Treatment groups in test and production need to be identical to the distribution used during model building? (i.e., if the group distribution in production doesn't match the training group distribution, is the model invalid?)

sherrywang15 commented 5 years ago

@shaddyab I don't think you need to balance the Treatment/Control groups; the % treated would just be your treatment policy, so p = 0.9 in your example, and the transformed outcome will be computed from it. I think your test data should keep the original distribution, while the training data can be class-adjusted.
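To make the p = 0.9 case concrete, assuming the transformed outcome y* = y(w − p)/(p(1 − p)) from pylift's docs (my reading, so treat the formula as an assumption):

```python
# With a 90/10 treatment/control split (p = 0.9), a treated responder's
# transformed outcome is 1/p (about 1.11), while a control responder's
# is -1/(1 - p), i.e. -10. The rare control responders carry a lot of
# weight, which is why p must reflect the actual treatment fraction.
p = 0.9
y_star_treated = 1 * (1 - p) / (p * (1 - p))   # 1 / p
y_star_control = 1 * (0 - p) / (p * (1 - p))   # -1 / (1 - p)
```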

@rsyi By reading the code, I found ways to supply your own train/test data and to specify p directly instead of having it calculated from the input dataframe. Any plan to make the documentation more explicit about these?

rsyi commented 5 years ago

What sherry said is correct! Thanks!

@sherrywang15 Yes absolutely. We baked in a lot of extra functionality like this, but it only exists in the docstrings (actually, in the case of the custom train/test split specification, I don't think that's even in the docstring). Perhaps open an issue to make sure we get to it?

https://pylift.readthedocs.io/en/latest/py-modindex.html

sherrywang15 commented 5 years ago

Sounds good. Will close this issue and open a new one about documentation.