wayfair / pylift

Uplift modeling package.
http://pylift.readthedocs.io
BSD 2-Clause "Simplified" License

Memory Error & NIV Dictionary Query #19

Open khaashif opened 5 years ago

khaashif commented 5 years ago

Hi,

I was hoping you guys could please help me!

My dataset is around 270k rows with around 60 variables (a 60 MB file). When I try to call NIV, I run into a memory error. To work around this, I have run NIV on a sample of at most ~180k rows with success; I then refer to the sample's NIV dictionary and select all variables with an NIV higher than, say, 0.03. I then select these variables from my full 270k-row dataset and build my model on that. A rough sketch of this workaround is shown below.
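For concreteness, the workaround would look roughly like this. This is a minimal sketch: the file path and column names are placeholders, and the `NIV_dict` attribute and the shape of its values are assumptions based on the pylift docs.

```python
import numpy as np
import pandas as pd
from pylift import TransformedOutcome

df = pd.read_csv('data.csv')  # hypothetical path, ~270k rows

# Run NIV on a subsample small enough to fit in memory.
sample = df.sample(n=180_000, random_state=0)
up_sample = TransformedOutcome(sample, col_treatment='Treatment', col_outcome='Outcome')
up_sample.NIV()

# Keep features whose NIV clears the threshold. NIV_dict values may be
# arrays of bootstrap estimates, so average them to be safe (assumption).
keep = [f for f, v in up_sample.NIV_dict.items() if np.mean(v) > 0.03]

# Refit on the full dataset, restricted to the selected features.
up = TransformedOutcome(df[keep + ['Treatment', 'Outcome']],
                        col_treatment='Treatment', col_outcome='Outcome')
up.fit()
```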

However, my cumulative gains plot always shows a negative trend, with the cgains curve falling below the random selection line.

My guess is that this problem is the result of one of two things:

Possible solutions/questions I have:

Are there any other solutions you guys would recommend? Apologies - I am fairly new to machine learning in general, so I'm still learning a lot!

Many thanks for your help in advance; it's much appreciated!

Khaashif

rsyi commented 5 years ago

I'd have to defer to the expertise of @WTFrost on the memory requirements of NIV, as he built the module. I'll look into what the numbers mean and get back to you.

Because the NIV and NWOE modules are essentially crude one-dimensional cuts for EDA, I'd actually recommend not using them for feature selection. To be honest, though, we never came up with a great way of selecting features. I'd often just look at the feature importances from xgboost (both "gain" and "weight") after building a first model with all the features, and pick the most important of these from each category.
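A minimal sketch of that importance-based selection, using a toy model as a stand-in. With pylift you would presumably inspect `up.model` (the fitted XGBRegressor by default); that attribute name is an assumption.

```python
import numpy as np
from xgboost import XGBRegressor

# Toy stand-in for the fitted learner; with pylift, swap in up.model
# after up.fit() (assumption based on pylift's default XGBRegressor).
X = np.random.rand(1000, 5)
y = np.random.rand(1000)
model = XGBRegressor(n_estimators=50).fit(X, y)

booster = model.get_booster()
gain = booster.get_score(importance_type='gain')      # avg loss reduction per split
weight = booster.get_score(importance_type='weight')  # split count per feature

# Rank features under each metric and keep the top performers of each.
print(sorted(gain, key=gain.get, reverse=True))
print(sorted(weight, key=weight.get, reverse=True))
```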

If any uplift exists, you can usually find it by creating an outcome model on the treatment group alone. I'd try that first, and make sure you can get a positive cumulative gains curve when ranking by this outcome prediction. If you can't, then it probably is a problem with your dataset (your features may just not be predictive enough). An outcome model built this way should be able to find all the "persuadables", though it will mix some "sure things" in with them. As long as the population contains some "lost causes" or "sleeping dogs", such a model should at least be able to separate the responders out. Outcome models are also a lot more stable to build than uplift models; the risk is only that you overspend if you use them for targeting. You can use the UpliftEval class to evaluate the performance of such a model.
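A rough sketch of that approach, fitting on the treated rows only and then scoring everyone. The file path and column names are placeholders, and the UpliftEval signature and plot type are assumptions based on the pylift docs.

```python
import pandas as pd
from xgboost import XGBClassifier
from pylift.eval import UpliftEval

df = pd.read_csv('data.csv')  # hypothetical path
features = [c for c in df.columns if c not in ('Treatment', 'Outcome')]

# Outcome model trained on the treatment group only.
treated = df[df['Treatment'] == 1]
clf = XGBClassifier().fit(treated[features], treated['Outcome'])

# Rank the whole population by predicted outcome under treatment.
scores = clf.predict_proba(df[features])[:, 1]

# A positive cgains curve here suggests the features carry real signal.
upev = UpliftEval(df['Treatment'], df['Outcome'], scores)
upev.plot(plot_type='cgains')
```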

rsyi commented 5 years ago

Ah! There was a bug in the NIV() routine causing the incorrect numbers. Sorry about that. Should be fixed now.

I'll look into the memory issue now.

narmin-a commented 4 years ago

Hi! Great package, thanks for open-sourcing it. The model predicts well for small datasets, but I'm running into a memory error with the up.randomized_search() command for larger datasets (>200K rows, 10 features). Any suggestions on how to solve it, or why it's happening?