microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Make binning optional #876

Closed AbdealiLoKo closed 7 years ago

AbdealiLoKo commented 7 years ago

In some cases, it is critical to use the exact solution of GBM, which comes with theoretical guarantees.

Not allowing the user to decide whether he wants binning or not is a deal breaker, as some folks may not want the quantile-sketch-based binning.

guolinke commented 7 years ago

@AbdealiJK The whole idea of LightGBM is based on binning, so it is not possible without binning. You can use a large max_bin, which approximates the non-binning case.

However, the binning parameter actually doesn't have much effect on accuracy: there is little loss from using binning, and binning brings better generalization power.
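
For what it's worth, here is a minimal sketch of that suggestion using the LightGBM Python API (the dataset and the chosen max_bin value are placeholder assumptions, not recommendations):

```python
import numpy as np
import lightgbm as lgb

# Placeholder data; X and y stand in for whatever dataset is being fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = rng.normal(size=10_000)

# max_bin is a dataset-construction parameter. The default is 255 bins;
# raising it makes the histograms finer and the splits closer to the exact
# (unbinned) solution, at the cost of memory and speed.
train_set = lgb.Dataset(X, label=y, params={"max_bin": 4095})

booster = lgb.train({"objective": "regression"}, train_set, num_boost_round=100)
```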

AbdealiLoKo commented 7 years ago

Is there a limit on the max_bin parameter?

I have not tried this yet, but I am guessing that if I have, let's say, terabytes of RAM and use max_bin=numRecords for some very large data (billions of rows), this should give me exactly the same result as no binning at all?

guolinke commented 7 years ago

@AbdealiJK I think it is better to keep it smaller than 65535 (2^16 - 1). Using numRecords is absolutely not needed.

A large max_bin will be much slower and may hurt the test accuracy.

Laurae2 commented 7 years ago

This reduces the number of hyper parameters that need to be estimated

It shouldn't be a hyperparameter to tune, but a parameter to set depending on the memory pressure your machine can handle. If your model takes 200GB in memory and you have only 150GB of RAM, reducing the number of bins reduces the number of bits required to store each binned value. Using a minimal number of bins will obviously reduce the performance of the model, while using a larger number of bins will converge to exact GBMs' performance.
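
As a hedged back-of-envelope illustration (the byte counts below describe only the bin indices and are an assumption for illustration, not LightGBM's actual internal layout):

```python
# Rough, illustrative estimate: a binned dataset stores one bin index per
# (row, feature) cell, and the width of that index depends on the bin count.
rows, features = 1_000_000_000, 50

for max_bin, bytes_per_cell in [(255, 1), (65_535, 2)]:
    gb = rows * features * bytes_per_cell / 1e9
    print(f"max_bin={max_bin:>6}: ~{gb:.0f} GB just for the bin indices")
# 255 bins fit in a uint8 index (~50 GB here); going past 255 bins roughly
# doubles the footprint because a uint16 index is needed (~100 GB).
```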

This can be used to compare techniques and libraries easily

Binning is not done the same way across libraries; for instance, xgboost's fast histogram and LightGBM's binning have nothing in common other than that they both deliver bins.

In addition, there is no issue comparing (for instance) xgboost exact and xgboost fast histogram / LightGBM: the former is tailored for exactness, while the latter are for speed.

For single machine tasks it could be more accurate than binning

In most cases binning will provide higher performance.

Not allowing the user to decide whether he wants binning or not is a deal breaker, as some folks may not want the quantile-sketch-based binning.

The user can use xgboost exact in this case, at the cost of roughly 10x slower speed (and even more on very large datasets). If the user wants to use LightGBM without binning, that is not what LightGBM is made for.

if I have lets say terabytes of RAM and use a max_bin=numRecords for some data which is very large (billion rows)

If you use an exact method and assuming all your observations have different values for each feature, then with a non-binning method, you are effectively sorting through billions of values (computationally very expensive).
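
To make that concrete, here is a rough NumPy sketch (the array sizes are arbitrary assumptions and the timings are only illustrative) contrasting the per-feature sort an exact method needs with the fixed-width histogram a binned method builds:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=50_000_000)      # one feature column (real data may be billions of rows)
gradients = rng.normal(size=values.size)

# Exact-style split search: sort the feature (O(n log n)) so every distinct
# threshold can be scanned for the best split.
t0 = time.perf_counter()
order = np.argsort(values)
t_sort = time.perf_counter() - t0

# Histogram-style split search: map values to 255 bins (in LightGBM the bin
# edges are built once per training run) and accumulate gradient sums per bin
# in O(n); only ~255 candidate thresholds are then scanned.
t0 = time.perf_counter()
edges = np.quantile(values, np.linspace(0.0, 1.0, 256)[1:-1])
bins = np.searchsorted(edges, values)
histogram = np.bincount(bins, weights=gradients, minlength=255)
t_hist = time.perf_counter() - t0

print(f"sort: {t_sort:.1f}s  histogram: {t_hist:.1f}s")
```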

AbdealiLoKo commented 6 years ago

Bringing this thread back up, as I was trying out lightgbm again after a while.

In my case, I have a relatively small dataset (500k rows or so) and am looking to build a greedy-exact GBM from it, as that is the method the current project I am working on uses. In this project, it is difficult to explain the idea of binning and its pros and cons, as greedy-exact makes intuitive sense to the team using the model.

I hope to use binning and the other features of LightGBM at a later point, hence I was hoping to build the ML stack with LightGBM.

In such a set-up, is your recommendation to not use lightgbm? (Personally I'd prefer using the same stack. Would you guys be open to me contributing a greedy_tree_learner.cpp?)

Laurae2 commented 6 years ago

@AbdealiJK Use xgboost exact instead of LightGBM for this; even though it grows trees depth-wise, it should regularize more than the loss-guided (leaf-wise) LightGBM. Are there real cases where you need more than 2^16 (approx. 65K) unique values per feature? (Other than physics, chemistry, etc., where exact methods can be very useful.)

In my case, I have a relatively small dataset (500k rows or so) and am looking to build a greedy-exact GBM from it, as that is the method the current project I am working on uses. In this project, it is difficult to explain the idea of binning and its pros and cons, as greedy-exact makes intuitive sense to the team using the model.

I think if you use a gigantic max_bin then you should be able to reproduce the exact GBM.
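
For example, a hedged sketch of that comparison (the data, rounds, and hyperparameters are placeholders; xgboost's tree_method="exact" is its unbinned greedy split finder):

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500_000, 20))
y = 2.0 * X[:, 0] + rng.normal(size=500_000)

# Exact greedy splits: xgboost enumerates every distinct threshold.
exact_model = xgb.train(
    {"tree_method": "exact", "max_depth": 6, "eta": 0.1},
    xgb.DMatrix(X, label=y),
    num_boost_round=100,
)

# LightGBM with a very large max_bin (kept under the 2^16 - 1 ceiling mentioned
# above); it should converge toward the exact solution as max_bin approaches the
# number of unique values per feature.
binned_model = lgb.train(
    {"objective": "regression", "learning_rate": 0.1},
    lgb.Dataset(X, label=y, params={"max_bin": 32_767}),
    num_boost_round=100,
)
```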

@guolinke Is it possible to output the number of bins generated for each feature before training starts? (With a separate parameter; it may help in troubleshooting why there may be features with fewer bins than the number of unique values.)

Something like:

this_feature1: number_of_unique_values1 uniques (number_of_bins_generated1 bins)
this_feature2: number_of_unique_values2 uniques (number_of_bins_generated2 bins)
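
Until something like that exists, a rough workaround (not a LightGBM API, just a hypothetical pre-training check written with NumPy) is to count the unique values per feature yourself and compare them against the configured max_bin:

```python
import numpy as np

def report_unique_counts(X, max_bin=255, feature_names=None):
    """Print per-feature unique-value counts; features with more uniques than
    max_bin are the ones whose values will end up sharing bins."""
    X = np.asarray(X)
    for j in range(X.shape[1]):
        name = feature_names[j] if feature_names else f"feature_{j}"
        n_unique = np.unique(X[:, j]).size
        print(f"{name}: {n_unique} uniques (at most {min(n_unique, max_bin)} bins)")

# report_unique_counts(X, max_bin=255)
```
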
AbdealiLoKo commented 6 years ago

I'm actually not sure if there is a real world case where greedy is needed. Just a constraint I've been given that I'm trying to work with =D