AbdealiLoKo closed this issue 7 years ago
@AbdealiJK
The whole idea of LightGBM is based on binning, so it is not possible without binning.
You can use a large max_bin, which approximates not binning at all.
However, the binning parameter actually doesn't have much effect on accuracy: there is little accuracy loss from binning, and binning brings better generalization power.
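To make the idea concrete, here is a minimal sketch of quantile-style feature binning with NumPy. This is an illustration of the general histogram-GBM idea only; LightGBM's actual bin construction is more involved and handled internally.

```python
import numpy as np

def quantile_bin(values, max_bin=255):
    """Bin a continuous feature into at most `max_bin` buckets using
    quantile boundaries (a simplified sketch of what histogram-based
    GBMs do; not LightGBM's actual algorithm)."""
    # Internal bin edges at evenly spaced quantiles of the observed values.
    edges = np.quantile(values, np.linspace(0, 1, max_bin + 1)[1:-1])
    edges = np.unique(edges)  # collapse duplicate edges
    # Each value is replaced by the id of the bucket it falls into.
    return np.searchsorted(edges, values, side="right")

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
bins = quantile_bin(x, max_bin=255)
print(bins.min(), bins.max())  # bin ids span at most 255 buckets
```

After this pass, split finding only has to scan bucket ids instead of raw float values, which is where the speed and generalization effects discussed above come from.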
Is there a limit on the max_bin parameter?
I have not tried this yet, but I am guessing that if I have, let's say, terabytes of RAM and use max_bin=numRecords for some data which is very large (a billion rows), this should give me exactly the same result as no binning at all?
@AbdealiJK
I think it is better to keep it smaller than 65535 (2^16 - 1).
Using max_bin=numRecords is absolutely not needed.
A large max_bin will be much slower, and may hurt the test accuracy.
This reduces the number of hyper parameters that need to be estimated
It shouldn't be a hyperparameter to tune, but a parameter to define depending on the memory pressure your computer can handle. If your model takes 200GB in memory and you have only 150GB RAM, reducing the number of bins (squared) will reduce the number of bits required. Using a minimal amount of bins will obviously reduce the performance of the model, while using a larger amount of bins will automatically converge to exact GBMs' performance.
This can be used to compare techniques and libraries easily
Binning is not made the same way depending on libraries, for instance xgboost fast histogram and LightGBM binnings have nothing in common other than they deliver bins.
In addition, there is no issue comparing (for instance) xgboost exact and xgboost fast histogram / LightGBM: the former is tailored for exactness, while the latter are for speed.
For single machine tasks it could be more accurate than binning
In most cases binning will provide higher performance.
Not allowing the user to decide whether he wants binning or not is a deal breaker as some folks may not want the quantile sketch based binning
The user can use xgboost exact in this case, at the cost of 10x slower speed (and even more, on very large datasets). If the user wants to use LightGBM without binning, this is not what LightGBM is made for.
if I have, let's say, terabytes of RAM and use max_bin=numRecords for some data which is very large (a billion rows)
If you use an exact method and assuming all your observations have different values for each feature, then with a non-binning method, you are effectively sorting through billions of values (computationally very expensive).
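To illustrate what "sorting through billions of values" means: exact greedy split finding sorts every feature's values and scans every candidate threshold, O(n log n) per feature per node, whereas a histogram learner scans only max_bin buckets after a single binning pass. The following is a sketch of the exact side for squared loss, not LightGBM's implementation.

```python
import numpy as np

def exact_best_split(x, y):
    """Exact greedy split for squared loss: sort all n values and scan
    every threshold (O(n log n) per feature per node). A histogram
    learner would instead scan only max_bin buckets.
    (A sketch, not any library's actual implementation.)"""
    order = np.argsort(x)                 # the expensive sort step
    ys = y[order]
    prefix = np.cumsum(ys)[:-1]           # left-child label sums
    total = ys.sum()
    n_left = np.arange(1, len(x))
    n_right = len(x) - n_left
    # Variance-reduction score: sum of squared child sums over counts.
    score = prefix**2 / n_left + (total - prefix)**2 / n_right
    i = int(np.argmax(score))
    xs = x[order]
    return (xs[i] + xs[i + 1]) / 2        # midpoint threshold

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = (x > 0.3).astype(float)               # true cut at 0.3
threshold = exact_best_split(x, y)
print(round(threshold, 3))                # recovers a threshold near 0.3
```

With a billion rows, that argsort alone dominates the runtime, which is the cost the exact methods pay for exactness.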
Bringing this thread back up as I was trying out lightgbm after a while again.
In my case, I have a relatively small dataset (500k rows or so) and am looking to create a greedy-exact GBM out of it, as that is the method the current project I am working on uses. In this project, it is difficult to explain the idea of binning and its pros and cons, as greedy-exact makes intuitive sense to the team using the model.
I hope to use binning and other features of LightGBM at a later point, hence I was hoping to build the ML stack with LightGBM.
In such a set-up, is your recommendation to not use LightGBM?
(Personally I'd prefer using the same stack. Would you guys be open to me contributing a greedy_tree_learner.cpp?)
@AbdealiJK Use xgboost exact for this; even though it's depth-wise, it should regularize more than the loss-guided LightGBM. Are there real cases where you need more than 2^16 (approx. 65K) unique values per feature? (other than physics, chemistry, etc., where exact can be very useful)
In my case, I have a relatively small dataset (500k rows or so) and am looking to create a greedy-exact GBM out of it, as that is the method the current project I am working on uses. In this project, it is difficult to explain the idea of binning and its pros and cons, as greedy-exact makes intuitive sense to the team using the model.
I think if you use a gigantic max_bin then you should be able to reproduce the exact GBM.
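The intuition behind this can be shown with a toy binning function: once max_bin is at least the number of unique values, every distinct value gets its own bin, so no information is lost and the split search sees exactly the raw data. This helper is hypothetical, not LightGBM's bin mapper.

```python
import numpy as np

def bin_feature(values, max_bin):
    """Map each value to a bin id. When max_bin >= the number of unique
    values, every distinct value gets its own bin, so binning is
    lossless. (A toy sketch, not LightGBM's actual bin mapper.)"""
    uniq = np.unique(values)
    if len(uniq) <= max_bin:
        # One bin per distinct value: the binned data is an exact
        # rank-encoding of the raw data.
        return np.searchsorted(uniq, values)
    # Otherwise fall back to lossy quantile buckets.
    edges = np.quantile(values, np.linspace(0, 1, max_bin + 1)[1:-1])
    return np.searchsorted(np.unique(edges), values, side="right")

x = np.array([3.2, 1.1, 4.8, 1.1, 3.2, 9.0])
bins = bin_feature(x, max_bin=65535)
# Equal values share a bin; distinct values never collide.
print(bins.tolist())  # [1, 0, 2, 0, 1, 3]
```

Since any order-preserving encoding of a feature yields the same candidate splits, the lossless case should reproduce the exact GBM's trees, at a memory and speed cost.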
@guolinke Is it possible to output the number of bins generated for each feature before training starts? (with a separate parameter, it may help troubleshooting why there may be features with less bins than the number of unique values)
Something like:
this_feature1: number_of_unique_values1 uniques (number_of_bins_generated1 bins)
this_feature2: number_of_unique_values2 uniques (number_of_bins_generated2 bins)
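A diagnostic like that could be prototyped outside the library as well. The sketch below uses simple quantile binning to count bins per feature; it is a hypothetical helper, not a real LightGBM API.

```python
import numpy as np

def bin_report(X, max_bin=255):
    """Per-feature (uniques, bins) counts using naive quantile binning.
    A hypothetical diagnostic helper, not a real LightGBM API; fewer
    bins than uniques means quantile edges merged some values."""
    out = []
    for col in X.T:
        uniq = len(np.unique(col))
        if uniq <= max_bin:
            out.append((uniq, uniq))  # one bin per distinct value
        else:
            edges = np.unique(
                np.quantile(col, np.linspace(0, 1, max_bin + 1)[1:-1]))
            out.append((uniq, len(edges) + 1))
    return out

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=1000),          # continuous feature
                     rng.integers(0, 10, size=1000)])  # low-cardinality feature
for j, (u, b) in enumerate(bin_report(X, max_bin=255)):
    print(f"feature{j}: {u} uniques ({b} bins)")
```

A feature reported with far fewer bins than uniques is exactly the troubleshooting case raised above.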
I'm actually not sure if there is a real world case where greedy is needed. Just a constraint I've been given that I'm trying to work with =D
Not allowing the user to decide whether he wants binning or not is a deal breaker as some folks may not want the quantile sketch based binning
In some cases, it is very critical to use the exact solution of GBM, which is theoretically guaranteed.