microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Histogram split algorithm boundaries #4086

Open aceson28 opened 3 years ago

aceson28 commented 3 years ago

I'm using Python and trying to understand LightGBM. What I understand so far is that LightGBM uses a histogram-based split algorithm to make training lighter. From the paper I read, the histogram boundaries are used as split candidates, and the candidate that gives the maximum gain is used as the tree split.

Some articles say the boundaries are obtained by dividing the range of a feature by the number of bins, the same way we make an ordinary histogram. Other articles, especially about XGBoost (weighted quantile sketch), say every bin must have the same sum of hessians; because the hessian for squared-error regression is 1, with 4 bins we would use the feature quantiles (0.25, 0.5 and 0.75) as boundaries.

But when I compute that manually and compare it with the tree plot, why is it different? My question is: how are the histogram boundaries calculated? What method does LightGBM use to find them?
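
For example, this is roughly what I tried manually (the data frame and column name here are just placeholders):

```python
import pandas as pd

# Placeholder data; "feature" is a hypothetical column name.
datatrain = pd.DataFrame({"feature": range(1, 101)})

# With 4 bins, the quantile-style split candidates would be the 0.25,
# 0.5 and 0.75 feature quantiles, as the weighted-quantile articles say.
print(datatrain["feature"].quantile([0.25, 0.5, 0.75]))
```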

jameslamb commented 3 years ago

Hi @aceson28 , thanks for using LightGBM!

If you're comfortable reading C++ code, you could look at the ConstructHistogram* methods and at the places where features are binned.

Binning continuous features, bundling sparse features, and filtering out unsplittable features are all done during construction of the Dataset object.

You can also find details on this in the original LightGBM paper, https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
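
As a minimal sketch from the Python side (the data here is synthetic, just to show where the binning parameters enter):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 5)
y = np.random.rand(1000)

# Binning is controlled by Dataset parameters and happens at construction.
params = {
    "max_bin": 255,                      # maximum number of bins per feature
    "min_data_in_bin": 3,                # minimum samples per bin
    "bin_construct_sample_cnt": 200000,  # rows sampled to choose boundaries
}
dtrain = lgb.Dataset(X, label=y, params=params)
dtrain.construct()  # bin boundaries are fixed from this point on
```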

aceson28 commented 3 years ago

@jameslamb I'm sorry, I still don't understand. From Algorithm 1 in that paper I get H[bin].y ← H[bin].y + I.y[j] (sum of gradients) and H[bin].n ← H[bin].n + 1 (sum of hessians).

But I still don't get this part (bin ← I.f[k][j].bin): where does that .bin come from? The lines before it are just (H ← new Histogram()) and "Build histogram". I still don't see where they get the histogram boundaries.

As for the link you gave, I'm still reading and trying to understand it, because I'm still a newbie in C++. By the way, by "boundaries" I mean the value between two bins; I'm saying that because I'm afraid there might be a mistake in my question.
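
To show where I am stuck with Algorithm 1, here is how I read it, assuming the bin indices come precomputed from somewhere (all names here are my own for illustration, not LightGBM internals):

```python
import numpy as np

# Assume binning already happened: bin_index[j] is the precomputed bin of
# data point j for one feature (this is the I.f[k][j].bin in Algorithm 1).
bin_index = np.array([0, 2, 1, 2, 3, 0, 1, 1])
gradients = np.random.rand(8)  # I.y[j] in the paper's notation
num_bins = 4

hist_grad = np.zeros(num_bins)   # H[bin].y: sum of gradients per bin
hist_count = np.zeros(num_bins)  # H[bin].n: count (sum of hessians)

for j in range(len(bin_index)):
    b = bin_index[j]  # bin <- I.f[k][j].bin
    hist_grad[b] += gradients[j]
    hist_count[b] += 1
```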

aceson28 commented 3 years ago

@jameslamb From subsubsection 5.1.1 of https://everdark.github.io/k9/notebooks/ml/gradient_boosting/gbt.nb.html#511_greedy_equal-density_binning: suppose I have a dataset; it will be sampled down to 200,000 rows (because bin_construct_sample_cnt=200000), and if max_bin=4 then every bin will hold 50,000 data points to keep the density equal, so it's like the quantiles (0.25, 0.5 and 0.75)? First of all, is that true (does LightGBM use an equal-bin-density strategy on its data sample)? If so, what happens if the dataset has fewer than 200,000 rows? Is the sample overlapping? And if LightGBM uses an equal-bin-density strategy, why is min_data_in_bin still needed?

My second understanding is that quantiles can produce the same value for several boundaries, and min_data_in_bin is used to keep boundaries from overlapping? If that's true, then when there is a huge amount of identical data values, does it go into the next bin or the bin before?

jameslamb commented 3 years ago

I think @shiyu1994 or @btrotta might give you a better answer than I can; I don't want to say something incorrect.

aceson28 commented 3 years ago

@jameslamb @shiyu1994 @btrotta Hi. In histogram construction there is a sampling step (bin_construct_sample_cnt), and in GOSS there is sampling too (b, i.e. other_rate: "Then it randomly samples b×100% instances from the rest of the data", subsection 3.1). Is that the same sampling, or is it different? If it's different, how many times does LightGBM sample? With so much sampling, won't the data end up far from the actual situation?

shiyu1994 commented 3 years ago

@aceson28 Thanks for using LightGBM.

The two samplings are different.

The histogram bin boundaries are determined before training the trees, and the boundaries stay fixed during the whole boosting process. First, LightGBM samples bin_construct_sample_cnt data points from the training dataset (call this sample set SBin). Then the boundaries are chosen so that the data in SBin fall as evenly as possible into the different bins of each feature. That means, when choosing the bin boundaries of a feature, we try to make each bin contain almost the same number of data points from the SBin set.
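
As a simplified illustration of that equal-density idea (this is not the exact GreedyFindBin routine in src/io/bin.cpp, which also handles repeated values and min_data_in_bin):

```python
import numpy as np

def equal_density_boundaries(values, max_bin):
    """Place boundaries so each bin receives roughly the same number of
    samples. Simplified: ignores repeated values and min_data_in_bin."""
    values = np.sort(values)
    n = len(values)
    boundaries = []
    for i in range(1, max_bin):
        # Put each boundary between two neighboring sample values so that
        # about n / max_bin samples fall into every bin.
        idx = i * n // max_bin
        boundaries.append((values[idx - 1] + values[idx]) / 2.0)
    return boundaries

sbin = np.random.rand(200000)  # plays the role of the SBin sample
print(equal_density_boundaries(sbin, max_bin=4))
```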

GOSS sampling is a totally different procedure. This sampling is performed before every boosting iteration, and the sampling strategy is described in the paper https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf. Briefly, LightGBM first keeps the data points with large gradients, and then randomly samples from the rest of the data. The gradient values of the randomly sampled data are multiplied by a scaling factor so that the gradient sum after GOSS sampling is an unbiased estimate of the original sum.
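
A minimal sketch of that GOSS step as the paper describes it (a and b correspond to the top_rate and other_rate parameters; this is an illustration, not the internal implementation):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1):
    """Keep the top a*100% of points by |gradient|, randomly sample b*100%
    of the rest, and scale the sampled gradients by (1 - a) / b."""
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))
    top_k, rand_k = int(a * n), int(b * n)

    top_idx = order[:top_k]
    sampled_idx = np.random.choice(order[top_k:], size=rand_k, replace=False)

    used_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(used_idx))
    weights[top_k:] = (1.0 - a) / b  # keeps the gradient sum unbiased
    return used_idx, weights

grads = np.random.randn(1000)
idx, w = goss_sample(grads)
```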

Does that answer your question? If you have any further questions, please feel free to post them here.

aceson28 commented 3 years ago

@shiyu1994 Thanks for answering, but I still don't understand. When I build a model, in the params I set bin_construct_sample_cnt=200000 (the default) and max_bin=5. When I do it manually with this code: pandas.qcut(datatrain['numerical.feature.name'].sample(n=200000, replace=True), q=5, duplicates='drop').dtypes, why are the boundaries different? If the idea is "SBin falls as evenly as possible into different bins of each feature", I think the code I run manually does the same thing.

Correct me if I'm wrong: we have a dataset and sample it down to SBin, and with SBin we make the histogram boundaries (these boundaries are used for the whole process; I mean they are determined just once). After we get the boundaries, we calculate the pseudo-residuals (negative gradients) on the dataset and then sort by pseudo-residual. After that we sample with GOSS (keeping the a×100% of data with the top gradients plus b×100% of the rest), and then with that sample and the histogram boundaries we build a tree? So the GOSS sampling is performed at every iteration? (By the way, I use the single-machine tree learner.) Is that right? Or do we calculate the pseudo-residuals with the SBin data?
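
As a commented sketch of the flow I have in mind (all names are placeholders, just to pin down the order of operations):

```python
# Done ONCE, before boosting starts:
#   sbin = sample(train_data, bin_construct_sample_cnt)
#   boundaries = equal_density_boundaries(sbin, max_bin)
#   binned = map_to_bins(train_data, boundaries)  # fixed for all iterations
#
# Every boosting iteration:
#   grad = negative_gradient(y, prediction)       # on the FULL training data
#   idx, w = goss_sample(grad, top_rate, other_rate)
#   tree = build_tree_from_histograms(binned[idx], grad[idx] * w)
#   prediction += learning_rate * tree.predict(train_data)
```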

shiyu1994 commented 3 years ago

@aceson28 GOSS samples from the whole training data, and SBin has no effect on GOSS sampling. These are two different processes that do not interfere with each other. The SBin sampling determines only the bin boundaries, that is, how to map a floating-point feature value to an integer bin value. And these bin values stay fixed during training.

As for the difference compared with qcut: how large is the difference? I think a small difference is expected. Although we both follow the evenly-distributed principle, that doesn't mean we use exactly the same algorithm with no difference in details. Also, the results of sampling inside LightGBM and sampling with pandas can differ, since we use different random generators.

aceson28 commented 3 years ago

@shiyu1994 [screenshot: tree plot showing the split values]

No, it's very different. In the model the first split point is 365.5, and the splits at depths 3 and 4 are at 1095.5, but qcut gives [(29.999, 365.0], (365.0, 452.0], (452.0, 3653.0]], only 3 bins, because there are duplicates. (By the way, my mistake earlier: it's 4 bins, not 5.) Even when I compute the qcut boundaries for the data reaching the depth-3 node, I get:
pd.qcut(datatrain[(datatrain['Periode.Hari.'] > 365.5) & (datatrain['Umur'] > 4.5)]['Periode.Hari.'], q=4, duplicates='drop').dtypes
which gives [(365.999, 441.0], (441.0, 542.0], (542.0, 3653.0]].

Apart from qcut, is there any way to see the histogram boundaries exactly as LightGBM computes them? What I mean is the original boundaries (every possible split value), not just the best thresholds used in the tree (plot_split_value_histogram).