microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License
16.71k stars 3.83k forks source link

Question: How does lightGBM create histogram? #5943

Open weipenegHU opened 1 year ago

weipenegHU commented 1 year ago

Hi, I am recently studying the principles behind lightGBM. I could not understand the pseudo code about creating histogram on the paper. I will be glad that if somebody can explain it. Is it based on the frequency, which is to say that the amount of samples in each bin is the same, or is it based on the child weight like what XGB did ?

jrvmalik commented 1 year ago

You mean this? image Features are discretized into bins, instead of having an array of feature values, you have an array of bin addresses.
There are two histograms being constructed whenever you are adding a node to a tree: gradient histograms and hessian histograms.

import numpy as np

feature_bins = np.array([0, 1,2,3,4, 4,3,2,1, 0], dtype=np.uint8)
gradients = np.random.randn(*feature_bins.shape)  # gradients, hessians = fobj(predictions, targets)
hessians = np.random.randn(*feature_bins.shape)  # these are given to you by the objective function

gradient_histogram = np.zeros(5)
hessian_histogram = np.zeros(5)

for idx, feature_bin in enumerate(feature_bins):
    gradient_histogram[feature_bin] += gradients[idx]
    hessian_histogram[feature_bin] += hessians[idx]  # for some objectives, the hessian is always 1
weipenegHU commented 1 year ago

Thanks for the response @jrvmalik . I wonder, in your explaination, do you assume that the number of values in each bin are the same, which means the bins are created based on frequency rather than sample weights which are defined by the hessian in XGBoost

jrvmalik commented 1 year ago

No, LightGBM doesn't use either the Hessian or the sample weights to determine the bin assignments. The bin assignments are fixed before training