microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[RFC] [doc] Add long-form documentation on sampling in LightGBM #5070

Open jameslamb opened 2 years ago

jameslamb commented 2 years ago

Summary

There are several points in the process of training a LightGBM model where less than the full training data is used.

I think it would be valuable to add a section called "Sampling" or similar at https://lightgbm.readthedocs.io/en/latest/Features.html, describing these concepts.

Motivation

There are many parameters available to control the different types of sampling, and the interactions between them are too complex to express clearly in any individual parameter's docs at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I believe such documentation would significantly improve users' understanding of how LightGBM works, and help them to make informed decisions about values for LightGBM's parameters.

Description

My idea for this is several paragraphs like the following, mixing an explanation of LightGBM's processes with the names of the specific parameters that can be used to control them.

LightGBM does not perform boosting directly on the raw values in input data. Instead, it performs some pre-processing such as binning continuous features into histograms, bundling sparse features together, and performing target encoding on categorical features.

This pre-processing creates an object called a Dataset. To improve the speed of Dataset construction, LightGBM samples the input data to determine characteristics like histogram bin boundaries. Use the parameter bin_construct_sample_cnt (default=200000) to control how many observations are sampled during this process, and data_random_seed to make the process reproducible.
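To make the idea concrete, here is a minimal standard-library sketch of choosing bin boundaries from a random sample of rows instead of the full column. It is an illustration under stated assumptions, not LightGBM's actual binning algorithm: the function name `sample_bin_boundaries` is hypothetical, and the equal-frequency rule is a stand-in for whatever LightGBM does internally. The `sample_cnt` and `seed` arguments only mirror the roles of bin_construct_sample_cnt and data_random_seed.

```python
import random


def sample_bin_boundaries(values, max_bin=255, sample_cnt=200_000, seed=1):
    """Estimate bin boundaries for one feature from a random sample of rows.

    Hypothetical helper for illustration only. Equal-frequency binning is an
    assumption here, not LightGBM's real bin-finding logic.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible

    # Only sample when the column is larger than the sampling budget.
    if len(values) > sample_cnt:
        sample = rng.sample(values, sample_cnt)
    else:
        sample = list(values)
    sample.sort()

    # Never ask for more bins than there are distinct sampled values.
    n_bins = min(max_bin, len(set(sample)))

    # Place a boundary at each equal-frequency cut point of the sorted sample.
    boundaries = [sample[i * len(sample) // n_bins] for i in range(1, n_bins)]
    return sorted(set(boundaries))


if __name__ == "__main__":
    data_rng = random.Random(0)
    column = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

    first = sample_bin_boundaries(column, max_bin=16, sample_cnt=1_000, seed=7)
    second = sample_bin_boundaries(column, max_bin=16, sample_cnt=1_000, seed=7)
    print(len(first), first == second)
```

In the LightGBM Python package, the corresponding settings can be passed through the `params` argument of `lgb.Dataset`, for example `params={"bin_construct_sample_cnt": 50000, "data_random_seed": 42}`. A smaller sample makes Dataset construction faster, at the cost of boundaries estimated from less data.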

Other themes that I think should be covered

References

Created this based on the discussion in #4827.

shiyu1994 commented 2 years ago

@jameslamb Thanks for writing this up. I think after we merge #5091, we can add the section on sampling algorithms once and for all. WDYT?