microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Question about the speed of pre-binning algorithms #6084

Open · Ye980226 opened 11 months ago

Ye980226 commented 11 months ago

Hello! I'm facing a situation where I have 20 million rows of data and 30,000 features participating in training, but I am training only 50 trees. I've noticed that training is very slow because the pre-binning step runs on a single thread. I may not fully understand the binning algorithm. Is it possible to run it on multiple threads, or can I obtain an approximate binning result externally and then pass it to the training process? I don't need an extremely precise result; an approximate one would suffice.
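Concretely, what I had in mind for "passing binning in externally" is something like the following: find bin boundaries on a small subsample and share them with the full data through the Dataset reference mechanism. This is only a sketch (shapes and the 63-bin choice are illustrative), and whether it actually skips the expensive boundary search on the full data is exactly what I'm unsure about:

import numpy as np
import lightgbm

X = np.random.randn(1_000_000, 100)  # illustrative shapes, not my real data
y = np.random.randn(1_000_000)

# Find bin boundaries on a small random subsample only.
idx = np.random.choice(X.shape[0], 100_000, replace=False)
ref = lightgbm.Dataset(X[idx], y[idx], params={"max_bin": 63})
ref.construct()  # force binning of the subsample now

# Reuse the subsample's bin mappers for the full data via `reference`.
full = lightgbm.Dataset(X, y, reference=ref)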

replacementAI commented 11 months ago

You can decrease the max_bin parameter (https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_bin); it defaults to 255 bins, and lower is faster. You can also enable gradient quantization via use_quantized_grad (https://lightgbm.readthedocs.io/en/latest/Parameters.html#use_quantized_grad); the number of gradient quantization bins defaults to 4, and lower is faster.
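For reference, a minimal sketch of what those settings could look like in the params dict (the values are illustrative; the parameter names are from the linked docs):

params = {
    "max_bin": 63,               # fewer bins than the 255 default -> faster binning
    "use_quantized_grad": True,  # enable gradient quantization
    "num_grad_quant_bins": 4,    # the "defaults to 4" knob; lower can be faster
}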

Ye980226 commented 11 months ago

> You can decrease the max_bin parameter (https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_bin); it defaults to 255 bins, and lower is faster. You can also enable gradient quantization via use_quantized_grad (https://lightgbm.readthedocs.io/en/latest/Parameters.html#use_quantized_grad); the number of gradient quantization bins defaults to 4, and lower is faster.

Thank you for your response. I appreciate the suggestions and am currently trying them out. However, I haven't obtained any results yet, and the run time already exceeds my desired timeframe. Since I may need to run similar datasets multiple times, I was wondering whether any multi-threading solutions are available to speed up the process.

shiyu1994 commented 11 months ago

@Ye980226 Thanks for using LightGBM. Do you mean the data preprocessing (which bins the raw data into discretized integers) is slow? This process should run in multiple threads by default. How is num_threads set in your program?
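Worth noting: binning happens when the Dataset is constructed, so num_threads can also be passed to the Dataset itself rather than only to train(). A sketch (parameter names from the docs, values illustrative):

train_dataset = lightgbm.Dataset(
    train_x, train_y,
    params={"num_threads": 30, "max_bin": 8},
)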

Ye980226 commented 11 months ago

@shiyu1994 Thank you for your response, and I apologize for any confusion; I might not be very familiar with LightGBM's entire workflow.

params = {
    "boosting_type": "goss",
    "learning_rate": 0.2,
    "max_depth": 5,
    "num_leaves": 20,
    "min_data_in_leaf": 1000,
    "metric": "rmse",
    "objective": args.objective,
    "force_col_wise": True,
    "max_bin": 8,
    "num_threads": 30,
}
model = lightgbm.train(params, train_dataset, 50, [train_dataset])

Here is my parameter list and how I use it.

[LightGBM] [Info] Total Bins 23071
[LightGBM] [Info] Number of data points in the train set: 516301, number of used features: 2890

When I mentioned single-threaded, I meant the phase before these two lines are printed. After they appear, the training process runs multi-threaded, so I suspect the binning algorithm is the slow part.
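One way to confirm where the time goes: Dataset construction is lazy and normally happens inside the first lightgbm.train() call, so forcing it explicitly lets the binning step be timed separately from training. A sketch using Dataset.construct():

import time

t0 = time.time()
train_dataset.construct()  # triggers the binning step explicitly
print(f"binning took {time.time() - t0:.1f}s")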

import numpy as np
import lightgbm

# Note: this allocates a 10,000,000 x 10,000 float64 matrix (roughly 800 GB);
# shrink the shapes to reproduce on a smaller machine.
train_x = np.random.randn(10000000, 10000)
train_y = np.random.randn(10000000)
train_dataset = lightgbm.Dataset(train_x, train_y)
params = {
    "boosting_type": "goss",
    "learning_rate": 0.2,
    "max_depth": 5,
    "num_leaves": 20,
    "min_data_in_leaf": 1000,
    "metric": "rmse",
    "objective": "mse",
    "force_col_wise": True,
    "max_bin": 8,
    "num_threads": 30,
}
model = lightgbm.train(params, train_dataset, 50, [train_dataset])

Perhaps you can run this piece of code. A significant amount of time is spent in single-threaded execution before multi-threading kicks in, and I'm not sure which part of the LightGBM workflow runs on a single thread.
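Since approximate boundaries would suffice, as stated above, one knob that may be worth trying is bin_construct_sample_cnt (the number of rows sampled to find bin boundaries, 200000 by default per the docs). A hedged sketch reusing the reproducer's variables; the 50000 value is illustrative:

train_dataset = lightgbm.Dataset(
    train_x, train_y,
    params={"bin_construct_sample_cnt": 50000, "max_bin": 8, "num_threads": 30},
)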