microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

No Improvement in Training Time with more Cores on LightGBM #6730

Open · abhishekagrawala opened this issue 4 days ago

abhishekagrawala commented 4 days ago

Description

Training on a ~6GB dataset with LightGBM and n_jobs=70 does not yield a proportional reduction in training time. Despite running on a 72-core machine with a high n_jobs value, the training time is barely lower than with fewer threads.

Environment

OS: Linux 6.1.0-27-cloud-amd64 Debian
CPU:
  Architecture:             x86_64
  CPU(s):                   72  
    - Model Name:           Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz  
    - Cores:                72 (1 thread per core)  
    - Flags:                AVX, AVX2, AVX512, FMA, etc.  
  Cache:                    288 MB L2 Cache, 16 MB L3 Cache
  NUMA Node(s):             1  
Memory:
                    total        used        free      shared  buff/cache   available  
      Mem:           491Gi        81Gi       399Gi       1.1Mi        15Gi       410Gi  
      Swap:           79Gi        84Mi        79Gi  
Storage:
  Filesystem      Size  Used Avail Use% Mounted on  
  udev            246G     0  246G   0% /dev  
  tmpfs            50G  1.8M   50G   1% /run  
  /dev/sda1       197G  104G   86G  55% /  
  tmpfs           246G     0  246G   0% /dev/shm  
  tmpfs           5.0M     0  5.0M   0% /run/lock  
  /dev/sda15      124M   12M  113M  10% /boot/efi  
  tmpfs            50G     0   50G   0% /run/user/10476  
  tmpfs            50G     0   50G   0% /run/user/90289  
  tmpfs            50G     0   50G   0% /run/user/1003  
VM Type: Custom VM in a cloud environment.

LightGBM Setup

  Version: 3.2.1 (conda build py38h709712a_0)
  Parameters: n_estimators=325, num_leaves=512, colsample_bytree=0.2, min_data_in_leaf=80, max_depth=22, learning_rate=0.09, objective="binary", n_jobs=70, boost_from_average=True, max_bin=200, bagging_fraction=0.999, lambda_l1=0.29, lambda_l2=0.165 (a reproduction sketch is included below)
Dataset:
  Size: ~6GB
  Characteristics: Binary classification problem, categorical and numerical features, preprocessed and balanced.
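
For concreteness, here is a minimal reproduction sketch of how the configuration above maps onto the scikit-learn API. The synthetic X / y are placeholders standing in for the real ~6GB dataset and are not part of the original setup.

```python
# Minimal reproduction sketch -- synthetic X / y below are placeholders for
# the real ~6GB dataset, not the actual data from this report.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(42)
X = rng.random((1_000_000, 100))          # stand-in feature matrix
y = rng.integers(0, 2, size=1_000_000)    # stand-in binary labels

model = LGBMClassifier(
    n_estimators=325,
    num_leaves=512,
    colsample_bytree=0.2,
    min_data_in_leaf=80,
    max_depth=22,
    learning_rate=0.09,
    objective="binary",
    n_jobs=70,
    boost_from_average=True,
    max_bin=200,
    bagging_fraction=0.999,
    lambda_l1=0.29,
    lambda_l2=0.165,
)
model.fit(X, y)
```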
Performance Issues
Current Performance:
    Training time with n_jobs=32: ~25 minutes
    Training time with n_jobs=70: ~23 minutes
Expected Performance:
    Substantial reduction in training time when utilizing 70 cores, ideally below 10 minutes.
Bottleneck Symptoms:
    Minimal reduction in training time with increased cores (n_jobs); a rough scaling estimate based on the two timings above is sketched after this list.
    CPU utilization remains low, with individual threads not fully utilized.
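
As a rough illustration (not part of the original report), fitting the two reported timings to a simple Amdahl-style model T(n) = serial + parallel_work / n suggests that most of each run is spent in work that does not scale with thread count:

```python
# Rough Amdahl-style estimate from the two reported timings (illustrative only).
# Model: T(n) = serial + parallel_work / n, fitted to T(32)=25 min and T(70)=23 min.
t32, t70 = 25.0, 23.0   # minutes, from the measurements above
n1, n2 = 32, 70

parallel_work = (t32 - t70) / (1 / n1 - 1 / n2)   # single-thread minutes of scalable work
serial = t32 - parallel_work / n1                  # minutes that do not scale with threads

print(f"scalable work ~ {parallel_work:.0f} thread-minutes")
print(f"non-scaling component ~ {serial:.1f} minutes per run")
# With these two data points the non-scaling component dominates (~21 min),
# which is consistent with extra threads barely changing total time.
```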
System Metrics During Training
   CPU Utilization:
      Average utilization: ~40%  
      Peak utilization: ~55%  
      Core-specific activity: Most cores show low activity levels (<30%)  
   Memory Usage:
      Utilized during training: ~81Gi  
      Free memory: ~399Gi  
      Swap usage: ~84Mi  
  Disk I/O:
      Read: ~50MB/s  
      Write: ~30MB/s  
      I/O wait time: ~2%

Request for Support

1. Explanation of why n_jobs scaling is not improving training time.
2. Suggestions for configurations to fully utilize 70 cores for LightGBM training.
3. Recommendations for debugging and monitoring specific to LightGBM threading or system-level bottlenecks.
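
One way to narrow this down (a suggestion, not something from the report) is to time the core training call at several num_threads values on a fixed sample of the data, so that data loading and preprocessing are excluded from the measurement. A sketch under those assumptions follows; the file path and label column name are hypothetical placeholders.

```python
# Sketch of a num_threads scaling sweep (illustrative; the data path and
# "label" column name are placeholders, not from the original report).
import time
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("train_sample.parquet")   # hypothetical sample of the real dataset
X, y = df.drop(columns=["label"]), df["label"]

params = {
    "objective": "binary",
    "num_leaves": 512,
    "max_depth": 22,
    "learning_rate": 0.09,
    "max_bin": 200,
    "verbosity": -1,
}

train_set = lgb.Dataset(X, label=y)

for num_threads in (1, 8, 16, 32, 70):
    start = time.perf_counter()
    lgb.train({**params, "num_threads": num_threads}, train_set, num_boost_round=50)
    print(f"num_threads={num_threads}: {time.perf_counter() - start:.1f}s")
```

If the per-thread timings from a sweep like this flatten out early, that points at the training step itself rather than I/O or preprocessing, and is the kind of detail that helps maintainers diagnose threading issues.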

jameslamb commented 4 days ago

Thanks for using LightGBM. I've attempted to reformat your post a bit to make it easier to read... if you are new to markdown / GitHub, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some tips on making such changes yourself.

You haven't provided enough information yet for us to help you with this report.