microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[CUDA] illegal memory access when using CUDA and large max_bin and large dataset #6512

Open LZhen0711 opened 3 months ago

LZhen0711 commented 3 months ago

Description

When using the CUDA histogram implementation from the master branch, the simple Python code below reports an illegal memory access error if a large max_bin is used.

Reproducible example

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm

# 4,000,000 rows x 50 features; the crash shows up with a large dataset and a large max_bin
X, y = make_regression(n_samples=4000000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)

Running it reports the following error:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 15000
[LightGBM] [Info] Number of data points in the train set: 3000000, number of used features: 50
[LightGBM] [Info] Start training from score 0.023500
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

Aborted

Environment info

GPU: NVIDIA GeForce RTX 3060
Python: 3.12.4
LightGBM version or commit hash: master branch
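
A quick way to capture the exact versions in play (a sketch for reference, not part of the original report; assumes lightgbm is importable and nvidia-smi is on the PATH):

import subprocess
import sys

import lightgbm

# Print the Python and LightGBM versions plus the GPU name and driver version.
print("Python:", sys.version.split()[0])
print("LightGBM:", lightgbm.__version__)
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip())

The environment itself was built from the Dockerfile below: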

# FROM nvidia/cuda:8.0-cudnn5-devel
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

#################################################################################################################
#           Global
#################################################################################################################
# Make apt-get skip interactive post-install configuration steps (DEBIAN_FRONTEND=noninteractive plus apt-get install -y).

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive

#################################################################################################################
#           Global Path Setting
#################################################################################################################

ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/lib

ENV OPENCL_LIBRARIES /usr/local/cuda/lib64
ENV OPENCL_INCLUDE_DIR /usr/local/cuda/include

#################################################################################################################
#           TINI
#################################################################################################################

# Install tini
ENV TINI_VERSION v0.14.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

#################################################################################################################
#           SYSTEM
#################################################################################################################
# update: downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their
# dependencies. It will do this for all repositories and PPAs.

RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
git \
vim \
mercurial \
subversion \
cmake \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
gcc \
g++

# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

#################################################################################################################
#           CONDA
#################################################################################################################

ARG CONDA_DIR=/opt/miniforge
# add to path
ENV PATH $CONDA_DIR/bin:$PATH

# Install miniforge
RUN echo "export PATH=$CONDA_DIR/bin:"'$PATH' > /etc/profile.d/conda.sh && \
    curl -sL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -o ~/miniforge.sh && \
    /bin/bash ~/miniforge.sh -b -p $CONDA_DIR && \
    rm ~/miniforge.sh

RUN conda config --set always_yes yes --set changeps1 no && \
    conda create -y -q -n py3 numpy scipy scikit-learn jupyter notebook ipython pandas matplotlib

#################################################################################################################
#           LightGBM
#################################################################################################################

RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
    git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4 && cd ..

ENV PATH /usr/local/src/lightgbm/LightGBM:${PATH}

RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --precompile && source deactivate"

#################################################################################################################
#           System CleanUp
#################################################################################################################
# apt-get autoremove: removes packages that were automatically installed to satisfy dependencies for other packages and are no longer needed.
# apt-get clean: removes the apt cache in /var/cache/apt/archives. You'd be amazed how much is in there! The only drawback is that the packages
# have to be downloaded again if you reinstall them.

RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    conda clean -a -y

#################################################################################################################
#           JUPYTER
#################################################################################################################

# password: keras
# password key: --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824'

# Add a notebook profile.
RUN mkdir -p -m 700 ~/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py

VOLUME /home
WORKDIR /home

# IPython
EXPOSE 8888

ENTRYPOINT [ "/tini", "--" ]
CMD /bin/bash -c "source activate py3 && jupyter notebook --allow-root --no-browser --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824' && source deactivate"


Ryednap commented 3 months ago

I am also encountering similar issues when using a large dataset with CUDA. I have verified this behavior on at least 3 different machines. Every time I get similar logs before the Python script or notebook crashes. [screenshot of the crash log]

In my case, I have a dataset with 11 million rows, close to 1 GB in size. I am unsure whether large bins are the reason, because it crashes even with default settings. Here's my small setup:

import lightgbm

fixed_params = {
    "objective": "binary",
    "metric": "auc",
    "boosting_type": "gbdt",
    "data_sample_strategy": "bagging",
    "num_iterations": 5000,
    "device_type": "cuda",
    "random_state": 6241,
    "force_row_wise": True,
    "bagging_seed": 113,
    "early_stopping_rounds": 100,
    "verbose": 2,
}

gbm = lightgbm.train(
    fixed_params,
    train_pool,
    valid_sets=[valid_pool],
    valid_names=["valid"],
)
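
For context, train_pool and valid_pool above are lightgbm.Dataset objects; a minimal sketch with placeholder data (shapes and column counts are assumptions, not from the post):

import numpy as np
import lightgbm

# Stand-in data; the real dataset has roughly 11 million rows.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 20))
y_train = rng.integers(0, 2, size=1000)
X_valid = rng.random((200, 20))
y_valid = rng.integers(0, 2, size=200)

train_pool = lightgbm.Dataset(X_train, label=y_train)
valid_pool = lightgbm.Dataset(X_valid, label=y_valid, reference=train_pool)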

Here's the LightGBM log before it crashes: [screenshot of the training log]

Here's my environment info:

  1. Driver Version: 535.104.05 CUDA Version: 12.2
  2. lightgbm==4.4.0 but I have verified that this behavior is the same in v4.2.0.
  3. T4 GPU on Colab with 15 GB of GPU RAM.
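
Not from the thread, but a small sanity check that might help narrow this down, assuming the crash is specific to the CUDA tree learner and the bin count (X_train, y_train as in the reproducible example above):

import lightgbm as lgbm

# Two comparison runs: the CPU device (does not touch the CUDA code path at all),
# and CUDA with the default max_bin of 255 to isolate the max_bin dependence.
for extra in ({"device": "cpu"}, {"device": "cuda", "max_bin": 255}):
    model = lgbm.LGBMRegressor(n_estimators=10, **extra)
    model.fit(X_train, y_train)
    print(extra, "-> finished without a CUDA error")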