A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Description
We've been having issues with segfaults and crashes when training LambdaMART models with LightGBM inside Docker containers. Sometimes when the container starts up and loads LightGBM everything works fine; other times it segfaults immediately. I can't discern any pattern to when it works vs. when it doesn't, but since the application is containerized everything ought to be the same. I was able to get a traceback for this with GDB a couple of months back, and although I unfortunately didn't save the message, it had something to do with the Dense4BitsBin() method. For whatever reason I'm no longer able to get a traceback with GDB, so Dense4BitsBin() may end up being a red herring.
Note that I'm unable to reproduce the problem locally (on Arch Linux) or in containers based on other distros (e.g. Debian from python:3.6-slim). As far as I can tell, it only occurs in Alpine.
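Since GDB no longer yields a traceback, one possible fallback is Python's standard-library faulthandler module, which dumps the Python-level stack when the interpreter receives a fatal signal such as SIGSEGV. The following is only a sketch: the helper name enable_crash_traceback is hypothetical, and it is an assumption that it would be called at the top of train_model.py before lightgbm is imported. It shows which Python frame was active at the crash, not the native C++ frames, so it would not by itself confirm or rule out Dense4BitsBin().

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python stack of all threads to `stream` on a fatal signal.

    Hypothetical helper for illustration; `stream` must be a real file
    object with a file descriptor, because faulthandler writes to it
    directly from the signal handler.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The same effect is available without code changes by starting the interpreter with python -X faulthandler or by setting the PYTHONFAULTHANDLER environment variable in the Dockerfile.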
Environment info
Operating System: Alpine Linux 3.7
CPU: various
Python: 3.6
Error Message:
Segmentation fault (core dumped) and/or container crash
Reproducible examples
segfault.zip
The zip file contains a Dockerfile and code to train a model in Python, along with a bash script that runs the model training 1000 times in a row. After some number of iterations train_model.py will segfault, and typically at some point the container will completely stop responding.
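For readers without the attachment, the repeat-until-it-crashes idea can be sketched in Python. This is not the bash script from segfault.zip, only an illustration of the same loop; the function name count_segfaults is hypothetical. It relies on the documented POSIX behaviour of subprocess, where a negative return code -N means the child was terminated by signal N.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs):
    """Run `cmd` `runs` times and return how many runs died from SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(cmd)
        # On POSIX, returncode == -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Calling count_segfaults([sys.executable, "train_model.py"], runs=1000) inside the container would mirror the 1000 iterations described above, assuming train_model.py is in the working directory. If the container itself stops responding, the loop will hang along with it, so this only counts the segfaults that occur before that point.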
Steps to reproduce
docker build -f Dockerfile -t "alpine-lightgbm-test" .
docker run alpine-lightgbm-test