microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[Python] Crashes and/or segfaults in Alpine Docker containers #1227

jmriebold closed this issue 6 years ago

jmriebold commented 6 years ago

Description

We've been seeing segfaults and crashes when training LambdaMART models with LightGBM inside Docker containers. Sometimes the container starts up, loads LightGBM, and everything works fine; other times it segfaults immediately. I can't discern any pattern to when it works and when it doesn't, even though the application is containerized and the environment ought to be identical on every run. I was able to get a traceback with GDB a couple of months back and, although I unfortunately didn't save the message, it had something to do with the Dense4BitsBin() method. For whatever reason I'm no longer able to get a traceback with GDB, so Dense4BitsBin() may turn out to be a red herring.

Note that I'm unable to reproduce the problem locally (on Arch Linux) or in containers based on other distros (e.g. Debian, via python:3.6-slim). As far as I can tell, it only occurs on Alpine.

Environment info

Operating System: Alpine Linux 3.7
CPU: various
Python: 3.6

Error Message:

Segmentation fault (core dumped) and/or container crash

Reproducible examples

segfault.zip

The zip file contains a Dockerfile and code to train a model in Python, along with a bash script that runs the model training 1000 times in a row. After some number of iterations train_model.py will segfault, and typically at some point the container will completely stop responding.
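For readers without the archive, the repeat-until-crash idea can be sketched in Python instead of bash. This is not the script from segfault.zip; the function name run_until_crash and its parameters are made up for illustration. It relies on the POSIX behaviour of subprocess, where a negative return code means the child was killed by that signal.

```python
import subprocess


def run_until_crash(cmd, max_runs=1000):
    """Run `cmd` up to `max_runs` times.

    Returns (iteration, signal_number) for the first run that was killed
    by a signal, or None if no run was killed by a signal.
    """
    for iteration in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        # On POSIX, returncode == -N means the child died from signal N
        # (N == signal.SIGSEGV for a segmentation fault).
        if result.returncode < 0:
            return iteration, -result.returncode
    return None
```

Calling run_until_crash([sys.executable, "train_model.py"]) inside the container would then report which iteration first died and from which signal. It does not detect the case where the whole container stops responding.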

Steps to reproduce

  1. Unzip the archive and cd into the directory
  2. docker build -f Dockerfile -t "alpine-lightgbm-test" .
  3. docker run alpine-lightgbm-test

guolinke commented 6 years ago

@jmriebold can you provide the core dump log of that crash?
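If GDB will not produce a traceback, one low-effort fallback is Python's standard-library faulthandler module, which prints the Python-level stack when the process receives a fatal signal such as SIGSEGV. A minimal sketch follows, assuming it is placed at the top of train_model.py; the helper name enable_crash_tracebacks is made up for illustration. It shows only Python frames, not native LightGBM frames, so it can narrow down which call was in flight but does not replace a core dump.

```python
import faulthandler
import sys


def enable_crash_tracebacks():
    """Dump the Python traceback of every thread to stderr on a fatal
    signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


enable_crash_tracebacks()
```

The same effect is available without editing the script by running python -X faulthandler train_model.py or by setting the PYTHONFAULTHANDLER environment variable in the container.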

guolinke commented 6 years ago

Since it runs fine on most other operating systems, I think the problem is on the Alpine side. @jameslamb feel free to reopen if needed.