h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

segfault on Ubuntu 20.04 when in combination with LightGBM #2453

Closed arnocandel closed 3 years ago

arnocandel commented 4 years ago

# on host
cd /tmp/
wget https://files.slack.com/files-pri/T0329MHH6-F013VU6RW94/download/dt_lgb.gz?pub_secret=fb7b5f3988
mv 'dt_lgb.gz?pub_secret=fb7b5f3988' dt_lgb.gz
tar xfz dt_lgb.gz
docker pull ubuntu:20.04
docker run -t -v `pwd`:/tmp --security-opt seccomp=unconfined -i ubuntu:20.04 /bin/bash

# on Ubuntu 20.04
chmod 1777 /tmp
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y software-properties-common
add-apt-repository -y ppa:deadsnakes/ppa
apt-get update
apt-get install -y python3.6 python3.6-dev virtualenv libgomp1 gdb vim valgrind

# repro failure
virtualenv -p python3.6 blah
source blah/bin/activate
pip install datatable
pip install lightgbm
pip install pandas
cd /tmp/
python lgb_prefit_df669346-4e47-4ecf-b131-0838ae8f9474.py

fails with:

/blah/lib/python3.6/site-packages/lightgbm/basic.py:1295: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
/blah/lib/python3.6/site-packages/lightgbm/basic.py:842: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
Segmentation fault (core dumped)

arnocandel commented 4 years ago

(blah) root@8b5e9ef6251f:/tmp# gdb blah/bin/python core-python.6234.8b5e9ef6251f.1590087160
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from blah/bin/python...
(No debugging symbols found in blah/bin/python)

warning: core file may not match specified executable file.
[New LWP 6234]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python lgb_prefit_df669346-4e47-4ecf-b131-0838ae8f9474.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd24cbb2e65 in PyArray_FromAny (op=0x7ffd07d9ff80, newtype=0x1ae8970, min_depth=131722224, max_depth=31176544, flags=28215752, context=<optimized out>)
    at numpy/core/src/multiarray/ctors.c:1946
1946    numpy/core/src/multiarray/ctors.c: No such file or directory.
(gdb) bt
#0  0x00007fd24cbb2e65 in PyArray_FromAny (op=0x7ffd07d9ff80, newtype=0x1ae8970, min_depth=131722224, max_depth=31176544, flags=28215752, context=<optimized out>)
    at numpy/core/src/multiarray/ctors.c:1946
#1  0x00007ffd07d9ee70 in ?? ()
#2  0x00007ffd07d9ee60 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x00007ffd00000002 in ?? ()
#5  0x0000000000000000 in ?? ()

arnocandel commented 4 years ago

without import datatable, it runs fine (up to the point where it fails due to missing GPU compilation, ignore that)

arnocandel commented 4 years ago

works fine in Ubuntu 18.04

lightgbm.basic.LightGBMError: GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1

^ means it didn't segfault, so good

st-pasha commented 4 years ago

The crash occurs at lightgbm/basic.py:1713 __init__(), in the call to _LIB.LGBM_BoosterCreate(). The dataset passed in is a NumPy array of shape (23999, 24) and dtype float32.

It is not clear why importing datatable up-front causes any change in behavior, since lightgbm itself tries to import datatable at startup...

arnocandel commented 4 years ago

https://github.com/microsoft/LightGBM/blob/44a9120197fbfc7f3bfb9349e4eb2e9443c1971b/python-package/lightgbm/compat.py#L95-L109 yeah
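For readers who do not follow the link: the referenced lines wrap the datatable import in a try/except so that LightGBM works with or without datatable installed. Below is a minimal sketch of that optional-import pattern. The helper name optional_import is hypothetical; this is not LightGBM's actual code.

```python
import importlib


def optional_import(name):
    """Return (module, True) if `name` can be imported, else (None, False).

    Hypothetical helper: it only illustrates the try/except pattern and is
    not part of LightGBM or datatable.
    """
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False


# A package's native extension is loaded on the first successful import,
# so whichever module triggers that import first decides where the shared
# library lands in the process's load order.
datatable, DATATABLE_INSTALLED = optional_import("datatable")
```

This is why an explicit import datatable at the top of a script can change the load order even though lightgbm would import datatable on its own later.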

arnocandel commented 4 years ago

also, when I compile LightGBM on Ubuntu 20.04, it works fine

st-pasha commented 4 years ago

If I set a breakpoint for LGBM_BoosterCreate, I see the following stacktrace:

(gdb) bt
#0  0x00007fffc470cc50 in LGBM_BoosterCreate () from /tmp/blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so
#1  0x00007ffff6cc4ff5 in ?? () from /lib/x86_64-linux-gnu/libffi.so.7
#2  0x00007ffff6cc440a in ?? () from /lib/x86_64-linux-gnu/libffi.so.7
#3  0x00007ffff6cdd414 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#4  0x00007ffff6cdc590 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#5  0x00000000004c1074 in _PyObject_FastCallKeywords ()
#6  0x00000000005343b6 in ?? ()
#7  0x000000000052d6cc in _PyEval_EvalFrameDefault ()
#8  0x000000000052d02e in ?? ()
#9  0x0000000000535d29 in _PyFunction_FastCallDict ()
#10 0x00000000004c1d19 in _PyObject_Call_Prepend ()
#11 0x00000000004c13c5 in PyObject_Call ()
#12 0x00000000004ff700 in ?? ()
#13 0x00000000004fcaa6 in ?? ()
#14 0x00000000004c1074 in _PyObject_FastCallKeywords ()
#15 0x00000000005343b6 in ?? ()
#16 0x000000000052e54d in _PyEval_EvalFrameDefault ()
#17 0x000000000052c893 in ?? ()
#18 0x000000000053531f in ?? ()
#19 0x0000000000534341 in ?? ()
#20 0x000000000052e54d in _PyEval_EvalFrameDefault ()
#21 0x000000000052d02e in ?? ()
#22 0x000000000053531f in ?? ()
#23 0x0000000000534341 in ?? ()
#24 0x000000000052e54d in _PyEval_EvalFrameDefault ()
#25 0x000000000052c893 in ?? ()
#26 0x000000000053531f in ?? ()
#27 0x0000000000534341 in ?? ()
#28 0x000000000052e54d in _PyEval_EvalFrameDefault ()
#29 0x0000000000535295 in ?? ()
#30 0x0000000000534341 in ?? ()
#31 0x000000000052d6cc in _PyEval_EvalFrameDefault ()
#32 0x000000000052c893 in ?? ()
#33 0x00000000005bec37 in PyEval_EvalCode ()
#34 0x0000000000573ae3 in ?? ()
#35 0x0000000000573efb in PyRun_FileExFlags ()
#36 0x0000000000573cdd in PyRun_SimpleFileExFlags ()
#37 0x000000000057a9bc in Py_Main ()
#38 0x00000000004b40f5 in main ()

which is perfectly reasonable, and what I would normally expect.

Then, setting a breakpoint on PyArray_FromAny leads to a segfault inside that function without the breakpoint ever being triggered:

(gdb) b PyArray_FromAny
Breakpoint 4 at 0x7ffff648ca80: file numpy/core/src/multiarray/ctors.c, line 1892.
(gdb) c
Continuing.

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff648ce65 in PyArray_FromAny (op=0x7fffffffb2c0, newtype=0x171c6d0, min_depth=-24784, max_depth=27204240, flags=24233768, context=<optimized out>) at numpy/core/src/multiarray/ctors.c:1946
1946                Py_DECREF(dtype);
(gdb) 

The backtrace is no longer valid:

(gdb) bt
#0  0x00007ffff648ce65 in PyArray_FromAny (op=0x7fffffffb2c0, newtype=0x171c6d0, min_depth=-24784, max_depth=27204240, flags=24233768, context=<optimized out>)
    at numpy/core/src/multiarray/ctors.c:1946
#1  0x00007fffffffa1b0 in ?? ()
#2  0x00007fffffffa1a0 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x00007fff00000002 in ?? ()
#5  0x0000000000000000 in ?? ()

st-pasha commented 4 years ago

I suspect that the issue is actually in the order of module loading. If I rearrange imports as

import _pickle as pickle
import pandas as pd
import datatable as dt
import numpy as np

then there is no longer a crash

arnocandel commented 4 years ago

yes, confirmed, works in DAI too if pandas imported first
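The workaround above depends on every entry point keeping the same import order. One way to pin it in a single place is sketched below. This is an assumption-laden sketch: preload and PRELOAD_ORDER are hypothetical names, and the order simply mirrors the one reported to work in this thread.

```python
import importlib

# Order reported in this thread to avoid the crash: pandas before datatable.
PRELOAD_ORDER = ["pandas", "datatable", "numpy"]


def preload(names):
    """Import modules in the given order; return the names actually loaded."""
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # package not installed; skip it
        loaded.append(name)
    return loaded
```

Calling preload(PRELOAD_ORDER) at the very top of the script, before lightgbm is imported, would keep the order stable. It only hides the symptom and does not address the underlying symbol problem.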

st-pasha commented 4 years ago

I ran info sharedlibrary in gdb, and it appears that the same set of dynamic libraries is loaded in both cases; only their order differs (naturally). This is the diff:

--- a/dtpd
+++ b/pddt
@@ -6,34 +6,26 @@
 /lib/x86_64-linux-gnu/libz.so.1
 /lib/x86_64-linux-gnu/libm.so.6
 /lib/x86_64-linux-gnu/libc.so.6
-/tmp/blah/lib/python3.6/site-packages/datatable/lib/_datatable.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libstdc++.so.6
-/lib/x86_64-linux-gnu/libgcc_s.so.1
-/usr/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libbz2.so.1.0
-/usr/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/liblzma.so.5
-/usr/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libcrypto.so.1.1
-/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libffi.so.7
-/usr/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so
-/usr/lib/python3.6/lib-dynload/_curses.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libncursesw.so.6
-/lib/x86_64-linux-gnu/libtinfo.so.6
-/usr/lib/python3.6/lib-dynload/termios.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
 /tmp/blah/lib/python3.6/site-packages/numpy/core/../../numpy.libs/libgfortran-ed201abd.so.3.0.0
 /tmp/blah/lib/python3.6/site-packages/numpy/core/_multiarray_tests.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libffi.so.7
 /tmp/blah/lib/python3.6/site-packages/numpy/linalg/lapack_lite.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/linalg/_umath_linalg.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libbz2.so.1.0
+/usr/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/liblzma.so.5
 /usr/lib/python3.6/lib-dynload/_decimal.cpython-36m-x86_64-linux-gnu.so
 /lib/x86_64-linux-gnu/libmpdec.so.2
 /tmp/blah/lib/python3.6/site-packages/numpy/fft/_pocketfft_internal.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/random/mtrand.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libcrypto.so.1.1
 /tmp/blah/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so
@@ -63,6 +55,7 @@
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/tslib.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/interval.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/algos.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/properties.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/hashing.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/ops.cpython-36m-x86_64-linux-gnu.so
@@ -76,6 +69,8 @@
 /usr/lib/python3.6/lib-dynload/mmap.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/reshape.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/window/aggregations.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libstdc++.so.6
+/lib/x86_64-linux-gnu/libgcc_s.so.1
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/window/indexers.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/groupby.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/reduction.cpython-36m-x86_64-linux-gnu.so
@@ -83,6 +78,11 @@
 /usr/lib/python3.6/lib-dynload/_csv.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/json.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/pandas/_libs/testing.cpython-36m-x86_64-linux-gnu.so
+/tmp/blah/lib/python3.6/site-packages/datatable/lib/_datatable.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_curses.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libncursesw.so.6
+/lib/x86_64-linux-gnu/libtinfo.so.6
+/usr/lib/python3.6/lib-dynload/termios.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/scipy/_lib/_ccallback_c.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/scipy/_lib/_uarray/_uarray.cpython-36m-x86_64-linux-gnu.so
 /tmp/blah/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so

My guess is that this could be a name clash between globally defined symbols. Not sure how to proceed at this point...
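The comparison above (same set of libraries, different order) can be repeated on saved info sharedlibrary output with a few lines of Python. This is a minimal sketch: compare_load_order is a hypothetical helper, and the library names in the toy input are shortened placeholders, not real gdb output.

```python
import difflib


def compare_load_order(before, after):
    """Compare two ordered lists of shared-library paths.

    Returns (same_set, diff_lines): same_set is True when both runs loaded
    exactly the same libraries, and diff_lines is a unified diff of the order.
    """
    same_set = set(before) == set(after)
    diff_lines = list(difflib.unified_diff(
        before, after, fromfile="dtpd", tofile="pddt", lineterm=""))
    return same_set, diff_lines


# Toy input: hypothetical, shortened library names.
dt_first = ["libc.so.6", "_datatable.so", "libstdc++.so.6", "_multiarray_umath.so"]
pd_first = ["libc.so.6", "_multiarray_umath.so", "libstdc++.so.6", "_datatable.so"]

same_set, diff_lines = compare_load_order(dt_first, pd_first)
print("same set of libraries:", same_set)
print("\n".join(diff_lines))
```

When same_set is True but the diff is non-empty, only the load order changed. That is the situation described here, and it is what points towards symbol resolution order rather than a missing or extra library.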

st-pasha commented 4 years ago

@arnocandel So, in summary:

At this point it is unclear how to debug this problem any further, nor whether it is even possible to fix it within datatable.

arnocandel commented 4 years ago

We could at least compile lightgbm “the normal way” with debug symbols, and bring that build into Ubuntu 20.04.

arnocandel commented 4 years ago

I got this in valgrind (DAI):

==771409== Conditional jump or move depends on uninitialised value(s)
==771409==    at 0x63F90D1: std::random_device::_M_getval() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==771409==    by 0x7F16CDC9: operator() (random.h:1620)
==771409==    by 0x7F16CDC9: LightGBM::Random::Random() (random.h:25)
==771409==    by 0x7F303BD8: LightGBM::SerialTreeLearner::SerialTreeLearner(LightGBM::Config const*) (serial_tree_learner.cpp:29)
==771409==    by 0x7F2DD5AF: LightGBM::GPUTreeLearner::GPUTreeLearner(LightGBM::Config const*) (gpu_tree_learner.cpp:23)
==771409==    by 0x7F31537F: LightGBM::TreeLearner::CreateTreeLearner(std::string const&, std::string const&, LightGBM::Config const*) (tree_learner.cpp:26)
==771409==    by 0x7F17DF6D: LightGBM::GBDT::Init(LightGBM::Config const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) (gbdt.cpp:78)
==771409==    by 0x7F1302D9: Booster (c_api.cpp:81)
==771409==    by 0x7F1302D9: LGBM_BoosterCreate (c_api.cpp:1023)
==771409==    by 0x4857FF4: ??? (in /usr/lib/x86_64-linux-gnu/libffi.so.7.1.0)
==771409==    by 0x4857409: ??? (in /usr/lib/x86_64-linux-gnu/libffi.so.7.1.0)
==771409==    by 0x6972413: _ctypes_callproc (in /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so)
==771409==    by 0x697158F: ??? (in /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so)
==771409==    by 0x4C1073: _PyObject_FastCallKeywords (in /usr/bin/python3.6)
==771409==

arnocandel commented 4 years ago

and this from core file (DAI)

(gdb) bt
#0  0x00007f25539a8c40 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007f25539aa77b in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007f25558c9fa6 in __GI___backtrace (array=array@entry=0x7ffe82ee3110, size=size@entry=256) at backtrace.c:116
#3  0x00007f2553acc8ad in catch_segfault (signal=11, info=<optimized out>, ctx=0x7ffe82ee4900) at ../debug/segfault.c:102
#4  <signal handler called>
#5  0x0000000000000003 in ?? ()
#6  0x00007f23fa34edca in std::random_device::operator() (this=0x7ffe82ee4fc0) at /usr/include/c++/4.8.2/bits/random.h:1620
#7  LightGBM::Random::Random (this=0x5cea250) at /root/repo/LightGBM/include/LightGBM/utils/random.h:25
#8  0x00007f23fa4e5bd9 in LightGBM::SerialTreeLearner::SerialTreeLearner (this=0x5cea220, config=0x5cc1ae0) at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp:29
#9  0x00007f23fa4bf5b0 in LightGBM::GPUTreeLearner::GPUTreeLearner (this=0x5cea220, config=<optimized out>) at /root/repo/LightGBM/src/treelearner/gpu_tree_learner.cpp:23
#10 0x00007f23fa4f7380 in LightGBM::TreeLearner::CreateTreeLearner (Python Exception <class 'gdb.error'> No type named class std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep.: 
learner_type=, device_type=..., config=0x5cc1ae0) at /root/repo/LightGBM/src/treelearner/tree_learner.cpp:26
#11 0x00007f23fa35ff6e in LightGBM::GBDT::Init (this=this@entry=0x5cbd880, config=config@entry=0x5cd50e0, train_data=<optimized out>, objective_function=<optimized out>, 
    training_metrics=std::vector of length 1, capacity 1 = {...}) at /root/repo/LightGBM/src/boosting/gbdt.cpp:78
#12 0x00007f23fa3122da in LightGBM::Booster::Booster (parameters=<optimized out>, train_data=<optimized out>, this=0x5cd50a0) at /root/repo/LightGBM/src/c_api.cpp:81
#13 LGBM_BoosterCreate (train_data=<optimized out>, parameters=<optimized out>, out=0x7f23fa78c340) at /root/repo/LightGBM/src/c_api.cpp:1023
#14 0x00007f25550d7ff5 in ?? () from /usr/lib/x86_64-linux-gnu/libffi.so.7
#15 0x00007f25550d740a in ?? () from /usr/lib/x86_64-linux-gnu/libffi.so.7
#16 0x00007f2554a3c414 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#17 0x00007f2554a3b590 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#18 0x00000000004c1074 in _PyObject_FastCallKeywords ()
#19 0x00000000005343b6 in ?? ()

arnocandel commented 4 years ago

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/libstdc++/api/a01452_source.html line 1620 shows that we're in the _GLIBCXX_USE_RANDOM_TR1 branch (which is set to 1 by default), which might mean that something related to /dev/random is to blame

arnocandel commented 4 years ago

(blah) root@6600f2b85f9e:/tmp# valgrind --trace-children=yes --tool=memcheck python lgb_prefit_df669346-4e47-4ecf-b131-0838ae8f9474.py

shows same thing in Docker:

==5931== Conditional jump or move depends on uninitialised value(s)
==5931==    at 0x5D4A0D1: std::random_device::_M_getval() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==5931==    by 0x3F0EFA40: LightGBM::Random::Random() (in /blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so)
==5931==    by 0x3F28F8FD: LightGBM::SerialTreeLearner::SerialTreeLearner(LightGBM::Config const*) (in /blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so)
==5931==    by 0x3F2A1F27: LightGBM::TreeLearner::CreateTreeLearner(std::string const&, std::string const&, LightGBM::Config const*) (in /blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so)
==5931==    by 0x3F0FF8AA: LightGBM::GBDT::Init(LightGBM::Config const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) (in /blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so)
==5931==    by 0x3F0B0F86: LGBM_BoosterCreate (in /blah/lib/python3.6/site-packages/lightgbm/lib_lightgbm.so)
==5931==    by 0x6704FF4: ??? (in /usr/lib/x86_64-linux-gnu/libffi.so.7.1.0)
==5931==    by 0x6704409: ??? (in /usr/lib/x86_64-linux-gnu/libffi.so.7.1.0)
==5931==    by 0x66F0413: _ctypes_callproc (in /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so)
==5931==    by 0x66EF58F: ??? (in /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so)
==5931==    by 0x4C1073: _PyObject_FastCallKeywords (in /usr/bin/python3.6)
==5931==    by 0x5343B5: ??? (in /usr/bin/python3.6)

arnocandel commented 4 years ago

But the /dev/random value is probably expected to be uninitialized, so this can be ignored. It's just interesting that it shows up (almost) at the end of the stack trace from the core file.

st-pasha commented 4 years ago

According to https://en.cppreference.com/w/cpp/numeric/random/random_device, the function of the std::random_device is to provide implementation-defined "true" random numbers (the implementation is allowed to fall back to pseudo-randoms if no better alternative is available). On Linux this involves reading from the "random device" file (https://code.woboq.org/gcc/libstdc++-v3/src/c++11/random.cc.html#134 <-- this might not be the same version as used in Ubuntu20).

As such, it is perfectly normal for a random device to use some uninitialized value to further scramble bits of whatever is read from /dev/random. The random device's job is to be as unpredictable as possible.

So I think we shouldn't be worrying too much about this report from valgrind.

st-pasha commented 4 years ago

On the other hand, this part of the core dump I find suspicious:

#4  <signal handler called>
#5  0x0000000000000003 in ?? ()
#6  0x00007f23fa34edca in std::random_device::operator() (this=0x7ffe82ee4fc0) at /usr/include/c++/4.8.2/bits/random.h:1620
#7  LightGBM::Random::Random (this=0x5cea250) at /root/repo/LightGBM/include/LightGBM/utils/random.h:25

As we already know, line 1620 of random.h simply calls this->_M_getval(); (frame 6). However, frame 5 says 0x0000000000000003 in ??, which is not a valid function address. At the same time, the LightGBM source at frame 7 looks perfectly fine (https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/random.h). So I suspect that the std::random_device function pointer table is somehow corrupt. Perhaps the wrong library version is loaded, or the symbol is not versioned properly across different library versions?

arnocandel commented 4 years ago

yes, that's consistent with another segfault I just got, this time from TF:

#0  0x00007ffdb9a0c8cc in ?? ()
#1  0x00007f418104992f in tensorflow::random::(anonymous namespace)::InitRngWithRandomSeed() ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/../libtensorflow_framework.so
#2  0x00007f4181049c25 in tensorflow::random::New64() ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/../libtensorflow_framework.so
#3  0x00007f4180fb3e6d in tensorflow::Device::BuildDeviceAttributes(std::string const&, tensorflow::DeviceType, tensorflow::gtl::IntType<tensorflow::Bytes_tag_, long long>, tensorflow::DeviceLocality const&, std::string const&) ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/../libtensorflow_framework.so
#4  0x00007f4180fdfa43 in tensorflow::GraphRunner::GraphRunner(tensorflow::Env*) ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/../libtensorflow_framework.so
#5  0x00007f41891d4cc7 in tensorflow::ShapeRefiner::ShapeRefiner(int, tensorflow::OpRegistryInterface const*) ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/_pywrap_tensorflow_internal.so
#6  0x00007f4183e03d6d in TF_Graph::TF_Graph() ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/_pywrap_tensorflow_internal.so
#7  0x00007f4183e03e7e in TF_NewGraph ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/_pywrap_tensorflow_internal.so
#8  0x00007f4183b8cd69 in _wrap_TF_NewGraph ()
   from /home/arno/h2oai/env/lib/python3.6/site-packages/tensorflow_gpu/python/_pywrap_tensorflow_internal.so
st-pasha commented 4 years ago

Based on our investigation, the problem appears to be rooted in Ubuntu 20.04's system libraries, specifically the std::random_device class, which is not versioned properly across different versions of the standard C++ library even though its ABI differs between them.

Since there seems to be nothing we can do about it within datatable, I'm closing this issue. It should probably be re-raised with either the Ubuntu or the GCC team.

jgarvin commented 4 years ago

@st-pasha In a totally separate project I am encountering the same _M_getval error while trying to upgrade GCC. Could you provide any more details about how you figured out it's a library versioning problem? Trying to figure out if this totally sinks my chances of being able to upgrade (because many APIs are affected) or if just coming up with my own way to create a random seed will be a sufficient hack.

sh1ng commented 3 years ago

@st-pasha

I'm facing this with mojo when datatable is installed from PyPI (0.11.1). If I install it from https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pydatatable/0.11.0a0.dev193/x86_64-centos7/datatable-0.11.0a0.dev193-cp36-cp36m-linux_x86_64.whl instead, the segfault goes away.

I'm reopening this, as the current release of datatable seems incompatible with TF and mojo.

#0  0x000000000000000b in ?? ()
#1  0x00007fffa368d22f in tensorflow::random::(anonymous namespace)::InitRngWithRandomSeed() () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow_framework.so.1
#2  0x00007fffa368d535 in tensorflow::random::New64() () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow_framework.so.1
#3  0x00007fffa35afabd in tensorflow::Device::BuildDeviceAttributes(std::string const&, tensorflow::DeviceType, tensorflow::gtl::IntType<tensorflow::Bytes_tag_, long long>, tensorflow::DeviceLocality const&, std::string const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow_framework.so.1
#4  0x00007fffa362e285 in tensorflow::NewSingleThreadedCpuDevice(tensorflow::Env*) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow_framework.so.1
#5  0x00007fffa35e02a4 in tensorflow::GraphRunner::GraphRunner(tensorflow::Env*) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow_framework.so.1
#6  0x00007fff9a9a72e3 in tensorflow::ShapeRefiner::ShapeRefiner(int, tensorflow::OpRegistryInterface const*) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow.so
#7  0x00007fff968db48d in TF_Graph::TF_Graph() () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow.so
#8  0x00007fff968db5ae in TF_NewGraph () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/tf_1.15.5/libtensorflow.so
#9  0x00007ffff43653f9 in mojo::Tf_Scorer::tf_load_model(std::string const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#10 0x00007ffff42c424d in mojo::Transform<(mojo::spec::Transformation::TypeCase)34>::Transform(mojo::spec::Transformation const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::map<std::string, mojo::flex_type_enum, std::less<std::string>, std::allocator<std::pair<std::string const, mojo::flex_type_enum> > > const&, std::map<std::string, unsigned long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long> > > const&, std::string const&, std::map<std::string, mojo::spec::Transformation const*, std::less<std::string>, std::allocator<std::pair<std::string const, mojo::spec::Transformation const*> > > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#11 0x00007ffff43266ff in mojo::trans_gene(mojo::spec::Transformation const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::map<std::string, mojo::flex_type_enum, std::less<std::string>, std::allocator<std::pair<std::string const, mojo::flex_type_enum> > > const&, std::map<std::string, unsigned long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long> > > const&, std::string const&, std::map<std::string, mojo::spec::Transformation const*, std::less<std::string>, std::allocator<std::pair<std::string const, mojo::spec::Transformation const*> > > const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#12 0x00007ffff4329c56 in mojo::MojoPipeline::init(std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, std::string const&, std::string const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#13 0x00007ffff432a5d5 in mojo::MojoPipeline::MojoPipeline(std::string const&, std::string const&, std::string const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#14 0x00007ffff41c3f1e in cppmojo::cppmojo(std::string const&, std::string const&) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#15 0x00007ffff41cfc79 in pybind11::cpp_function::initialize<pybind11::detail::initimpl::constructor<std::string const&, std::string const&>::execute<pybind11::class_<cppmojo>, , 0>(pybind11::class_<cppmojo>&)::{lambda(pybind11::detail::value_and_holder&, std::string const&, std::string const&)#1}, void, pybind11::detail::value_and_holder&, std::string const&, std::string const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor>(pybind11::class_<cppmojo>&&, void (*)(pybind11::detail::value_and_holder&, std::string const&, std::string const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#16 0x00007ffff41e0529 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so
#17 0x0000555555630b59 in _PyCFunction_FastCallDict (func_obj=func_obj@entry=0x7ffff6b70d38, args=args@entry=0x7fffffffac30, nargs=nargs@entry=3, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:231
#18 0x00005555555db3e9 in _PyObject_FastCallDict (func=0x7ffff6b70d38, args=0x7fffffffac30, nargs=3, kwargs=0x0) at Objects/abstract.c:2313
#19 0x00005555555db515 in _PyObject_Call_Prepend (func=0x7ffff6b70d38, obj=<optimised out>, args=0x7fffd2a0dc08, kwargs=0x0) at Objects/abstract.c:2373
#20 0x00005555555db15e in PyObject_Call (func=0x7ffff6b089c8, args=<optimised out>, kwargs=<optimised out>) at Objects/abstract.c:2261
#21 0x000055555564dd72 in slot_tp_init (self=0x7fffd29eff10, args=0x7fffd2a0dc08, kwds=0x0) at Objects/typeobject.c:6420
#22 0x00005555556454b7 in type_call (type=<optimised out>, args=0x7fffd2a0dc08, kwds=0x0) at Objects/typeobject.c:915
#23 0x00007ffff41de8eb in pybind11_meta_call () from /home/sh1ng/dev/mojo2/mojo_test/lib/python3.6/site-packages/daimojo/cppmojo.cpython-36m-x86_64-linux-gnu.so

sh1ng commented 3 years ago

Let me know if you need any help to reproduce it.

st-pasha commented 3 years ago

@sh1ng We will be making a new release shortly, but other than that, there is nothing that we can fix within datatable to address this problem.

aws-taylor commented 3 years ago

For anyone else who happens to stumble upon this: I ran into a similar (possibly the same) issue, which occurred because tensorflow was binding certain std:: symbols from another library (apache/tvm) instead of from libstdc++, as one might expect. In particular, there appear to have been some changes related to the random number logic in more recent libstdc++ versions. I was able to track this down by running my program with env LD_DEBUG=bindings <program> and noticing that some symbols were being bound to the wrong library.

More details: I was building tvm inside a https://github.com/pypa/manylinux docker container. Manylinux uses Red Hat's Developer Toolset to maintain compatibility with specific libstdc++ versions. My understanding is that devtoolset will sometimes statically link std:: symbols if they do not exist in the target libstdc++. It appears that some of the libstdc++ features related to random numbers are relatively new and not present in older libstdc++ versions, so I was ending up with std::random_device coming from libtvm.so instead of libstdc++.so. Even this shouldn't cause a problem, because the symbols from different libraries should be isolated from each other, but TVM loads itself using RTLD_GLOBAL, polluting the global namespace - https://github.com/apache/tvm/blob/dfe4cebbdadab3d4e6e6ba3951276a51a4ffeaf6/python/tvm/_ffi/base.py#L57.
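The LD_DEBUG=bindings output is very large, so filtering it helps. The sketch below assumes glibc's usual line shape, "binding file A [0] to B [0]: normal symbol `name'", and reports symbols containing random_device that resolve to anything other than libstdc++. The helper name suspicious_bindings, the regular expression and the sample lines are illustrative assumptions, not output captured from this issue.

```python
import re

# Assumed shape of a glibc LD_DEBUG=bindings line (the version tag is optional):
#   <pid>: binding file <user> [0] to <provider> [0]: normal symbol `<name>'
BINDING_RE = re.compile(
    r"binding file (\S+) \[\d+\] to (\S+) \[\d+\]: \w+ symbol `([^']+)'")


def suspicious_bindings(lines, needle="random_device", expected="libstdc++"):
    """Return (symbol, user, provider) for matching symbols that resolve
    to a library other than the expected one."""
    hits = []
    for line in lines:
        m = BINDING_RE.search(line)
        if not m:
            continue
        user, provider, symbol = m.groups()
        if needle in symbol and expected not in provider:
            hits.append((symbol, user, provider))
    return hits


# Made-up sample lines in the assumed format (paths are illustrative).
sample = [
    "  4242:\tbinding file /x/lib_lightgbm.so [0] to /lib/libstdc++.so.6 [0]: "
    "normal symbol `_ZNSt13random_device9_M_getvalEv'",
    "  4242:\tbinding file /x/libtensorflow_framework.so [0] to /x/libtvm.so [0]: "
    "normal symbol `_ZNSt13random_device9_M_getvalEv'",
]
for symbol, user, provider in suspicious_bindings(sample):
    print(f"{user} resolves {symbol} from {provider}")
```

Symbol names in the real output are mangled, which is why the filter matches on a substring. The dynamic loader writes this output to stderr by default; setting LD_DEBUG_OUTPUT to a file name should redirect it to a per-process file that can then be read line by line.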

albertz commented 1 year ago

@aws-taylor I'm also getting a similar error, with PyTorch, on Ubuntu 22.04 (https://github.com/rwth-i6/returnn/issues/1339). Can you provide some details on how to interpret the output of LD_DEBUG=bindings, i.e. what to look for? It's a massive flood of information. I'm also puzzled, because I just use the official Python 3.10 from Ubuntu and did pip install torch, so that is the official binary too; how can it be broken? I assume some other library loaded alongside it causes this: when I import torch directly, it does not crash, but when I run it through pytest, which likely imports a few other things first, the import torch crashes.