rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] segfault and hang in testing #2960

Open pseudotensor opened 3 years ago

pseudotensor commented 3 years ago

Describe the bug

Segfault during cuml testing

Steps/Code to reproduce bug

wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.8.3-Linux-x86_64.sh
chmod ug+x Miniconda3-py38_4.8.3-Linux-x86_64.sh
unset PYTHONPATH
./Miniconda3-py38_4.8.3-Linux-x86_64.sh -b -p $HOME/miniconda3
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda install -c rapidsai -c nvidia -c conda-forge cudf=0.15 cuml=0.15 python=3.8 nomkl cudatoolkit=10.2 cusignal==0.15 custreamz==0.15 matplotlib scikit-learn pandas numpy cugraph=0.15 networkx scipy umap-learn statsmodels dask-ml -y
conda install -c rapidsai -c nvidia nvtabular
git clone https://github.com/rapidsai/cuml.git
cd cuml
git checkout branch-0.15
git submodule update --init --remote --recursive
cd cuml ; ln -sf python/cuml/test . ; cd .. ; pytest -s -v cuml/test

Many tests pass, but then hits:

Rank 1: Performing Broadcast
Root Rank is 0
Rank 0: Performing Broadcast
Rank 0: Exchanging results
Rank 1: Performing Local KNN
Rank 1: Exchanging results
Root Rank is 1
Rank 1: Performing Broadcast
Rank 1: Performing Local KNN
Rank 0: Performing Reduce
Rank 0: Finished Reduce
Rank 0: Performing Broadcast
Rank 0: Exchanging results
[1602534898.995053] [mr-dl10:19191:0]           sock.c:344  UCX  ERROR sendv(fd=79) failed: Bad address
distributed.worker - WARNING -  Compute Failed
Function:  _func_predict
args:      (KNeighborsClassifierMG(batch_size=256), [], [(1, 335), (1, 335)], 670, [array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])], [(0, 165), (1, 165)], 330, array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]]), [5, 5], 30, 0, True, False)
kwargs:    {}
Exception: RuntimeError('Exception occured! file=/opt/conda/envs/rapids/conda-bld/libcuml_1598469363086/work/cpp/comms/std/src/ucp_helper.h line=194: unable to send UCX data message (-3)\n\nObtained 32 stack frames\n#0 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon9Exception16collectCallStackEv+0x3e) [0x7f39f2af366e]\n#1 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon9ExceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x7f39f2af41e1]\n#2 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcumlcomms.so(_ZNK2ML24cumlStdCommunicator_impl5isendEPKviiiPj+0x13f5) [0x7f3badaac605]\n#3 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg10knn_common16exchange_resultsIiEEvRN8MLCommon11buffer_baseIT_NS4_15deviceAllocatorEEERNS5_IlS7_EERNS5_IfS7_EERKNS4_16cumlCommunicatorEiSt3setIiSt4lessIiESaIiEEP11CUstream_stmiii+0x332) [0x7f39f2f16a12]\n#4 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg10knn_common7opg_knnIiEEvRNS_10cumlHandleEPSt6vectorIPN8MLCommon6Matrix4DataIT_EESaISC_EEPS6_IPNS9_IlEESaISH_EEPS6_IPNS9_IfEESaISM_EERSO_RNS8_14PartDescriptorESQ_SS_RS6_IS6_IPSA_SaIST_EESaISV_EEbbiimbPS6_IS6_IPfSaISZ_EESaIS11_EEPS6_IPiSaIS15_EEPS6_IiSaIiEEb+0xe15) [0x7f39f2f0fcf5]\n#5 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg12knn_classifyERNS_10cumlHandleEPSt6vectorIPN8MLCommon6Matrix4DataIiEESaIS9_EEPS4_IPNS7_IlEESaISE_EEPS4_IPNS7_IfEESaISJ_EEPS4_IS4_IPfSaISN_EESaISP_EERSL_RNS6_14PartDescriptorEST_SV_RS4_IS4_IPiSaISW_EESaISY_EERSY_RS4_IiSaIiEEbbbimb+0x75) [0x7f39f2f1b7e5]\n#6 in /home/jon/miniconda3/lib/python3.8/site-packages/cuml/neighbors/kneighbors_classifier_mg.cpython-38-x86_64-linux-gnu.so(+0x13f63) [0x7f3b827a8f63]\n#7 in 
/home/jon/miniconda3/bin/python(_PyObject_MakeTpCall+0x22f) [0x561eff60a50f]\n#8 in /home/jon/miniconda3/bin/python(+0x18bca1) [0x561eff658ca1]\n#9 in /home/jon/miniconda3/bin/python(+0xffbfd) [0x561eff5ccbfd]\n#10 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#11 in /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]\n#12 in /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]\n#13 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#14 in /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]\n#15 in /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]\n#16 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#17 in /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]\n#18 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#19 in /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]\n#20 in /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]\n#21 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#22 in /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]\n#23 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#24 in /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]\n#25 in /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]\n#26 in /home/jon/miniconda3/bin/python(+0x18bbe7) [0x561eff658be7]\n#27 in /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]\n#28 in /home/jon/miniconda3/bin/python(+0x23571e) [0x561eff70271e]\n#29 in /home/jon/miniconda3/bin/python(+0x1e2408) [0x561eff6af408]\n#30 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f3be6f3e6db]\n#31 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f3be6c67a3f]\n')

Rank 1: Exchanging results
Rank 1: Performing Reduce
Rank 1: Finished Reduce
FAILED
cuml/test/dask/test_kneighbors_classifier.py::test_predict[dataset0-256-None-6-dask_cudf] [W] [13:34:59.340402] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:34:59.341179] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:34:59.343318] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:34:59.513772] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
Rank 1: Performing Broadcast
Root Rank is 0
Rank 0: Performing Broadcast
Rank 0: Performing Local KNN
Rank 1: Performing Local KNN
Rank 1: Exchanging results
Root Rank is 1
Rank 1: Performing Broadcast
Rank 0: Exchanging results
Rank 0: Performing Reduce
Rank 0: Finished Reduce
Rank 0: Performing Broadcast
Rank 1: Performing Local KNN
Rank 0: Performing Local KNN
Rank 1: Exchanging results
Rank 0: Exchanging results
Rank 1: Performing Reduce
Rank 1: Finished Reduce
PASSED
cuml/test/dask/test_kneighbors_classifier.py::test_predict[dataset0-256-2-1-dask_array] [W] [13:35:00.678495] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:00.679305] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:00.681717] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:00.860252] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:01.318088] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:01.318270] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [13:35:01.319033] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
Root Rank is 0
Rank 0: Performing Broadcast
[W] [13:35:01.320128] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
Rank 1: Performing Broadcast
Rank 0: Exchanging results
Rank 1: Performing Local KNN
Rank 1: Exchanging results
Root Rank is 1
Rank 1: Performing Broadcast
Rank 1: Performing Local KNN
Rank 0: Performing Reduce
Rank 0: Finished Reduce
Rank 0: Performing Broadcast
Rank 0: Exchanging results
[mr-dl10:19191:0:19280] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  19280) ====
 0  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x12d) [0x7f3b82d6850d]
 1  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x23951) [0x7f3b82d68951]
 2  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x23b22) [0x7f3b82d68b22]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7f3be6f498a0]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0xbb734) [0x7f3be6c01734]
 5  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../.././libuct.so.0(uct_tcp_ep_am_short+0x175) [0x7f3b82d27165]
 6  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../../libucp.so.0(+0x3b3db) [0x7f3b82dc83db]
 7  /home/jon/miniconda3/lib/python3.8/site-packages/ucp/_libs/../../../../libucp.so.0(ucp_tag_send_nb+0x5e4) [0x7f3b82ddab94]
 8  /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcumlcomms.so(_ZNK2ML24cumlStdCommunicator_impl5isendEPKviiiPj+0x701) [0x7f3badaab911]
 9  /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg10knn_common16exchange_resultsIiEEvRN8MLCommon11buffer_baseIT_NS4_15deviceAllocatorEEERNS5_IlS7_EERNS5_IfS7_EERKNS4_16cumlCommunicatorEiSt3setIiSt4lessIiESaIiEEP11CUstream_stmiii+0x332) [0x7f39f2f16a12]
10  /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg10knn_common7opg_knnIiEEvRNS_10cumlHandleEPSt6vectorIPN8MLCommon6Matrix4DataIT_EESaISC_EEPS6_IPNS9_IlEESaISH_EEPS6_IPNS9_IfEESaISM_EERSO_RNS8_14PartDescriptorESQ_SS_RS6_IS6_IPSA_SaIST_EESaISV_EEbbiimbPS6_IS6_IPfSaISZ_EESaIS11_EEPS6_IPiSaIS15_EEPS6_IiSaIiEEb+0xe15) [0x7f39f2f0fcf5]
11  /home/jon/miniconda3/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3KNN3opg12knn_classifyERNS_10cumlHandleEPSt6vectorIPN8MLCommon6Matrix4DataIiEESaIS9_EEPS4_IPNS7_IlEESaISE_EEPS4_IPNS7_IfEESaISJ_EEPS4_IS4_IPfSaISN_EESaISP_EERSL_RNS6_14PartDescriptorEST_SV_RS4_IS4_IPiSaISW_EESaISY_EERSY_RS4_IiSaIiEEbbbimb+0x75) [0x7f39f2f1b7e5]
12  /home/jon/miniconda3/lib/python3.8/site-packages/cuml/neighbors/kneighbors_classifier_mg.cpython-38-x86_64-linux-gnu.so(+0x13f63) [0x7f3b827a8f63]
13  /home/jon/miniconda3/bin/python(_PyObject_MakeTpCall+0x22f) [0x561eff60a50f]
14  /home/jon/miniconda3/bin/python(+0x18bca1) [0x561eff658ca1]
15  /home/jon/miniconda3/bin/python(+0xffbfd) [0x561eff5ccbfd]
16  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
17  /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]
18  /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]
19  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
20  /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]
21  /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]
22  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
23  /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]
24  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
25  /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]
26  /home/jon/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x2003) [0x561eff690733]
27  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
28  /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]
29  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
30  /home/jon/miniconda3/bin/python(+0xffb08) [0x561eff5ccb08]
31  /home/jon/miniconda3/bin/python(_PyFunction_Vectorcall+0x10b) [0x561eff65856b]
32  /home/jon/miniconda3/bin/python(+0x18bbe7) [0x561eff658be7]
33  /home/jon/miniconda3/bin/python(PyVectorcall_Call+0x71) [0x561eff609cf1]
34  /home/jon/miniconda3/bin/python(+0x23571e) [0x561eff70271e]
35  /home/jon/miniconda3/bin/python(+0x1e2408) [0x561eff6af408]
36  /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f3be6f3e6db]
37  /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f3be6c67a3f]
=================================
distributed.nanny - WARNING - Restarting worker
Rank 1: Exchanging results
Rank 1: Performing Reduce
Rank 1: Finished Reduce

^Cdistributed.nanny - WARNING - Restarting worker
^C

I hit CTRL-C because it hung for more than an hour.
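To keep a hung run from blocking a session for an hour, the test command can be wrapped in a wall-clock limit. The following is a minimal sketch using only the Python standard library; the helper name `run_with_timeout` and the 600-second limit mentioned afterwards are illustrative assumptions, not part of cuML or its test tooling.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and waits for the direct child before re-raising.
        # Grandchildren (for example Dask workers) may need separate cleanup.
        return None
    return completed.returncode


# Hypothetical usage, limited to the test file named in the log above:
# run_with_timeout(
#     [sys.executable, "-m", "pytest", "-s", "-v",
#      "cuml/test/dask/test_kneighbors_classifier.py"],
#     timeout_s=600,
# )
```

A `None` result then distinguishes a hang from an ordinary test failure, which returns a nonzero exit code.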

Expected behavior

Tests should not segfault or hang.

Environment details (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0  On |                  N/A |
| 23%   38C    P8    17W / 250W |    650MiB / 11178MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
| 20%   44C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

conda-list.txt
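For reports like this one, package versions can also be gathered programmatically alongside the conda listing. The sketch below assumes each package exposes standard Python distribution metadata, which is not guaranteed for every conda package; `collect_versions` and `format_report` are hypothetical helper names.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if no metadata is found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(packages):
    """Build a short plain-text report: interpreter line first, then one line per package."""
    lines = [f"python {platform.python_version()} on {platform.platform()}"]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name}: {version or 'not found'}")
    return "\n".join(lines)


# Hypothetical usage:
# print(format_report(["cuml", "cudf", "dask", "distributed"]))
```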

viclafargue commented 3 years ago

Thank you for opening the issue. The MNMG KNN Classifier & Regressor models are known to have bugs in 0.15. These should hopefully be fixed by the comprehensive bug fix #2844, which will ship in the 0.16 release. It would be worth checking that everything works with the 0.16 binaries available from the rapidsai-nightly channel, though.
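Before re-running the tests, it can help to confirm which version is actually installed. This is a minimal sketch that compares only the leading major.minor components of a version string; the helper names are hypothetical, and a passing check says nothing about whether a particular fix is included in a given nightly build.

```python
import re


def parse_major_minor(text):
    """Return the leading 'major.minor' of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version_text, minimum=(0, 16)):
    """True if version_text is at or above the given (major, minor) minimum."""
    return parse_major_minor(version_text) >= minimum


# Hypothetical usage, assuming cuml is importable in the active environment:
# import cuml
# print(is_at_least(cuml.__version__))
```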

pseudotensor commented 3 years ago

Hi @viclafargue, thanks.

What is the release strategy for RAPIDS? Will bugs never be fixed in prior versions? The problem with such a strategy is that new features in each release may introduce new bugs and break things, which could leave RAPIDS in a permanently unstable state. It might be safer to fix minor bugs in the prior release so that RAPIDS is a stable platform.

i.e. some kind of LTSness at a basic level would be good, so some particular version is stable without clear failures.

viclafargue commented 3 years ago

That's a good point. It might indeed be worthwhile to apply patches and provide long-term support for specific versions every few release cycles, especially for people and companies that need stability to host services built on RAPIDS software. I don't know whether RAPIDS is willing to put a lot of effort into this, though, as it is still in a growth phase. Tagging @JohnZed

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.