Open wphicks opened 4 years ago
Just want to add my error log to this bug. Ran into this issue this week with the test cuml/test/test_kmeans.py::test_traditional_kmeans_plus_plus_init. Below are the command I used to run the test, after building with ./build.sh -g, and the resulting stack trace.
~/Repos/rapids/cuml-dev2/python$ pytest -s --rootdir $PWD --timeout=3000 --ignore=cuml/pytest_benchmarks --ignore=cuml/raft --ignore=cuml/test/dask --ignore=cuml/test/test_cuml_decorators.py --ignore=cuml/test/test_make_blobs.py --ignore=cuml/test/test_arima.py --ignore=cuml/test/test_benchmark.py cuml/test/test_kmeans.py::test_traditional_kmeans_plus_plus_init
================================================================================ test session starts ================================================================================
platform linux -- Python 3.8.5, pytest-6.1.0, py-1.9.0, pluggy-0.13.1
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/mdemoret/Repos/rapids/cuml-dev2/python, configfile: pytest.ini
plugins: asyncio-0.12.0, benchmark-3.2.3, timeout-1.4.2, hypothesis-5.28.0, cov-2.10.1
timeout: 3000.0s
timeout method: signal
timeout func_only: False
collected 20 items
cuml/test/test_kmeans.py terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/home/mdemoret/Repos/rapids/cuml-dev2/cpp/build/raft/src/raft/cpp/include/raft/mr/host/allocator.hpp line=48: call='cudaFreeHost(p)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
Obtained 64 stack frames
#0 in /home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/raft/common/handle.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f3c92f120b6]
#1 in /home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/raft/common/handle.cpython-38-x86_64-linux-gnu.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f3c92f127d9]
#2 in /home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/raft/common/handle.cpython-38-x86_64-linux-gnu.so(_ZN4raft2mr4host17default_allocator10deallocateEPvmP11CUstream_st+0x12c) [0x7f3c92f1294c]
#3 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN4raft2mr11buffer_baseIfNS0_4host9allocatorEE7releaseEv+0x5d) [0x7f3cbd0ff99f]
#4 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN4raft2mr11buffer_baseIfNS0_4host9allocatorEED1Ev+0x18) [0x7f3cbd0fdcc0]
#5 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN4raft2mr4host6bufferIfED2Ev+0x18) [0x7f3cbd100fde]
#6 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN2ML6kmeans6detail14kmeansPlusPlusIfiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EENS_8Distance12DistanceTypeERNS3_2mr6device6bufferIcEERNSJ_ISB_EEP11CUstream_st+0x1e77) [0x7f3cbd21da1e]
#7 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN2ML6kmeans4impl18initKMeansPlusPlusIfiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS3_2mr6device6bufferISB_EERNSH_IcEE+0xb2) [0x7f3cbd23c289]
#8 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN2ML6kmeans4impl3fitIfiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiSC_PSA_RSA_Ri+0xde0) [0x7f3cbd214505]
#9 in /home/mdemoret/anaconda3/envs/cuml_dev2/lib/libcuml++.so(_ZN2ML6kmeans11fit_predictERKN4raft8handle_tERKNS0_12KMeansParamsEPKfiiS9_PfPiRfRi+0x50) [0x7f3cbd1d8742]
#10 in /home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/cluster/kmeans.cpython-38-x86_64-linux-gnu.so(+0x37be0) [0x7f3c92e42be0]
#11 in /home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/cluster/kmeans.cpython-38-x86_64-linux-gnu.so(+0x3ace1) [0x7f3c92e45ce1]
#12 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyObject_MakeTpCall+0x3bf) [0x558e752b34af]
#13 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x167190) [0x558e752ea190]
#14 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4fa1) [0x558e75365f11]
#15 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#16 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#17 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(PyObject_Call+0x2e9) [0x558e752b96a9]
#18 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x1f12) [0x558e75362e82]
#19 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x929) [0x558e7534cbc9]
#20 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#21 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(PyObject_Call+0x7d) [0x558e752b943d]
#22 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x1f12) [0x558e75362e82]
#23 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x929) [0x558e7534cbc9]
#24 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#25 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x15b6) [0x558e75362526]
#26 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#27 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#28 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4fa1) [0x558e75365f11]
#29 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x1b7) [0x558e7534d6a7]
#30 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x16701e) [0x558e752ea01e]
#31 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4fa1) [0x558e75365f11]
#32 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#33 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#34 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyObject_FastCallDict+0xe7) [0x558e752df3f7]
#35 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x196efb) [0x558e75319efb]
#36 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyObject_MakeTpCall+0x3bf) [0x558e752b34af]
#37 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x5492) [0x558e75366402]
#38 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x1b7) [0x558e7534d6a7]
#39 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x558e7536142f]
#40 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x1b7) [0x558e7534d6a7]
#41 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(PyObject_Call+0x7d) [0x558e752b943d]
#42 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x1f12) [0x558e75362e82]
#43 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x929) [0x558e7534cbc9]
#44 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#45 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x15b6) [0x558e75362526]
#46 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#47 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#48 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4fa1) [0x558e75365f11]
#49 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x1b7) [0x558e7534d6a7]
#50 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x16701e) [0x558e752ea01e]
#51 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x4fa1) [0x558e75365f11]
#52 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#53 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#54 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyObject_FastCallDict+0xe7) [0x558e752df3f7]
#55 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x196efb) [0x558e75319efb]
#56 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(PyObject_Call+0x452) [0x558e752b9812]
#57 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x1f12) [0x558e75362e82]
#58 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x929) [0x558e7534cbc9]
#59 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#60 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalFrameDefault+0x71a) [0x558e7536168a]
#61 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyEval_EvalCodeWithName+0x260) [0x558e7534c500]
#62 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(_PyFunction_Vectorcall+0x594) [0x558e7534da84]
#63 in /home/mdemoret/anaconda3/envs/cuml_dev2/bin/python(+0x16701e) [0x558e752ea01e]
Fatal Python error: Aborted
Current thread 0x00007f3db8688740 (most recent call first):
File "/home/mdemoret/Repos/rapids/cuml-dev2/python/cuml/test/test_kmeans.py", line 101 in test_traditional_kmeans_plus_plus_init
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/python.py", line 1627 in runtest
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 256 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 310 in from_call
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_and_report
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 127 in runtestprotocol
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/main.py", line 313 in _main
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/main.py", line 257 in wrap_session
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/mdemoret/anaconda3/envs/cuml_dev2/lib/python3.8/site-packages/_pytest/config/__init__.py", line 164 in main
File "/home/mdemoret/anaconda3/envs/cuml_dev2/bin/pytest", line 11 in <module>
Aborted (core dumped)
I did try adding several CUDA_CHECK(cudaStreamSynchronize(stream)); calls in kmeansPlusPlus to get a more precise location of the bug (since I doubt it's actually from cudaFreeHost), but didn't have any luck. I believe it might be thrown from the destructor of one of the objects allocated on the stack in kmeansPlusPlus. Just food for thought once someone looks into this.
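For readers unfamiliar with why the error is attributed to cudaFreeHost at all: errors from asynchronous GPU work can be returned by a later, unrelated runtime call, which is the reason for inserting extra synchronization points. The following is a toy Python model of that behaviour only. ToyStream, launch, synchronize, and run are hypothetical names, none of this is the CUDA or cuML API, and it does not explain why the extra synchronizations did not help in this particular case.

```python
# Toy model only: this is NOT the CUDA runtime API.
class ToyStream:
    def __init__(self):
        self._pending = []  # queued (name, fn) pairs, not yet executed

    def launch(self, name, fn):
        """Enqueue work; like a kernel launch, it returns before the work runs."""
        self._pending.append((name, fn))

    def synchronize(self, where):
        """Run all queued work; any failure is reported at this call site."""
        pending, self._pending = self._pending, []
        for _name, fn in pending:
            try:
                fn()
            except Exception as exc:
                raise RuntimeError(f"error reported at {where!r}") from exc


def bad_kernel():
    raise MemoryError("illegal memory access")


def run(sync_after_each):
    """Return where the failure is reported, with or without extra syncs."""
    stream = ToyStream()
    try:
        for name, fn in [("kernel_a", lambda: None), ("kernel_b", bad_kernel)]:
            stream.launch(name, fn)
            if sync_after_each:
                stream.synchronize(f"sync after {name}")
        # Cleanup call that happens to be the first point where errors surface.
        stream.synchronize("free_host")
    except RuntimeError as err:
        return str(err)
    return "ok"


print(run(sync_after_each=False))
print(run(sync_after_each=True))
```

Without the extra synchronization the toy failure is blamed on the cleanup step; with it, the report moves next to the work that actually failed.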
Oh great! Thanks @mdemoret-nv. Having a single pytest invocation as a reproducer will make this quite a bit easier to investigate.
The minimal reproducer that I put in the initial issue report no longer reproduces this on branch-0.17, but the problem still exists. I'm continuing to look into it.
Hi,
I am opening the ticket again. I think I encountered the same bug in a Kaggle competition. I made a public version of the notebook: https://www.kaggle.com/thomasmeiner/rapids-illegal-memory-access-bug
You can see the error at the very bottom.
The notebook uses RAPIDS embedded in my automl library e2eml. So the source code for this is here: https://github.com/ThomasMeissnerDS/e2e_ml/blob/develop/e2eml/full_processing/cpu_preprocessing.py
In this blueprint, the library runs KMeans clustering in a loop, applying a different number of components in each iteration.
My issue is solved. In my script, KMeans tried to fit again during inference, and since the batch size was 1, the fit failed. So my only wish here would be an error message hinting at that (like in sklearn).
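The kind of error message requested above could look roughly like the following sketch. check_n_samples is a hypothetical helper, not part of cuML, and the message wording is only modeled loosely on what scikit-learn reports when there are fewer samples than clusters.

```python
def check_n_samples(X, n_clusters):
    """Hypothetical pre-fit check; not part of cuML's API.

    Raises a descriptive ValueError instead of letting the fit fail
    later with a low-level error.
    """
    n_samples = len(X)
    if n_samples < n_clusters:
        raise ValueError(
            f"n_samples={n_samples} should be >= n_clusters={n_clusters}."
        )
    return n_samples


# A batch of size 1, as in the inference case described above.
batch = [[0.1, 0.2, 0.3]]
try:
    check_n_samples(batch, n_clusters=8)
except ValueError as err:
    print(err)
```

Calling such a check before fitting would turn the batch-size-1 case into an immediate, readable failure.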
Describe the bug
When cuml is built with ./build.sh -g and the unit tests are run, multiple tests fail with "illegal memory access" errors.
Steps/Code to reproduce bug
As far as I can tell, this bug cannot be reproduced by running any single unit test. It only manifests when multiple tests are run together, but it fails every time when the whole suite is run. Shuffling the tests causes them to fail on a new test each time. The minimally reproducible invocation I could find was:
Note that those tests will successfully execute in some other orders (different random seeds).
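Because the failure depends on which tests run together and in what order, one way to search for a small reproducer is to greedily drop tests from a failing order while the failure persists. The sketch below is only an illustration of that idea: reduce_order and still_fails are hypothetical names, and in practice still_fails would have to run pytest on the given test IDs in that exact order and report whether the crash occurred. Here it is replaced by a fake predicate.

```python
def reduce_order(tests, still_fails):
    """Greedily drop tests while the remaining ordered list still fails."""
    current = list(tests)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            # Try the same order with one test removed.
            candidate = current[:i] + current[i + 1:]
            if candidate and still_fails(candidate):
                current = candidate
                changed = True
                break
    return current


def fake_still_fails(order):
    """Stand-in predicate: 'fails' only if test_a runs before test_c."""
    if "test_a" not in order or "test_c" not in order:
        return False
    return order.index("test_a") < order.index("test_c")


failing_order = ["test_x", "test_a", "test_y", "test_c", "test_z"]
print(reduce_order(failing_order, fake_still_fails))
```

The result is minimal only in the sense that removing any single remaining test makes the failure disappear; it assumes the failure is deterministic for a fixed order.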
Expected behavior
Debug build successfully executes unit tests.
Environment details: