rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] nn.kneighbors CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #5570

Open phact opened 1 year ago

phact commented 1 year ago

Describe the bug

When running kneighbors (brute-force algorithm, two_pass_precision=True) against a 2M-record dataset (querying with all 2M records from the training set), it fails with:

CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
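
For a sense of scale, here is a rough back-of-the-envelope estimate of the buffers such a query involves. It is only a sketch: the shape (2156337 x 384) is taken from the trace in this report, while the float32 element size, the int64 index size, and the value of K are assumptions, since the report does not state them. It does not diagnose the error; it only shows how the dataset, the result arrays, and a hypothetical full pairwise distance matrix compare in size.

```python
# Rough memory estimate for a brute-force k-NN query.
# The shape comes from the "[2156337 rows x 384 columns]" line in the trace;
# dtypes and K are ASSUMPTIONS, not values stated in the report.

N_ROWS = 2_156_337   # rows in the dataset (from the trace)
N_COLS = 384         # columns in the dataset (from the trace)
K = 10               # hypothetical n_neighbors
FLOAT_BYTES = 4      # assumes float32 features and distances
INDEX_BYTES = 8      # assumes int64 neighbor indices
GIB = 1024 ** 3


def dataset_bytes(rows, cols, itemsize=FLOAT_BYTES):
    """Bytes needed to hold a dense rows x cols matrix."""
    return rows * cols * itemsize


def knn_result_bytes(n_queries, k, dist_itemsize=FLOAT_BYTES, idx_itemsize=INDEX_BYTES):
    """Bytes for the (n_queries x k) distance and index outputs together."""
    return n_queries * k * (dist_itemsize + idx_itemsize)


def full_distance_matrix_bytes(n_queries, n_index, itemsize=FLOAT_BYTES):
    """Bytes for an untiled n_queries x n_index pairwise distance matrix."""
    return n_queries * n_index * itemsize


data = dataset_bytes(N_ROWS, N_COLS)
results = knn_result_bytes(N_ROWS, K)
naive = full_distance_matrix_bytes(N_ROWS, N_ROWS)

print(f"dataset:                {data / GIB:.2f} GiB")
print(f"k-NN outputs (K={K}):    {results / GIB:.2f} GiB")
print(f"full distance matrix:   {naive / GIB:.0f} GiB")
```

Under these assumptions the dataset alone is a little over 3 GiB, and an untiled distance matrix would be several orders of magnitude larger, so any brute-force search at this size has to work in tiles or batches.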

I'm using RMM with a managed memory resource:

import rmm
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

Full stack trace:

[2156337 rows x 384 columns]
Traceback (most recent call last):
  File "/cuMLKNN.py", line 67, in <module>
    distances, indices = nn.kneighbors(sample_df, two_pass_precision=True)
  File "/venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
  File "venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "base.pyx", line 665, in cuml.internals.base.UniversalBase.dispatch_func
  File "nearest_neighbors.pyx", line 535, in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors
  File "nearest_neighbors.pyx", line 651, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_internal
  File "venv/lib/python3.10/site-packages/cupy/_sorting/sort.py", line 116, in argsort
    return a.argsort(axis=axis)
  File "cupy/_core/core.pyx", line 874, in cupy._core.core._ndarray_base.argsort
  File "cupy/_core/core.pyx", line 891, in cupy._core.core._ndarray_base.argsort
  File "cupy/_core/_routines_sorting.pyx", line 88, in cupy._core._routines_sorting._ndarray_argsort
  File "cupy/_core/core.pyx", line 611, in cupy._core.core._ndarray_base.copy
  File "cupy/_core/core.pyx", line 570, in cupy._core.core._ndarray_base.astype
  File "cupy/_core/core.pyx", line 132, in cupy._core.core.ndarray.__new__
  File "cupy/_core/core.pyx", line 220, in cupy._core.core._ndarray_base._init
  File "cupy/cuda/memory.pyx", line 740, in cupy.cuda.memory.alloc
  File "venv/lib/python3.10/site-packages/rmm/allocators/cupy.py", line 37, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
  File "device_buffer.pyx", line 85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/managed_memory_resource.hpp:74: cudaErrorIllegalAddress an illegal memory access was encountered
CUDA call='cudaEventDestroy(event_)' at file=/__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/_deps/raft-src/cpp/include/raft/core/resource/cuda_event.hpp line=33 failed with an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Error in sys.excepthook:

Environment details (please complete the following information):
Environment location: Bare-metal
Linux Distro/Architecture: Ubuntu 22.04 amd64
GPU Model/Driver: NVIDIA RTX A5500 Laptop GPU
CUDA: 12.2
Method of cuDF & cuML install: cuda toolkit from repo

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

cjnolet commented 1 year ago

Thanks for opening an issue about this, @phact. Could you share the whole script you are running, so that we have a minimal reproducible example (MRE) to work with on our side?