rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[TRACKER][BUG] Integer-based indexing causing failures when data grows very large. #2459

Open cjnolet opened 4 years ago

cjnolet commented 4 years ago

Below is a reproducible example on the 0.15 nightly of cuML w/ CUDA 10.2.

There are several estimators that fail with either "an illegal memory access was encountered" or "invalid configuration argument", both of which suggest an integer overflow in the representation of the underlying array's size.

import rmm
# Managed (unified) memory lets the ~40 GB dataset oversubscribe GPU memory
rmm.reinitialize(managed_memory=True, pool_allocator=False)
import cupy as cp
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
import numpy as np
a = np.random.random((1000000, 5000))  # 5e9 float64 elements (~40 GB)
from cuml.decomposition import PCA
pca = PCA()
pca.fit(a)

Here's the exception:

[W] [10:33:38.606739] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cuml/decomposition/pca.pyx", line 394, in cuml.decomposition.pca.PCA.fit
RuntimeError: Exception occured! file=/conda/conda-bld/libcuml_1591208799167/work/cpp/src_prims/common/buffer_base.hpp line=55: FAIL: call='cudaStreamSynchronize(_stream)'. Reason:an illegal memory access was encountered
Obtained 19 stack frames
#0 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/pointer_utils.cpython-37m-x86_64-linux-gnu.so(_ZN8MLCommon9Exception16collectCallStackEv+0x3e) [0x7fb84624613e]
#1 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/pointer_utils.cpython-37m-x86_64-linux-gnu.so(_ZN8MLCommon9ExceptionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x80) [0x7fb846246c50]
#2 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon11buffer_baseIdNS_15deviceAllocatorEEC2ESt10shared_ptrIS1_EP11CUstream_stm+0x164) [0x7fade2e2d314]
#3 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitIdEEvRKNS_15cumlHandle_implEPT_S5_S5_S5_S5_S5_S5_RKNS_9paramsPCAEP11CUstream_st+0x103) [0x7fade30da133]
#4 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitERNS_10cumlHandleEPdS2_S2_S2_S2_S2_S2_RKNS_9paramsPCAE+0x5a) [0x7fade30cf07a]
#5 in /raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x12a8f) [0x7fb843a62a8f]
#6 in python(_PyObject_FastCallKeywords+0x15c) [0x560920306bec]
#7 in python(+0x181661) [0x560920307661]
#8 in python(_PyEval_EvalFrameDefault+0x48a2) [0x56092034d762]
#9 in python(_PyEval_EvalCodeWithName+0x255) [0x56092029f505]
#10 in python(PyEval_EvalCode+0x23) [0x5609202a08f3]
#11 in python(+0x227692) [0x5609203ad692]
#12 in python(+0xf0758) [0x560920276758]
#13 in python(PyRun_InteractiveLoopFlags+0xeb) [0x5609202768fd]
#14 in python(+0xf0998) [0x560920276998]
#15 in python(+0xf12ac) [0x5609202772ac]
#16 in python(_Py_UnixMain+0x3c) [0x5609203b8aec]
#17 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fb870cc1b97]
#18 in python(+0x1d773d) [0x56092035d73d]

In the above example, I used a size of 1M x 5000, which is >> 2^31 total elements. I also tried 1M x 2500, which is also > 2^31 elements.
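For reference, the raw element counts next to the signed 32-bit limit (plain arithmetic, runnable as-is):

print(2**31 - 1)       # 2147483647 (INT_MAX)
print(1000000 * 5000)  # 5000000000  >> 2^31
print(1000000 * 2500)  # 2500000000  >  2^31
print(1000000 * 2100)  # 2100000000  <  2^31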

Just to rule out the possibility that this error only occurs when oversubscribing GPU memory, I tried 1M x 2100 (which is < 2^31 elements but still requires > 32 GB of GPU memory to train):

>>> import rmm
>>> rmm.reinitialize(managed_memory=True, pool_allocator=False)
0
>>> import numpy as np
>>> a = np.random.random((1000000, 2100))
>>> import cupy as cp
>>> cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
>>> from cuml.decomposition import PCA
>>> pca = PCA()
>>> pca.fit(a)
[W] [10:46:02.336113] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
PCA(copy=True, handle=<cuml.common.handle.Handle object at 0x7f21bafe3ed0>, iterated_power=15, n_components=1, random_state=None, svd_solver='auto', tol=1e-07, verbose=2, whiten=False, output_type='numpy')

As a result of the behavior outlined above, I strongly suspect there are integers being used to represent array sizes that should be promoted to size_t. This should also affect TruncatedSVD.
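In the meantime, a hypothetical user-side guard can flag inputs likely to trip this (exceeds_int32 is purely illustrative, not a cuML API):

import numpy as np

INT32_MAX = 2**31 - 1

def exceeds_int32(shape):
    # Hypothetical helper: True when the total element count would
    # overflow a signed 32-bit index (computed in 64-bit to be safe).
    return int(np.prod(shape, dtype=np.int64)) > INT32_MAX

print(exceeds_int32((1000000, 2500)))  # True
print(exceeds_int32((1000000, 2100)))  # False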

Keeping a list to track the places where this occurs so far:

cjnolet commented 4 years ago

Logistic regression throws an error before the algorithm even executes, and it appears to come from CuPy:

import numpy as np
a = np.random.random((1000000, 2100))  # 2.1e9 elements, < 2^31
from cuml.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(a, a[:, 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 56, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "cuml/linear_model/logistic_regression.pyx", line 262, in cuml.linear_model.logistic_regression.LogisticRegression.fit
  File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cupy/manipulation/add_remove.py", line 74, in unique
    mask = cupy.empty(aux.shape, dtype=cupy.bool_)
  File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cupy/creation/basic.py", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/core/core.pyx", line 137, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 569, in cupy.cuda.memory.alloc
  File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/rmm/rmm.py", line 347, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
  File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /conda/conda-bld/librmm_1591196517113/work/include/rmm/mr/device/managed_memory_resource.hpp70: cudaErrorIllegalAddress an illegal memory access was encountered

cjnolet commented 4 years ago

I executed the 1M x 5k training and inference on several more models:

Failed:

Did not fail:

cjnolet commented 4 years ago

I believe the fix for most of these is to convert any and all variables representing the number of array elements (e.g., n_rows * n_cols) to size_t.
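To illustrate the failure mode, here is a minimal sketch using NumPy's fixed-width scalars to mimic the C arithmetic (this is not cuML code):

import numpy as np

n_rows, n_cols = 1000000, 2500

# 64-bit length (what a size_t computation would give on LP64 platforms):
print(np.int64(n_rows) * np.int64(n_cols))      # 2500000000

# Signed 32-bit length (what an `int` computation gives): wraps negative
with np.errstate(over='ignore'):                # silence NumPy's overflow warning
    print(np.int32(n_rows) * np.int32(n_cols))  # -1794967296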

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

cjnolet commented 3 years ago

Testing these on 21.08 w/ 1M x 2.1k (float32):

cjnolet commented 3 years ago

I really thought PCA/TSVD had been fixed in 21.08, but the issue apparently still lingers:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_39624/3959784562.py in <module>
----> 1 PCA(n_components=50).fit_transform(x)

~/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner(*args, **kwargs)
    593 
    594                 # Call the function
--> 595                 ret_val = func(*args, **kwargs)
    596 
    597             return cm.process_return(ret_val)

cuml/decomposition/pca.pyx in cuml.decomposition.pca.PCA.fit_transform()

~/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
    407                                 target_val=target_val)
    408 
--> 409                 return func(*args, **kwargs)
    410 
    411         @wraps(func)

cuml/decomposition/pca.pyx in cuml.decomposition.pca.PCA.fit()

RuntimeError: CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/mr/buffer_base.hpp line=68: call='cudaStreamSynchronize(stream_)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
Obtained 64 stack frames
#0 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fc6049839ab]
#1 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5a) [0x7fc60498413a]
#2 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitIfEEvRKN4raft8handle_tEPT_S6_S6_S6_S6_S6_S6_RKNS_17paramsPCATemplateINS_6solverEEEP11CUstream_st+0x260) [0x7fc604d99a80]
#3 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitERN4raft8handle_tEPfS3_S3_S3_S3_S3_S3_RKNS_17paramsPCATemplateINS_6solverEEE+0x1b) [0x7fc604d8efcb]
#4 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x31836) [0x7fc64476c836]
#5 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#6 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#7 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0xc5c) [0x55effd6bc59c]
#8 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyEval_EvalCodeEx+0x3c) [0x55effd6125fc]
#9 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2aeba) [0x7fc644765eba]
#10 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2bb7f) [0x7fc644766b7f]
#11 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2cd3a) [0x7fc644767d3a]
#12 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#13 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#14 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0xc5c) [0x55effd6bc59c]
#15 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x693) [0x55effd6dc223]
#16 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#17 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x48a2) [0x55effd76a602]
#18 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x273) [0x55effd6bbbb3]
#19 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1d751e) [0x55effd77a51e]
#20 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9) [0x55effd6ec959]
#21 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x44f8) [0x55effd76a258]
#22 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#23 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#24 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#25 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#26 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#27 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDescr_FastCallKeywords+0xdb) [0x55effd72246b]
#28 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1801ae) [0x55effd7231ae]
#29 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#30 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#31 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x3f5) [0x55effd766155]
#32 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#33 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#34 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#35 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x273) [0x55effd6bbbb3]
#36 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyObject_FastCallDict+0x5be) [0x55effd6bd4ae]
#37 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x12f041) [0x55effd6d2041]
#38 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#39 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#40 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x79e) [0x55effd6bc0de]
#41 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x693) [0x55effd6dc223]
#42 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#43 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x145c) [0x55effd7671bc]
#44 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#45 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#46 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#47 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#48 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#49 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#50 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#51 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#52 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#53 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/lib-dynload/_asyncio.cpython-37m-x86_64-linux-gnu.so(+0xadb9) [0x7fcc80383db9]
#54 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyObject_FastCallKeywords+0x15c) [0x55effd72288c]
#55 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x212801) [0x55effd7b5801]
#56 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDef_RawFastCallDict+0x193) [0x55effd6ec663]
#57 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x55ab) [0x55effd76b30b]
#58 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#59 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#60 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#61 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#62 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#63 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]

Definitely still seems to be an overflow somewhere. I'm guessing in a 32-bit int.

wphicks commented 2 years ago

Related: https://github.com/rapidsai/cuml/issues/4105