Open · cjnolet opened this issue 4 years ago
Logistic regression throws an error before the algorithm even executes, and it appears to come from CuPy:
import numpy as np
from cuml.linear_model import LogisticRegression

# 1M x 2.1k float64 input (~16.8 GB on the host)
a = np.random.random((1000000, 2100))

model = LogisticRegression()
model.fit(a, a[:, 0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 56, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "cuml/linear_model/logistic_regression.pyx", line 262, in cuml.linear_model.logistic_regression.LogisticRegression.fit
File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cupy/manipulation/add_remove.py", line 74, in unique
mask = cupy.empty(aux.shape, dtype=cupy.bool_)
File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/cupy/creation/basic.py", line 22, in empty
return cupy.ndarray(shape, dtype, order=order)
File "cupy/core/core.pyx", line 137, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 569, in cupy.cuda.memory.alloc
File "/raid/cnolet/miniconda3/envs/rapidgenomics2/lib/python3.7/site-packages/rmm/rmm.py", line 347, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
File "rmm/_lib/device_buffer.pyx", line 79, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /conda/conda-bld/librmm_1591196517113/work/include/rmm/mr/device/managed_memory_resource.hpp:70: cudaErrorIllegalAddress an illegal memory access was encountered
I executed the 1Mx5k training and inference on several more models:
Failed:
Did not fail:
I believe the solution for most of these should be to convert any and all variables representing the number of array elements (e.g., n_rows * n_cols) to size_t.
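To illustrate the suspected failure mode, here is a minimal sketch in NumPy (not cuML code; the int32 wraparound stands in for a 32-bit size variable in the C++ layer):

import numpy as np

# A 1Mx2500 input has 2.5 billion elements, more than int32 can represent.
n_rows, n_cols = 1_000_000, 2_500

total = n_rows * n_cols           # Python ints don't overflow: 2_500_000_000
print(total > 2**31 - 1)          # True: too large for a signed 32-bit int

# Emulating a 32-bit size variable (NumPy warns, then wraps modulo 2^32):
wrapped = np.int32(n_rows) * np.int32(n_cols)
print(int(wrapped))               # -1794967296, i.e. a negative "size"

Any allocation size or kernel launch geometry derived from a wrapped value like this would explain both the illegal memory accesses and the invalid configuration arguments.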
Testing these on 21.08 w/ 1Mx2.1k (float32):
I really thought PCA/TSVD had been fixed in 21.08, but it does appear the issue still lingers:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_39624/3959784562.py in <module>
----> 1 PCA(n_components=50).fit_transform(x)
~/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner(*args, **kwargs)
593
594 # Call the function
--> 595 ret_val = func(*args, **kwargs)
596
597 return cm.process_return(ret_val)
cuml/decomposition/pca.pyx in cuml.decomposition.pca.PCA.fit_transform()
~/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
407 target_val=target_val)
408
--> 409 return func(*args, **kwargs)
410
411 @wraps(func)
cuml/decomposition/pca.pyx in cuml.decomposition.pca.PCA.fit()
RuntimeError: CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/mr/buffer_base.hpp line=68: call='cudaStreamSynchronize(stream_)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
Obtained 64 stack frames
#0 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fc6049839ab]
#1 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5a) [0x7fc60498413a]
#2 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitIfEEvRKN4raft8handle_tEPT_S6_S6_S6_S6_S6_S6_RKNS_17paramsPCATemplateINS_6solverEEEP11CUstream_st+0x260) [0x7fc604d99a80]
#3 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6pcaFitERN4raft8handle_tEPfS3_S3_S3_S3_S3_S3_RKNS_17paramsPCATemplateINS_6solverEEE+0x1b) [0x7fc604d8efcb]
#4 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x31836) [0x7fc64476c836]
#5 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#6 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#7 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0xc5c) [0x55effd6bc59c]
#8 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyEval_EvalCodeEx+0x3c) [0x55effd6125fc]
#9 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2aeba) [0x7fc644765eba]
#10 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2bb7f) [0x7fc644766b7f]
#11 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/site-packages/cuml/decomposition/pca.cpython-37m-x86_64-linux-gnu.so(+0x2cd3a) [0x7fc644767d3a]
#12 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#13 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#14 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0xc5c) [0x55effd6bc59c]
#15 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x693) [0x55effd6dc223]
#16 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#17 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x48a2) [0x55effd76a602]
#18 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x273) [0x55effd6bbbb3]
#19 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1d751e) [0x55effd77a51e]
#20 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9) [0x55effd6ec959]
#21 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x44f8) [0x55effd76a258]
#22 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#23 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#24 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#25 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#26 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#27 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDescr_FastCallKeywords+0xdb) [0x55effd72246b]
#28 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1801ae) [0x55effd7231ae]
#29 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#30 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#31 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x3f5) [0x55effd766155]
#32 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#33 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#34 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#35 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x273) [0x55effd6bbbb3]
#36 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyObject_FastCallDict+0x5be) [0x55effd6bd4ae]
#37 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x12f041) [0x55effd6d2041]
#38 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(PyObject_Call+0x66) [0x55effd6bd7b6]
#39 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x1d0d) [0x55effd767a6d]
#40 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalCodeWithName+0x79e) [0x55effd6bc0de]
#41 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x693) [0x55effd6dc223]
#42 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#43 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x145c) [0x55effd7671bc]
#44 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#45 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#46 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#47 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#48 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#49 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#50 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#51 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x19f1) [0x55effd767751]
#52 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x17f0b4) [0x55effd7220b4]
#53 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/lib/python3.7/lib-dynload/_asyncio.cpython-37m-x86_64-linux-gnu.so(+0xadb9) [0x7fcc80383db9]
#54 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyObject_FastCallKeywords+0x15c) [0x55effd72288c]
#55 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x212801) [0x55effd7b5801]
#56 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyMethodDef_RawFastCallDict+0x193) [0x55effd6ec663]
#57 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x55ab) [0x55effd76b30b]
#58 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#59 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#60 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
#61 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyFunction_FastCallKeywords+0x187) [0x55effd6dbd17]
#62 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(+0x1800c5) [0x55effd7230c5]
#63 in /home/nfs/cnolet/miniconda3_2/envs/rapidgenomics_2108/bin/python(_PyEval_EvalFrameDefault+0x621) [0x55effd766381]
There definitely still seems to be an overflow somewhere; I'm guessing it's in a 32-bit int.
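For what it's worth, the arithmetic is consistent with that guess. A rough sanity check (not cuML code) for the 1Mx2.1k float32 case above:

n_rows, n_cols, itemsize = 1_000_000, 2_100, 4   # float32 is 4 bytes

elements = n_rows * n_cols        # 2_100_000_000
nbytes = elements * itemsize      # 8_400_000_000

print(elements <= 2**31 - 1)      # True: the element count barely fits in int32
print(nbytes <= 2**31 - 1)        # False: the size in bytes overflows int32

So even when the element count fits, any 32-bit byte count or intermediate product would still overflow.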
This was originally reproduced on the 0.15 nightly of cuML w/ CUDA 10.2; the example at the top of this issue is representative.
There are several estimators that fail with either "illegal memory access was encountered" or "invalid configuration argument", both of which seem to indicate an integer overflow might be occurring when representing the size of the underlying array (the exception traces above are examples).
In my original test, I used a size of 1Mx5000, which is >> 2^31 total elements. I also tried with 1Mx2500, which is also > 2^31 elements.
Just to rule out the possibility that this error only occurs when oversubscribing the GPU memory, I also tried with 1Mx2100, which is < 2^31 elements but still requires > 32GB of GPU memory to train; the LogisticRegression repro at the top of this issue uses exactly this shape.
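For reference, a quick check (a sketch) of those element counts against the signed 32-bit limit:

INT32_MAX = 2**31 - 1   # 2_147_483_647

for n_cols in (5000, 2500, 2100):
    elements = 1_000_000 * n_cols
    print(f"1Mx{n_cols}: {elements:,} elements, fits in int32: {elements <= INT32_MAX}")

# 1Mx5000: 5,000,000,000 -> False
# 1Mx2500: 2,500,000,000 -> False
# 1Mx2100: 2,100,000,000 -> True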
As a result of the behavior outlined above, I have a very strong suspicion there are integers being used to represent array sizes that should be promoted to size_t. This should also affect TruncatedSVD.
Keeping a list to track the places where this occurs so far: