rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Transient illegal memory access when performing umap.transform() #5690

Open jiho opened 9 months ago

jiho commented 9 months ago

Describe the bug

I am fitting many (hundreds of) UMAP models to various subsets of several datasets and then transforming the full dataset (2M points) into the reduced space. When transforming, I sometimes get the following error:

Traceback (most recent call last):                                                                                                                                                                                                     
  File "/remote/complex/home/jiho/gdrive/shared/Proj_AtlantECO/UVP_morphospace/morphopart/./explore_params.py", line 101, in <module>                                                                                                  
    f_all_reduced = transform_features(f_all, dimred, params[step_params], log)                                                                                                                                                        
  File "/remote/complex/home/jiho/gdrive/shared/Proj_AtlantECO/UVP_morphospace/morphopart/morphopart.py", line 332, in transform_features                                                                                              
    f_all_reduced = [dimred['dim_reducer'].transform(chunk) for chunk in f_all_scaled]                                                                                                                                                 
  File "/remote/complex/home/jiho/gdrive/shared/Proj_AtlantECO/UVP_morphospace/morphopart/morphopart.py", line 332, in <listcomp>                                                                                                      
    f_all_reduced = [dimred['dim_reducer'].transform(chunk) for chunk in f_all_scaled]                                                                                                                                                 
  File "/home/jiho/.miniconda3/envs/rapids_12/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper                                                                                                     
    ret = func(*args, **kwargs)                                                                                                                                                                                                        
  File "/home/jiho/.miniconda3/envs/rapids_12/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch                                                                                                    
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)                                                                                                                                                                    
  File "/home/jiho/.miniconda3/envs/rapids_12/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper                                                                                                     
    return func(*args, **kwargs)                                                                                                                                                                                                       
  File "base.pyx", line 674, in cuml.internals.base.UniversalBase.dispatch_func                                                                                                                                                        
  File "umap.pyx", line 771, in cuml.manifold.umap.UMAP.transform                                                                                                                                                                      
  File "handle.pyx", line 117, in pylibraft.common.handle.DeviceResources.sync                                                                                                                                                         
RuntimeError: CUDA error encountered at: file=/home/jiho/.miniconda3/envs/rapids_12/include/raft/core/interruptible.hpp line=301:                                                                                                      
CUDA call='cudaEventDestroy(event_)' at file=/home/jiho/.miniconda3/envs/rapids_12/include/raft/core/resource/cuda_event.hpp line=33 failed with an illegal memory access was encountered                                              
CUDA call='cudaEventDestroy(event_)' at file=/home/jiho/.miniconda3/envs/rapids_12/include/raft/core/resource/cuda_event.hpp line=33 failed with an illegal memory access was encountered                                              
Traceback (most recent call last):                                                                                                                                                                                                     
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload                                                                                                                                    
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status                                                                                                                                     
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered                                                                                                                    
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                                     
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload                                                                                                                                    
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status                                                                                                                                     

and then about a dozen repetitions of:

cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered                                                                                                                    
Traceback (most recent call last):                                                                                                                                                                                                     
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload                                                                                                                                    
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status 

The occurrence of these errors seems quite random: sometimes I get them after 50 fit+transform cycles, sometimes after hundreds. They do not seem related to the nature or content of the data: after a relaunch, the run completes fine on a dataset combination it had just failed on.
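CUDA reports kernel faults asynchronously, so the cudaEventDestroy frame in the traceback above may simply be the first API call made after the fault, not the place where the illegal access happened. One possible way to narrow this down is to re-run with synchronous kernel launches. This is a minimal sketch, assuming the script can tolerate the slowdown; CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable, and setting it at the very top of the script is only one way to set it.

```python
import os

# Must run before any CUDA context exists, i.e. before importing cuml or cupy.
# With synchronous launches, the Python traceback should point closer to the
# kernel that actually performed the illegal access.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import cuml  # import GPU libraries only after the variable is set
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Exporting the variable in the shell before launching Python has the same effect and avoids any import-order concerns.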

The memory usage reported by nvtop or nvidia-smi is always reasonable (2 to 10 GB out of 48).

Steps/Code to reproduce bug

Since this looked memory-related, I tried performing the transformation on smaller chunks of data, so the code looks like:

import cuml
import numpy as np
...
model = cuml.UMAP(n_components=4)
model.fit(data_subset)
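# split the full dataset row-wise into 10 equal chunks
# (np.vsplit requires the number of rows to be divisible by 10)
chunks = np.vsplit(full_data, 10)
# transform each chunk separately, then stack the results back together
full_data_transformed = [model.transform(chunk) for chunk in chunks]
full_data_transformed = np.vstack(full_data_transformed)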

After each fit+transform combo, I also added rmm.reinitialize(), which seemed to help (errors became less frequent), but the improvement may have been due to something external.
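The chunked transform described above can be sketched as a small reusable helper. This is only an illustration, not the code from the repository: transform_in_chunks and the _ScaleModel stand-in are hypothetical names, and the sketch assumes model.transform returns host (NumPy) arrays, as cuML normally does when given NumPy input. It uses np.array_split instead of np.vsplit so that row counts not divisible by the number of chunks are accepted.

```python
import numpy as np


def transform_in_chunks(model, data, n_chunks=10):
    """Apply model.transform to row-wise chunks of data and stack the results.

    Assumes model.transform returns NumPy arrays (hypothetical helper).
    """
    # np.array_split, unlike np.vsplit, tolerates uneven chunk sizes
    chunks = np.array_split(data, n_chunks, axis=0)
    return np.vstack([model.transform(chunk) for chunk in chunks])


class _ScaleModel:
    """Hypothetical stand-in for a fitted cuml.UMAP(n_components=4)."""

    def transform(self, X):
        # keep the first 4 columns so the output has 4 "components"
        return X[:, :4] * 2.0


demo = np.arange(23 * 6, dtype=np.float32).reshape(23, 6)
out = transform_in_chunks(_ScaleModel(), demo, n_chunks=10)
print(out.shape)
```

With a real fitted model, the call would be transform_in_chunks(model, full_data); any per-iteration cleanup such as rmm.reinitialize() would go after the helper returns.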

The full code is at https://github.com/jiho/morphopart: the functions are in morphopart.py and the loop is in explore_params.py.

Expected behavior

I would expect all transformations to run the same, without error.

Environment details

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
alsa-lib                  1.2.10               hd590300_0    conda-forge
asttokens                 2.0.5              pyhd3eb1b0_0  
attr                      2.5.1                h166bdaf_1    conda-forge
aws-c-auth                0.7.4                hc8144f4_1    conda-forge
aws-c-cal                 0.6.2                h09139f6_2    conda-forge
aws-c-common              0.9.3                hd590300_0    conda-forge
aws-c-compression         0.2.17               h184a658_3    conda-forge
aws-c-event-stream        0.3.2                hd6ebb48_1    conda-forge
aws-c-http                0.7.13               hc690213_1    conda-forge
aws-c-io                  0.13.32              h161b759_6    conda-forge
aws-c-mqtt                0.9.6                h32970c0_2    conda-forge
aws-c-s3                  0.3.17               hb5e3142_3    conda-forge
aws-c-sdkutils            0.1.12               h184a658_2    conda-forge
aws-checksums             0.1.17               h184a658_2    conda-forge
aws-crt-cpp               0.23.1               h94c364a_5    conda-forge
aws-sdk-cpp               1.11.156             h6600424_3    conda-forge
backcall                  0.2.0              pyhd3eb1b0_0  
blas                      1.0                    openblas    conda-forge
bokeh                     3.3.1              pyhd8ed1ab_0    conda-forge
bottleneck                1.3.5           py310ha9d4c09_0  
brotli                    1.0.9                he6710b0_2  
brotli-python             1.1.0           py310hc6cd4ac_1    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
c-ares                    1.22.1               hd590300_0    conda-forge
ca-certificates           2023.11.17           hbcca054_0    conda-forge
cachetools                5.3.2              pyhd8ed1ab_0    conda-forge
cairo                     1.18.0               h3faef2a_0    conda-forge
click                     8.1.7           unix_pyh707e725_0    conda-forge
cloudpickle               3.0.0              pyhd8ed1ab_0    conda-forge
contourpy                 1.2.0           py310hd41b1e2_0    conda-forge
cramjam                   2.6.2           py310h52d8a92_0  
cuda-cccl_linux-64        12.0.90              ha770c72_1    conda-forge
cuda-cudart               12.0.107             h59595ed_6    conda-forge
cuda-cudart-dev           12.0.107             h59595ed_6    conda-forge
cuda-cudart-dev_linux-64  12.0.107             h59595ed_6    conda-forge
cuda-cudart-static        12.0.107             h59595ed_6    conda-forge
cuda-cudart-static_linux-64 12.0.107             h59595ed_6    conda-forge
cuda-cudart_linux-64      12.0.107             h59595ed_6    conda-forge
cuda-nvcc-dev_linux-64    12.0.76              ha770c72_1    conda-forge
cuda-nvcc-impl            12.0.76              h59595ed_1    conda-forge
cuda-nvcc-tools           12.0.76              h59595ed_1    conda-forge
cuda-nvrtc                12.0.76              h59595ed_1    conda-forge
cuda-nvtx                 12.0.76              hcb278e6_0    conda-forge
cuda-profiler-api         12.0.76              ha770c72_0    conda-forge
cuda-python               12.0.0          py310hfb71131_3    conda-forge
cuda-version              12.0                 hffde075_2    conda-forge
cudf                      23.10.02        cuda12_py310_231116_gece2b2c524_0    rapidsai
cuml                      23.10.00        cuda12_py310_231011_g623996072_0    rapidsai
cupy                      12.2.0          py310hfc31588_4    conda-forge
cycler                    0.11.0             pyhd3eb1b0_0  
cython                    3.0.0           py310h5eee18b_0  
cytoolz                   0.12.2          py310h2372a71_1    conda-forge
daal4py                   2023.1.1        py310h3c18c91_0  
dal                       2023.1.1         hdb19cb5_48680  
dask                      2023.9.2           pyhd8ed1ab_0    conda-forge
dask-core                 2023.9.2           pyhd8ed1ab_0    conda-forge
dask-cuda                 23.10.00        py310_231011_gdc811d3_0    rapidsai
dask-cudf                 23.10.02        cuda12_py310_231116_gece2b2c524_0    rapidsai
dbcv                      0.1.0                    pypi_0    pypi
dbus                      1.13.18              hb2f20db_0  
decorator                 5.1.1              pyhd3eb1b0_0  
distributed               2023.9.2           pyhd8ed1ab_0    conda-forge
dlpack                    0.5                  h9c3ff4c_0    conda-forge
exceptiongroup            1.0.4           py310h06a4308_0  
executing                 0.8.3              pyhd3eb1b0_0  
expat                     2.5.0                h6a678d5_0  
fastparquet               2023.8.0        py310ha9d4c09_0  
fastrlock                 0.8.2           py310hc6cd4ac_1    conda-forge
fmt                       9.1.0                h924138e_0    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hd3eb1b0_0  
font-ttf-inconsolata      2.001                hcb22688_0  
font-ttf-source-code-pro  2.030                hd3eb1b0_0  
font-ttf-ubuntu           0.83                 h8b1ccd4_0  
fontconfig                2.14.2               h14ed4e7_0    conda-forge
fonts-anaconda            1                    h8fa9717_0  
fonts-conda-ecosystem     1                    hd3eb1b0_0  
fonttools                 4.25.0             pyhd3eb1b0_0  
freetype                  2.12.1               h267a509_2    conda-forge
fsspec                    2023.10.0          pyhca7485f_0    conda-forge
gettext                   0.21.1               h27087fc_0    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
glib                      2.78.1               hfc55251_1    conda-forge
glib-tools                2.78.1               hfc55251_1    conda-forge
glog                      0.6.0                h6f12383_0    conda-forge
gmock                     1.14.0               ha770c72_1    conda-forge
graphite2                 1.3.14               h295c915_1  
gst-plugins-base          1.22.7               h8e1006c_0    conda-forge
gstreamer                 1.22.7               h98fc4e7_0    conda-forge
gtest                     1.14.0               h00ab1b0_1    conda-forge
harfbuzz                  8.3.0                h3d44ed6_0    conda-forge
hdbscan                   0.8.33          py310h1f7b6fc_4    conda-forge
icu                       73.2                 h59595ed_0    conda-forge
importlib-metadata        6.8.0              pyha770c72_0    conda-forge
importlib_metadata        6.8.0                hd8ed1ab_0    conda-forge
ipdb                      0.13.13            pyhd8ed1ab_0    conda-forge
ipython                   8.15.0          py310h06a4308_0  
jedi                      0.18.1          py310h06a4308_1  
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
joblib                    1.3.2              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.4           py310h6a678d5_0  
krb5                      1.21.2               h659d440_0    conda-forge
lame                      3.100                h7b6447c_0  
lcms2                     2.15                 hb7c19ff_3    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libabseil                 20230802.1      cxx17_h59595ed_0    conda-forge
libarrow                  12.0.1          h1935d02_14_cpu    conda-forge
libblas                   3.9.0           20_linux64_openblas    conda-forge
libbrotlicommon           1.1.0                hd590300_1    conda-forge
libbrotlidec              1.1.0                hd590300_1    conda-forge
libbrotlienc              1.1.0                hd590300_1    conda-forge
libcap                    2.69                 h0f662aa_0    conda-forge
libcblas                  3.9.0           20_linux64_openblas    conda-forge
libclang                  15.0.7          default_h7634d5b_3    conda-forge
libclang13                15.0.7          default_h9986a30_3    conda-forge
libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
libcublas                 12.0.1.189           hcb278e6_2    conda-forge
libcublas-dev             12.0.1.189           hcb278e6_2    conda-forge
libcudf                   23.10.02        cuda12_231116_gece2b2c524_0    rapidsai
libcufft                  11.0.0.21            hcb278e6_1    conda-forge
libcufile                 1.5.0.59             hcb278e6_0    conda-forge
libcufile-dev             1.5.0.59             hcb278e6_0    conda-forge
libcuml                   23.10.00        cuda12_231011_g623996072_0    rapidsai
libcumlprims              23.10.00        cuda12_231011_ge818397_0    nvidia
libcups                   2.3.3                h4637d8d_4    conda-forge
libcurand                 10.3.1.50            hcb278e6_0    conda-forge
libcurand-dev             10.3.1.50            hcb278e6_0    conda-forge
libcurl                   8.4.0                hca28451_0    conda-forge
libcusolver               11.4.2.57            hcb278e6_1    conda-forge
libcusolver-dev           11.4.2.57            hcb278e6_1    conda-forge
libcusparse               12.0.0.76            hcb278e6_1    conda-forge
libcusparse-dev           12.0.0.76            hcb278e6_1    conda-forge
libdeflate                1.19                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.12               hf998b51_1    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libflac                   1.4.3                h59595ed_0    conda-forge
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgcrypt                 1.10.2               hd590300_0    conda-forge
libgfortran-ng            13.2.0               h69a702a_3    conda-forge
libgfortran5              13.2.0               ha4646dd_3    conda-forge
libglib                   2.78.1               h783c2da_1    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
libgoogle-cloud           2.12.0               h8d7e28b_2    conda-forge
libgpg-error              1.47                 h71f35ed_0    conda-forge
libgrpc                   1.57.0               ha4d0f93_2    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libjpeg-turbo             3.0.0                hd590300_1    conda-forge
libkvikio                 23.10.00        cuda12_231011_g5ea0525_0    rapidsai
liblapack                 3.9.0           20_linux64_openblas    conda-forge
libllvm14                 14.0.6               hcd5def8_4    conda-forge
libllvm15                 15.0.7               h5cf9203_3    conda-forge
libnghttp2                1.58.0               h47da74e_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libnuma                   2.0.16               h0b41bf4_1    conda-forge
libnvjitlink              12.0.76              hcb278e6_1    conda-forge
libogg                    1.3.5                h27cfd23_1  
libopenblas               0.3.25          pthreads_h413a1c8_0    conda-forge
libopus                   1.3.1                h7b6447c_0  
libpng                    1.6.39               h753d276_0    conda-forge
libpq                     16.1                 hfc447b1_0    conda-forge
libprotobuf               4.23.4               hf27288f_6    conda-forge
libraft                   23.10.00        cuda12_231011_gafdddfb3_0    rapidsai
libraft-headers           23.10.00        cuda12_231011_gafdddfb3_0    rapidsai
libraft-headers-only      23.10.00        cuda12_231011_gafdddfb3_0    rapidsai
librmm                    23.10.00        cuda12_231011_gf8ac6f8e_0    rapidsai
libsndfile                1.2.2                hc60ed4a_1    conda-forge
libsqlite                 3.44.2               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_3    conda-forge
libsystemd0               254                  h3516f8a_0    conda-forge
libthrift                 0.19.0               hb90f79a_1    conda-forge
libtiff                   4.6.0                ha9c0a0a_2    conda-forge
libutf8proc               2.8.0                h166bdaf_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libvorbis                 1.3.7                h7b6447c_0  
libwebp-base              1.3.2                hd590300_0    conda-forge
libxcb                    1.15                 h0b41bf4_0    conda-forge
libxkbcommon              1.6.0                h5d7e998_0    conda-forge
libxml2                   2.11.6               h232c23b_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
llvmlite                  0.40.1          py310h1b8f574_0    conda-forge
locket                    1.0.0              pyhd8ed1ab_0    conda-forge
lz4                       4.3.2           py310h350c4a5_1    conda-forge
lz4-c                     1.9.4                hcb278e6_0    conda-forge
markdown-it-py            3.0.0              pyhd8ed1ab_0    conda-forge
markupsafe                2.1.3           py310h2372a71_1    conda-forge
matplotlib                3.8.0           py310h06a4308_0  
matplotlib-base           3.8.0           py310h1128e8f_0  
matplotlib-inline         0.1.6           py310h06a4308_0  
mdurl                     0.1.0              pyhd8ed1ab_0    conda-forge
mpg123                    1.32.3               h59595ed_0    conda-forge
mpi                       1.0                       mpich    conda-forge
mpich                     4.1.1                hbae89fd_0  
mpmath                    1.3.0                    pypi_0    pypi
msgpack-python            1.0.7           py310hd41b1e2_0    conda-forge
munkres                   1.1.4                      py_0  
mysql-common              8.0.33               hf1915f5_6    conda-forge
mysql-libs                8.0.33               hca2cd23_6    conda-forge
nccl                      2.19.4.1             h3a97aeb_0    conda-forge
ncurses                   6.4                  h59595ed_2    conda-forge
nspr                      4.35                 h6a678d5_0  
nss                       3.95                 h1d7d5a4_0    conda-forge
numba                     0.57.1          py310h0f6aa51_0    conda-forge
numexpr                   2.8.7           py310h286c3b5_0  
numpy                     1.24.4          py310ha4c1d20_0    conda-forge
nvcomp                    2.6.1                h10b603f_3    conda-forge
nvtx                      0.2.8           py310h2372a71_1    conda-forge
openblas                  0.3.25          pthreads_h7a3da1a_0    conda-forge
openjpeg                  2.5.0                h488ebb8_3    conda-forge
openssl                   3.2.0                hd590300_0    conda-forge
orc                       1.9.0                h52d3b3c_2    conda-forge
packaging                 23.2               pyhd8ed1ab_0    conda-forge
pandas                    1.5.3           py310h1128e8f_0  
parso                     0.8.3              pyhd3eb1b0_0  
partd                     1.4.1              pyhd8ed1ab_0    conda-forge
pcre2                     10.42                hebb0a14_0  
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    10.1.0          py310h01dd4db_0    conda-forge
pip                       23.3.1             pyhd8ed1ab_0    conda-forge
pixman                    0.42.2               h59595ed_0    conda-forge
ply                       3.11            py310h06a4308_0  
prompt-toolkit            3.0.36          py310h06a4308_0  
protobuf                  4.23.4          py310h620c231_3    conda-forge
psutil                    5.9.5           py310h2372a71_1    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3eb1b0_2  
pulseaudio-client         16.1                 hb77b528_5    conda-forge
pure_eval                 0.2.2              pyhd3eb1b0_0  
pyarrow                   12.0.1          py310hf9e7431_14_cpu    conda-forge
pygments                  2.17.2             pyhd8ed1ab_0    conda-forge
pylibraft                 23.10.00        cuda12_py310_231011_gafdddfb3_0    rapidsai
pynndescent               0.5.10          py310h06a4308_0  
pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9           py310h06a4308_0  
pyqt                      5.15.10         py310h6a678d5_0  
pyqt5-sip                 12.13.0         py310h5eee18b_0  
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.10.13         hd12c33a_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.10                    4_cp310    conda-forge
pytz                      2023.3.post1       pyhd8ed1ab_0    conda-forge
pyyaml                    6.0.1           py310h2372a71_1    conda-forge
qt-main                   5.15.8              h82b777d_17    conda-forge
raft-dask                 23.10.00        cuda12_py310_231011_gafdddfb3_0    rapidsai
rdma-core                 28.9                 h59595ed_1    conda-forge
re2                       2023.03.02           h8c504da_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
rich                      13.7.0             pyhd8ed1ab_0    conda-forge
rmm                       23.10.00        cuda12_py310_231011_gf8ac6f8e_0    rapidsai
s2n                       1.3.54               h06160fa_0    conda-forge
scikit-learn              1.3.0           py310h1128e8f_0  
scikit-learn-intelex      2023.1.1        py310h06a4308_0  
scipy                     1.11.4          py310hb13e2d6_0    conda-forge
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
sip                       6.7.12          py310h6a678d5_0  
six                       1.16.0             pyh6c4a22f_0    conda-forge
snappy                    1.1.10               h9fff704_0    conda-forge
sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
spdlog                    1.11.0               h9b3ece8_1    conda-forge
stack_data                0.2.0              pyhd3eb1b0_0  
tbb                       2021.8.0             hdb19cb5_0  
tblib                     2.0.0              pyhd8ed1ab_0    conda-forge
threadpoolctl             2.2.0              pyh0d69192_0  
tk                        8.6.13          noxft_h4845f30_101    conda-forge
toml                      0.10.2             pyhd3eb1b0_0  
tomli                     2.0.1           py310h06a4308_0  
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
tornado                   6.3.3           py310h2372a71_1    conda-forge
tqdm                      4.65.0          py310h2f386ee_0  
traitlets                 5.7.1           py310h06a4308_0  
treelite                  3.9.1           py310h4a6579d_0    conda-forge
treelite-runtime          3.9.1                    pypi_0    pypi
typing_extensions         4.8.0              pyha770c72_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
ucx                       1.14.1               h195a15c_5    conda-forge
ucx-proc                  1.0.0                       gpu    rapidsai
ucx-py                    0.34.00         py310_231011_g17dceab_0    rapidsai
umap-learn                0.5.3           py310h06a4308_0  
urllib3                   2.1.0              pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.5              pyhd3eb1b0_0  
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xcb-util                  0.4.0                hd590300_1    conda-forge
xcb-util-image            0.4.0                h8ee46fc_1    conda-forge
xcb-util-keysyms          0.4.0                h8ee46fc_1    conda-forge
xcb-util-renderutil       0.3.9                hd590300_1    conda-forge
xcb-util-wm               0.4.1                h8ee46fc_1    conda-forge
xkeyboard-config          2.40                 hd590300_0    conda-forge
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.1.1                hd590300_0    conda-forge
xorg-libsm                1.2.4                h7391055_0    conda-forge
xorg-libx11               1.8.7                h8ee46fc_0    conda-forge
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h0b41bf4_2    conda-forge
xorg-libxrender           0.9.11               hd590300_0    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h0b41bf4_1003    conda-forge
xorg-xf86vidmodeproto     2.3.1             h7f98852_1002    conda-forge
xorg-xproto               7.0.31            h27cfd23_1007  
xyzservices               2023.10.1          pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zict                      3.0.0              pyhd8ed1ab_0    conda-forge
zipp                      3.17.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

dantegd commented 9 months ago

Thanks for the issue @jiho. This does not seem like an OOM, which is consistent with what you see in nvidia-smi, but potentially a bug somewhere in the code. Thanks for all the details and the reproducer, we will look into it ASAP!

aktgpt commented 5 months ago

Any update on this issue? I'm also running into the same problem.