rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Increasing memory usage leading to OOM when running UMAP in loop #4068

Open ietz opened 3 years ago

ietz commented 3 years ago

Describe the bug
When I fit multiple UMAP models one after another, the GPU memory usage increases with most iterations, even though I do not keep any references to prior models or their results. At some point, I get an OOM error. As I do not keep any references, I would expect any data to be garbage collected to prevent the OOM from happening.

Steps/Code to reproduce bug
Here is a link to my Jupyter notebook on Google Colab: https://colab.research.google.com/drive/1mZew58DdWdI2cBuSRW5F3uUD7lXjHMUk

The issue occurs in this code segment (imports added here for completeness; data and knn_graph are prepared earlier in the notebook):

import itertools
import cuml

for i in itertools.count():
    cuml.UMAP(n_neighbors=15) \
        .fit(data, knn_graph=knn_graph)

Looking at the GPU memory usage over time, I can see that the model is not always garbage collected between iterations. The data accumulates over a few iterations and is then deleted every so often, but not all of it. At some point this seems to always lead to an out of memory error. With the input data shape I chose for the Colab demo this took a lot longer than I expected (approx. 20 minutes, 641 iterations), but I think plotting the memory usage over time shows the issue quite nicely:

[Figure: GPU memory usage over time]
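
For reference, a memory trace like the one plotted above could be collected with something along these lines; the notebook's exact measurement method isn't shown here, so the use of pynvml and the helper name are assumptions:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

def used_gpu_memory_gb():
    # Current device memory usage in GB, as reported by NVML (same source as nvidia-smi)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e9

# e.g. append used_gpu_memory_gb() to a list once per loop iteration and plot it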

With larger datasets such as those that I used when I originally encountered this issue, the OOM happens after way fewer iterations, maybe 10. In the image you can see small and large "teeth". I think when I originally encountered this issue, I had the OOM on one of the small teeth, even before the first large drop in memory usage.

Expected behavior
I would expect the memory usage not to increase to the point of an OOM error.

Environment details

cjnolet commented 3 years ago

Hi @ietz, thank you for filing an issue for this. I ran your script on my V100 (RAPIDS 21.08 nightly packages) and was able to reproduce the trending sawtooth pattern that you've pointed out. I ran the loop for about 25 minutes while running watch -n 0.1 nvidia-smi in a separate window and noticed it peaked around 12-14 GB but didn't go any higher.

Adding a gc.collect() after each loop iteration seemed to make it consistently peak around 4 GB and revert to the same value (± 0.1 GB) after the loop. If you are able, can you try adding the gc.collect() after each iteration and let us know if it fixes the problem?
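
Concretely, the suggested change amounts to something like this minimal sketch (data and knn_graph as prepared in the notebook):

import gc
import itertools
import cuml

for i in itertools.count():
    cuml.UMAP(n_neighbors=15) \
        .fit(data, knn_graph=knn_graph)
    gc.collect()  # free the now-unreferenced model and its device allocations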

ietz commented 3 years ago

Hey @cjnolet, and thank you for your response.

The gc.collect() call after every iteration does indeed resolve my issue, and my parameter sweep has now finished without any further complications. I was not aware of this command and had read in some other issue here that just using del to delete the reference should be enough. Thank you!

If you still want to reproduce the OOM without gc.collect(), you could increase the size of the data array. With a shape of (1_000_000, 500) I got an OOM after just 10 iterations, i.e. less than one minute of execution time. With that shape, the memory usage after successive iterations was 4.4, 6.3, 8.2, 10.1, 12.0, 6.3, 8.2, 10.1, 12.0, and 13.8 GB, followed by the OOM.
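
In case it helps anyone reproducing this, a larger input of that shape could be generated with something like the following (the notebook's actual data construction may differ; random CuPy values are an assumption here):

import cupy as cp

# Hypothetical stand-in for the notebook's input: random float32 data with the
# larger shape mentioned above (roughly 2 GB on the device).
data = cp.random.random((1_000_000, 500), dtype=cp.float32)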

In terms of results, it sadly seems that even with my parameter sweep I could not get outputs from cuML UMAP that are comparable to those of the umap-learn library, as ~¼ of the points are mapped to strange outlier positions far away from the main structure. I guess I'll just watch #3467 and try again once that is resolved.

cjnolet commented 3 years ago

As a result of your experience with this problem in RAPIDS, do you think it might be helpful if we added some documentation about the use of gc.collect()? If so, we can convert this issue over to a feature request.

ietz commented 3 years ago

Sure, I think some info about that might very well help someone, as long as they can find it. My problem was that I thought I had to look for some RAPIDS-specific solution, since the problem was about GPU memory. As it's just standard Python, a short note along the lines of "gc.collect() also works with RAPIDS" would probably have been enough in my case.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.