rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] UMAP transformations consistently fail at modest scale (~1.5m rows, any number of features) #5376

Open willb opened 1 year ago

willb commented 1 year ago

Describe the bug

As of RAPIDS 23.02, UMAP transformations consistently fail with CUDA errors when projecting roughly 1.5 million rows or more. The same transformations worked in RAPIDS 21.10 and 21.12, so this is a regression.

Steps/Code to reproduce bug

  1. Create a temporary directory, say umap-crash.
  2. Download umap.ipynb to this directory.
  3. cd to umap-crash.

We'll then run this code with RAPIDS 21.12:

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v ${PWD}:/rapids/notebooks/host --workdir /rapids/notebooks/host rapidsai/rapidsai:21.12-cuda11.5-runtime-ubuntu20.04 jupyter nbconvert --execute umap.ipynb --to html --allow-errors --log-level 0

This will succeed. However, if we run the same notebook under a newer release (RAPIDS 23.04 in the command below), it will fail with an illegal-access or CURAND error:

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v ${PWD}:/rapids/notebooks/host --workdir /rapids/notebooks/host nvcr.io/nvidia/rapidsai/rapidsai-core:23.04-cuda11.8-runtime-ubuntu22.04-py3.10 jupyter nbconvert --execute umap.ipynb --to html --allow-errors --log-level 0

Expected behavior

It should be possible to transform a small (~0.125 GB) data set with UMAP and RAPIDS on any supported GPU.
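The notebook itself isn't inlined here, so as a rough guide, here is a minimal sketch of the failing pattern: fit a UMAP model on a small subset, then transform a ~1.5M-row frame. The 20-column float32 shape, the subset fraction, and the helper names are assumptions for illustration, not taken from the notebook, and running the cuML part requires a CUDA GPU with RAPIDS installed.

```python
import numpy as np


def synthetic_frame(n_rows: int, n_features: int = 20, seed: int = 42) -> np.ndarray:
    """Random float32 data roughly the size of the frame in this report."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_rows, n_features)).astype(np.float32)


def run_repro(n_rows: int = 1_500_000) -> None:
    # Deferred import: only needed on a machine with a GPU and RAPIDS.
    from cuml.manifold import UMAP

    df = synthetic_frame(n_rows)
    subset = df[: n_rows // 10]  # assumed subset fraction

    reducer = UMAP(n_neighbors=15, n_components=2)
    reducer.fit(subset)    # fitting on the small subset succeeds
    reducer.transform(df)  # per the report, this crashes at ~1.5M rows on 23.02+
```

Per the report, the fit itself is not the problem; the failure appears only when `transform` is called on the large frame.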

Environment details (please complete the following information):

Additional context

This may be related to #4984.

dantegd commented 1 year ago

Thanks for the issue @willb, I can confirm the reproduction. On a V100, it ran fine at the provided size:

(cuml0419) ➜  RAPIDS vim umap_repro.py
(cuml0419) ➜  RAPIDS python umap_repro.py
Total memory usage for `subset` is 0.008382 GB
Total memory usage for `df` is 0.1118 GB
projecting 1.0% of df...
projecting 5.0% of df...
projecting 10.0% of df...
projecting 25.0% of df...
projecting 50.0% of df...
projecting 75.0% of df...
projecting 100.0% of df...

But once I increased the size of the dataframe, I ran into the crash. We'll look into it and provide a fix as soon as possible.
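As a side note, the logged sizes are consistent with a float32 frame about 20 columns wide. A quick back-of-the-envelope check (the 1,500,000 and 112,500 row counts and the 20-column width are assumptions reverse-engineered from the log, since the script isn't shown):

```python
def frame_gib(n_rows: int, n_cols: int, itemsize: int = 4) -> float:
    """Size of a dense frame in GiB: rows * cols * bytes-per-element."""
    return n_rows * n_cols * itemsize / 2**30


# Assumed shapes, chosen so the results line up with the logged figures.
print(f"df:     {frame_gib(1_500_000, 20):.4f} GiB")  # ~0.1118, matching `df`
print(f"subset: {frame_gib(112_500, 20):.6f} GiB")    # ~0.008382, matching `subset`
```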

willb commented 1 year ago

Thank you!

beckernick commented 6 days ago

@dantegd is this still relevant?