Open jiho opened 9 months ago
Thanks for the issue @jiho. This does not look like an OOM, which is consistent with what you see in nvidia-smi, but rather a potential bug somewhere in the code. Thanks for all the details and the reproducer; we will look into it ASAP!
Any update on this issue? I'm also running into the same problem.
Describe the bug
I am fitting many (hundreds of) UMAP models to various subsets of several datasets and then transforming the full dataset (2M points) into the reduced space. When transforming, I sometimes get the following error:
and then about a dozen repetitions of:
The occurrence of these errors seems quite random: sometimes I get them after 50 fit+transform runs, sometimes after hundreds. It does not seem related to the nature or content of the data (after a re-launch, it runs fine on a dataset combination it had just failed on).
The memory usage reported by nvtop or nvidia-smi is always reasonable (2 to 10 GB out of 48).
Steps/Code to reproduce bug
Since this is memory related I tried to perform the transformation in smaller chunks of data so the code looks like
After each fit+transform combo, I also added rmm.reinitialize() (which seemed to help: less frequent errors; but it may have been something external too).
The full code is at https://github.com/jiho/morphopart; functions are in morphopart.py, and the loop is in explore_params.py.
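For illustration only (this is not the code from the repository above), the chunked transformation described in these steps could be sketched roughly as follows. The helper name transform_in_chunks and the chunk_size default are hypothetical; model stands for any fitted object exposing a transform method, for example a fitted cuml UMAP instance, and the sketch assumes transform returns something NumPy can wrap (NumPy input giving NumPy output).

```python
import numpy as np


def transform_in_chunks(model, X, chunk_size=100_000):
    """Apply model.transform to X in consecutive row chunks and stack the results.

    Hypothetical helper: `model` is any fitted object with a .transform method.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    parts = []
    for start in range(0, X.shape[0], chunk_size):
        # Slice a contiguous block of rows; the last block may be shorter.
        chunk = X[start:start + chunk_size]
        # Assumption: transform returns host (NumPy-compatible) output.
        # GPU-resident output would need an explicit copy to host instead.
        parts.append(np.asarray(model.transform(chunk)))
    # Rows come back in the same order as in X.
    return np.vstack(parts)
```

Keeping each call to transform small bounds the size of the temporary buffers per call, which is why chunking is a natural first thing to try for a memory-related error, even though the reported usage here stays far below the 48 GB available.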
Expected behavior
I would expect all transformations to run the same, without error.
Environment details