rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Increasing memory usage leading to OOM when running UMAP in loop #4068

Open ietz opened 3 years ago

ietz commented 3 years ago

Describe the bug
When I fit multiple UMAP models one after another, the GPU memory usage increases with most iterations, even though I do not keep any references to prior models or their results. At some point, I get an OOM error. As I do not keep any references, I would expect any data to be garbage collected to prevent the OOM from happening.

Steps/Code to reproduce bug
Here is a link to my Jupyter notebook on Google Colab: https://colab.research.google.com/drive/1mZew58DdWdI2cBuSRW5F3uUD7lXjHMUk

The issue occurs in this code segment (imports added here for completeness; data and knn_graph are prepared earlier in the notebook):

import itertools
import cuml

for i in itertools.count():
    cuml.UMAP(n_neighbors=15) \
        .fit(data, knn_graph=knn_graph)

Looking at the GPU memory usage over time, I can see that the model is not always garbage collected between iterations. The data accumulates over a few iterations and is then deleted every so often, but not all of it. At some point this seems to always lead to an out of memory error. With the input data shape I chose for the Colab demo this took a lot longer than I expected (approx. 20 minutes, 641 iterations), but I think plotting the memory usage over time shows the issue quite nicely:

[Figure: GPU memory usage over time]
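
For reference, a memory trace like the one plotted above could be collected with something along these lines; the notebook's exact measurement method isn't shown here, so the use of pynvml and the helper name are assumptions:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

def used_gpu_memory_gb():
    # Current device memory usage in GB, as reported by NVML (same source as nvidia-smi)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e9

# e.g. append used_gpu_memory_gb() to a list once per loop iteration and plot it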

With larger datasets such as those that I used when I originally encountered this issue, the OOM happens after way fewer iterations, maybe 10. In the image you can see small and large "teeth". I think when I originally encountered this issue, I had the OOM on one of the small teeth, even before the first large drop in memory usage.

Expected behavior
I would expect the memory usage not to increase to the point of an OOM error.

Environment details

cjnolet commented 3 years ago

Hi @ietz, thank you for filing an issue for this. I ran your script on my V100 (RAPIDS 21.08 nightly packages) and was able to reproduce the trending sawtooth pattern that you've pointed out. I ran the loop for about 25 minutes while running watch -n 0.1 nvidia-smi in a separate window and noticed it peaked around 12-14 GB but didn't go any higher.

Adding a gc.collect() after each loop iteration seemed to make it consistently peak around 4 GB and revert to the same value (± 0.1 GB) after the loop. If you are able, can you try adding the gc.collect() after each iteration and let us know if it fixes the problem?
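
Concretely, the suggested change amounts to something like this minimal sketch (data and knn_graph as prepared in the notebook):

import gc
import itertools
import cuml

for i in itertools.count():
    cuml.UMAP(n_neighbors=15) \
        .fit(data, knn_graph=knn_graph)
    gc.collect()  # free the now-unreferenced model and its device allocations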

ietz commented 3 years ago

Hey @cjnolet, and thank you for your response.

The gc.collect() call after every iteration does indeed resolve my issue, and my parameter sweep has now finished without any further complications. I was not aware of this command and had read in some other issue here that just using del to delete the reference should be enough. Thank you!

If you still want to reproduce the OOM without gc.collect(), you could increase the size of the data array. With a shape of (1_000_000, 500) I got an OOM after just 10 iterations, i.e. less than one minute of execution time. With that shape, the memory usage after successive iterations was 4.4, 6.3, 8.2, 10.1, 12.0, 6.3, 8.2, 10.1, 12.0, and 13.8 GB, followed by the OOM.
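
In case it helps anyone reproducing this, a larger input of that shape could be generated with something like the following (the notebook's actual data construction may differ; random CuPy values are an assumption here):

import cupy as cp

# Hypothetical stand-in for the notebook's input: random float32 data with the
# larger shape mentioned above (roughly 2 GB on the device).
data = cp.random.random((1_000_000, 500), dtype=cp.float32)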

In terms of results, it sadly seems that even with my parameter sweep I could not get outputs from cuML UMAP that are comparable to those of the umap-learn library, as ~¼ of the points are mapped to strange outlier positions far away from the main structure. I guess I'll just watch #3467 and try again once that is resolved.

cjnolet commented 3 years ago

As a result of your experience with this problem in RAPIDS, do you think it might be helpful if we added some documentation about the use of gc.collect()? If so, we can convert this issue over to a feature request.

ietz commented 3 years ago

Sure, I think some info about that might very well help someone, as long as they can find it. My problem was that I thought I had to look for some RAPIDS-specific solution, since the problem was about GPU memory. As it's just standard Python, a short note along the lines of "gc.collect() also works with RAPIDS" would probably have been enough in my case.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.