tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Heap memory leak from `AutodiffTensor` #2333

Closed AsherJingkongChen closed 1 week ago

AsherJingkongChen commented 2 weeks ago

Describe the bug

The memory leaks come from `AutodiffTensor`. They appear only when both of the following hold:

  1. The backend is Autodiff<B> (AD-enabled).
  2. The model/param/tensor requires grad (Tracked).

I profiled a simple example using Apple's Instruments.app (v15.3) and found memory leaks.
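For reference, here is a minimal sketch of that configuration (assuming the `burn` tensor API with the `autodiff` and `ndarray` features enabled; the backend choice, shapes, and loop are illustrative and not the actual MWE):

```rust
use burn::backend::{Autodiff, NdArray};
use burn::tensor::Tensor;

// 1. AD-enabled backend
type B = Autodiff<NdArray>;

fn main() {
    let device = Default::default();
    for _ in 0..10_000 {
        // 2. The tensor requires grad (Tracked)
        let w = Tensor::<B, 1>::from_floats([1.0, 2.0, 3.0], &device).require_grad();
        let loss = (w.clone() * w.clone()).sum();
        // Every backward pass builds an autodiff graph; the leaked allocations
        // accumulate across iterations and show up in Instruments/Valgrind.
        let grads = loss.backward();
        let _grad_w = w.grad(&grads);
    }
}
```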

To Reproduce

I opened a repository as an example of the memory leaks. It trains a simple linear model. If you don't have Apple's Instruments.app, you can also find the leaks with Valgrind.

You can clone the example git repository here: https://github.com/AsherJingkongChen/burn-model-linear.git

Expected behavior

I expect no memory leak comes from any tracked AutodiffTensor.

Additional context

I think there are circular references involving `rc: Arc<NodeId>` after the step is tracked. Unfortunately, the bug is hard to locate; I would need tooling that shows me where the circular references are.
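To illustrate the failure mode I suspect (a stand-in example only, not burn-autodiff's actual types): two nodes holding strong `Arc` references to each other never reach a refcount of zero, so a leak checker reports them even after both handles are dropped. Storing back-edges as `Weak` would break such a cycle.

```rust
use std::sync::{Arc, Mutex, Weak};

struct Node {
    // Strong edge: if A points to B and B points back to A, neither is dropped.
    parent: Mutex<Option<Arc<Node>>>,
    // A Weak back-edge does not keep its target alive, so it breaks the cycle.
    _back: Mutex<Option<Weak<Node>>>,
}

fn main() {
    let a = Arc::new(Node { parent: Mutex::new(None), _back: Mutex::new(None) });
    let b = Arc::new(Node { parent: Mutex::new(Some(a.clone())), _back: Mutex::new(None) });
    *a.parent.lock().unwrap() = Some(b.clone()); // cycle: a -> b -> a
    drop((a, b)); // the local handles are gone, but each node still holds the other
    // Instruments/Valgrind now report both `Node` allocations as leaked.
}
```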

The leak problem becomes severe when I replace autodiff tensors frequently or train a large model over several hours. 🥺

laggui commented 1 week ago

Thanks for flagging this and linking the MWE 🙏

I reproduced the issue with your MWE as well as another training example - looks like some nodes are not removed from the graph hashmap.

The capacity increases pretty slowly over time as you've noticed. So while not super critical, still not great 😅
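To illustrate that symptom (a purely hypothetical registry sketch, not the real burn-autodiff internals): if entries are removed only when the backward pass actually visits them, every node that is registered but never consumed stays in the map and keeps its captured state alive.

```rust
use std::collections::HashMap;

type NodeId = u64;

/// Illustrative stand-in for a graph registry keyed by node id.
#[derive(Default)]
struct GraphRegistry {
    // Each retained closure captures tensors/state for its backward step.
    steps: HashMap<NodeId, Box<dyn FnOnce()>>,
}

impl GraphRegistry {
    fn register(&mut self, id: NodeId, step: Box<dyn FnOnce()>) {
        self.steps.insert(id, step);
    }

    // Entries are removed only when consumed. Any id that is registered but
    // never passed here is retained forever, so the map's length (and capacity)
    // creeps upward across training iterations.
    fn consume(&mut self, visited: &[NodeId]) {
        for id in visited {
            if let Some(step) = self.steps.remove(id) {
                step();
            }
        }
    }
}

fn main() {
    let mut registry = GraphRegistry::default();
    for step_id in 0..3u64 {
        registry.register(step_id, Box::new(move || println!("backward step {step_id}")));
    }
    registry.consume(&[0, 1]); // id 2 is never visited...
    assert_eq!(registry.steps.len(), 1); // ...so it stays in the map
}
```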

AsherJingkongChen commented 1 week ago

> Thanks for flagging this and linking the MWE 🙏
>
> I reproduced the issue with your MWE as well as another training example - looks like some nodes are not removed from the graph hashmap.
>
> The capacity increases pretty slowly over time as you've noticed. So while not super critical, still not great 😅

@laggui

Thank you for spotting the problem. I am still thinking about solutions; I think checking the nodes that remain in the graph would help.

It would be severe only in some cases, especially when parameters are updated or replaced frequently. I actually found this because another program of mine (WIP) was causing 30 GB of swap usage.

AsherJingkongChen commented 1 week ago

The example (MWE) now has a new branch: wgpu/eager.

laggui commented 1 week ago

Was going to investigate this one but I stumbled upon this possibly linked issue #2042.

I think these are two separate issues you're observing with your new branch, though both of these should be fixed.

I'm working on something right now but I'll have to prioritize these as soon as I can.

AsherJingkongChen commented 1 week ago

@laggui

Yes, I also think these two issues have different causes (burn-autodiff vs. cubecl).

Still, they may be correlated or share a similar root cause, since the hash table leaks as well.

I should dive into the heap profiles on different backends.

AsherJingkongChen commented 1 week ago

Added a branch bench/heap-leak to the example repository.

Description

It checks which backend settings leak and traces their heap profiles.

Result

I took snapshots every 10 seconds for 1 minute using Apple's Instruments.app.

| Binary name | Leaks on debug? | Leaks on release? |
| --- | --- | --- |
| ndarray-ad-false | X | X |
| ndarray-ad-true | O | O |
| wgpu-ad-false | O | O |
| wgpu-ad-true | O | O |

`-ad-` refers to the autodiff feature; O marks a configuration where a leak was detected, X marks no leak.
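The four binaries presumably differ only in the backend type parameter. A hedged sketch of such a setup (assuming the `burn` prelude API with the relevant backend features enabled; the function name, shapes, and wiring are illustrative):

```rust
use burn::backend::{Autodiff, NdArray};
use burn::prelude::*;
use burn::tensor::Distribution;

// One shared loop; each bench binary instantiates it with a different backend.
fn run_bench<B: Backend>(device: B::Device) {
    for _ in 0..1_000 {
        let x = Tensor::<B, 2>::random([64, 64], Distribution::Default, &device);
        let _y = x.clone().matmul(x).sum();
        // Pause so the periodic heap snapshots (every 10 s in Instruments)
        // line up with iterations.
        std::thread::sleep(std::time::Duration::from_millis(10));
    }
}

fn main() {
    // ndarray-ad-true:
    run_bench::<Autodiff<NdArray>>(Default::default());
    // The other binaries swap the backend type, e.g.:
    // run_bench::<NdArray>(Default::default());                        // ndarray-ad-false
    // run_bench::<burn::backend::Wgpu>(Default::default());            // wgpu-ad-false
    // run_bench::<Autodiff<burn::backend::Wgpu>>(Default::default());  // wgpu-ad-true
}
```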

AsherJingkongChen commented 1 week ago

Question

Will #1962 (no dealloc) be recognized as a heap memory leak?
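I don't know the details of #1962, but as a general note, leak checkers distinguish memory that is still reachable at exit from memory with no remaining pointers. A small sketch of the difference (illustrative allocations, unrelated to burn's actual memory management):

```rust
use std::sync::OnceLock;

// A buffer intentionally allocated once and never freed, but still referenced
// from a `static`: Valgrind reports this as "still reachable", not a leak, and
// Instruments' Leaks tool does not flag it.
static CACHE: OnceLock<Vec<u8>> = OnceLock::new();

fn main() {
    CACHE.get_or_init(|| vec![0u8; 1 << 20]);

    // No pointer to this allocation survives the call: Valgrind reports it as
    // "definitely lost" and Instruments' Leaks tool flags it.
    std::mem::forget(Box::new(vec![0u8; 1 << 20]));
}
```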

Findings

On wgpu-ad-false, commit 51aea94a3048ed7a76c491339e93151fc0266aa2 leaks while its parent commit doesn't.