Closed: AsherJingkongChen closed this issue 1 week ago
Thanks for flagging this and linking the MWE 🙏
I reproduced the issue with your MWE as well as another training example - looks like some nodes are not removed from the graph hashmap.
The capacity increases pretty slowly over time as you've noticed. So while not super critical, still not great 😅
@laggui
Thank you for pointing out the problem. I am still thinking about solutions; I think checking the nodes in the graph would help.
It would be severe only in some cases, especially when parameters are frequently updated or replaced. Actually, I found this because another program of mine (WIP) caused 30 GB of swap usage.
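To make the "checking the nodes in the graph" idea concrete, here is a toy sketch in plain Rust. It does not use burn-autodiff's real types; `Graph`, `register`, and `consume` are made-up names for illustration only. The point is that every node registered by a tracked tensor should be removed again once its backward pass is done, and an emptiness check between steps would expose the ones that stay behind.

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Default)]
struct Graph {
    nodes: HashMap<NodeId, &'static str>,
}

impl Graph {
    /// A tracked tensor registers its node when it is created.
    fn register(&mut self, id: NodeId, op: &'static str) {
        self.nodes.insert(id, op);
    }

    /// A finished backward pass should remove the node again.
    fn consume(&mut self, id: NodeId) {
        self.nodes.remove(&id);
    }
}

fn main() {
    let mut graph = Graph::default();
    for step in 0..3u64 {
        graph.register(step, "matmul");
        graph.consume(step); // skipping this removal is the kind of leak observed here
        // The check: between training steps the map should be empty; otherwise
        // its capacity (and the backing allocations) keeps growing over time.
        assert!(graph.nodes.is_empty(), "leaked nodes after step {step}");
    }
}
```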
`wgpu/eager`

Description
Wgpu backend

Screenshots
In Apple's Instruments.app, I can see an allocation slope (90 MB -> 120 MB).
Was going to investigate this one but I stumbled upon this possibly linked issue #2042.
I think these are two separate issues you're observing with your new branch, though both of these should be fixed.
I'm working on something right now but I'll have to prioritize these as soon as I can.
@laggui
Yes, I also think these two issues have different causes (`burn-autodiff` and `cubecl`).
Still, I think they are correlated or have similar root causes, since the hash table leaks too.
I should dive into the heap profiles on different backends.
`bench/heap-leak` at the example repository: to see which backend setting has leaks and to trace their heap profiles, I take snapshots every 10 seconds for 1 minute using Apple's Instruments.app.
| Binary name | Leaks on debug? | Leaks on release? |
|---|---|---|
| ndarray-ad-false | X | X |
| ndarray-ad-true | O | O |
| wgpu-ad-false | O | O |
| wgpu-ad-true | O | O |
`-ad-` refers to the `autodiff` feature (O = leaks, X = no leak).
Will #1962 (no dealloc) be recognized as a heap memory leak?
On `wgpu-ad-false`, commit 51aea94a3048ed7a76c491339e93151fc0266aa2 leaks while its parent doesn't.
Describe the bug
The memory leaks come from `AutodiffTensor`. They only appear with `Autodiff<B>` (AD-enabled) backends. I profiled a simple example using Apple's Instruments.app (v15.3) and found memory leaks.
To Reproduce
I opened a repository as an example of the memory leaks. It trains a simple linear model. If you don't have Apple's Instruments.app, you can also find the memory leaks using Valgrind.
You can clone the example git repository here: https://github.com/AsherJingkongChen/burn-model-linear.git
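For context, here is a minimal sketch of the kind of training loop that triggers the leak. This is not the repository's exact code: the backend alias, shapes, and tensor calls are assumptions based on a recent Burn API and may need adjusting for your version.

```rust
use burn::backend::{Autodiff, Wgpu};
use burn::tensor::{Distribution, Tensor};

// AD-enabled backend, the configuration where the AutodiffTensor leak shows up.
type B = Autodiff<Wgpu>;

fn main() {
    let device = Default::default();
    // Each iteration builds a small tracked graph and runs a backward pass.
    // With the leak, allocations made in AutodiffTensor::new stay persistent
    // instead of being freed after the gradients have been consumed.
    for _ in 0..2000 {
        let w = Tensor::<B, 2>::random([8, 8], Distribution::Default, &device).require_grad();
        let x = Tensor::<B, 2>::random([32, 8], Distribution::Default, &device);
        let loss = x.matmul(w).sum();
        let _grads = loss.backward();
    }
}
```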
Expected behavior
I expect no memory leak comes from any tracked `AutodiffTensor`.

Screenshots
`leak.1.trace.png`: There are 5100 persistent allocations from `AutodiffTensor::new` in about 2000 iterations.
Additional context
I think there are some circular references of `rc: Arc<NodeId>` after the step is tracked. Sadly, it is too hard to locate the bug; I need tools to show me where the circular references are.
The leak problem would become severe if I frequently replaced the autodiff tensors or trained a large model for hours. 🥺