llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

llvm:SelectionDAG: memory leak caused by static std::set EVTs #88233

Open hensz opened 6 months ago

hensz commented 6 months ago

Hi!

We're using LLVM for JIT compilation in our in-memory database, so memory handling is crucial. While switching from LLVM 13 to 18 we saw that the static std::set EVTs in SDNode::getValueTypeList() causes a memory leak. As far as I understand the code, a new value is inserted into the set for each extended EVT (we're using i96) whose Type pointer differs. And this happens even for identical types, if the compilation is done with a new instance of the LLVM classes.

A possible fix for this leak would be to make EVTs a member of SelectionDAG. However, SDNode::getValueTypeList() has no access to the SelectionDAG instance, and passing EVTs as a parameter would require many further changes, since getValueTypeList is called from SDNode::getSDVTList(), which in turn is called in many locations, including outside of SelectionDAG.

What can we do to close this memory leak?

@nhaehnle, you switched EVTs from a managed static to a local static. Do you know how to handle the situation in a better way?

marcauberer commented 6 months ago

@nhaehnle Would it be an option to revert to ManagedStatic for this method? We currently have no better idea of how to protect this against memory leaks. We can take over the patching once there's a decision on how to proceed.

cc @RKSimon @topperc

nhaehnle commented 6 months ago

That std::set is going to grow based on the types you're using -- the choice of local static vs. ManagedStatic doesn't change that.

Either way, the memory ought to be freed when LLVM is unloaded (global destructors are run).

What is your actual problem? If the problem is that this set is getting too large for some reason in your setup, that would require a much more fundamental redesign of the code.

hensz commented 5 months ago

@nhaehnle The actual problem is that the set is continuously growing, since the set uses the pointer to the Type as its key, and this pointer can change with every compilation. And even if clearing the set from our code were feasible (which is quite difficult due to concurrency), relying on global destructors doesn't work for us, since unloading LLVM is not an option.

So reverting to ManagedStatic would not solve the memory leak as such, it would only help us in tests.

nhaehnle commented 5 months ago

Right, that's what I suspected.

Changing the behavior here is probably quite difficult -- as you already noted, a huge amount of code is written under the assumption that EVTs can be accessed without reference to any other object (like the SelectionDAG or the LLVMContext). Changing this assumption would be a Herculean task.

And of course, SelectionDAG can be used in a multi-threaded context, so there is no trivial point at which the EVTs container can be cleared. Reference-counting is also a bad idea (not to mention it's far from obvious how we'd implement it).

The one thing I can think of, if you do want to tackle this, is to take inspiration from generational GC approaches, based on the knowledge that pointers to elements of the set can only be live for the lifetime of a SelectionDAG. Basically:

(This isn't actually a real generational GC. But it's something that does GC, and it uses generations, so.... :) -- you could also think of it as taking some inspiration from read-copy-update implementations.)

hensz commented 5 months ago

Indeed, that doesn't sound reasonable. But how about using a different key for the EVT? In our case we only use i96, so if we could identify existing i96 Types in EVTs, there would be no need to create new ones for every compilation, and thus no leaks. I understand that using the pointer to the type is currently the easiest way; I also don't know the internals well enough to come up with a better key for a Type.

nhaehnle commented 5 months ago

I just realized that EVTs store (and, for extended ones, are identified by) llvm::Type*, and llvm::Type is owned by LLVMContext. So if you keep creating and destroying LLVMContexts, it's very likely that the EVTs set becomes populated with dangling pointers -- which is, strictly speaking, iffy if a later LLVMContext allocates another Type object at the same address as an earlier one did.

So perhaps yet another alternative to my "generational GC" approach would be to have separate EVT sets per LLVMContext, and to destroy them when their LLVMContext is destroyed. How exactly this would work is a bit unclear, because we don't want circular dependencies between the core component that defines LLVMContext and CodeGen. Perhaps you could take some inspiration from what we've built in llvm-dialects here:

This will automatically free the memory when the LLVMContext is destroyed. Plus, you can remove the mutex, since LLVMContexts are single-threaded.

hensz commented 5 months ago

Thanks a lot for the suggestion, this sounds good. However, I doubt we will have the time to tackle it. We're currently in the process of migrating from LLVM 13 to 18, there are still many things to do, and this migration has priority. We are thinking about preventing these memory leaks by not using extended EVTs - currently we only have one type that uses this kind of EVT, so handling it differently sounds feasible.

topperc commented 5 months ago

I wonder if we could use the VTListMap that lives in SelectionDAG for extended EVTs instead of the std::set. That map is used for nodes that produce more than one result. It would mean we don't keep the EVT object alive across multiple SelectionDAGs. It may increase compile time for functions that make heavy use of extended VTs, but maybe that's acceptable?

topperc commented 5 months ago

> I wonder if we could use the VTListMap that lives in SelectionDAG for extended EVTs instead of the std::set. That map is used for nodes that produce more than one result. It would mean we don't keep the EVT object alive across multiple SelectionDAGs. It may increase compile time for functions that make heavy use of extended VTs, but maybe that's acceptable?

Never mind, that doesn't work, because there are a bunch of subclasses of SDNode that need to create an SDVTList without a SelectionDAG object.

TorstenIhben commented 2 months ago

While the change from Craig is indeed a nice improvement, the issue itself is not solved. Any idea or plans how to tackle this?

nhaehnle commented 2 months ago

I sketched an idea back in April, but I don't think anybody is planning to work on it or any alternative. So it's free for any takers :)

topperc commented 2 months ago

I believe this is one attempt I made at fixing it by using the existing FoldingSet when the type is an EVT. https://github.com/llvm/llvm-project/commit/42424c3f5fd2b5e3d204f1bfc7c04e6d447c1653

This was the compile time tracker results from that https://llvm-compile-time-tracker.com/compare.php?from=c5dcb5239e5a3ee68155ba2d09d1fa37ca512cd7&to=42424c3f5fd2b5e3d204f1bfc7c04e6d447c1653&stat=instructions:u

TorstenIhben commented 2 months ago

@topperc I don't fully understand the code. Currently, extended EVTs use the pointer to the type as the key. What's used as the key with your change? Sorry, I've not worked with the code for quite some time. The compile time tracker results look ok-ish to me, though I cannot really judge the numbers.

topperc commented 2 months ago

> @topperc I don't fully understand the code. Currently, extended EVTs use the pointer to the type as the key. What's used as the key with your change? Sorry, I've not worked with the code for quite some time.

The global storage that has the problem is only used for VTLists containing a single VT. VTLists containing two or more VTs use a separate FoldingSet owned by SelectionDAG. My change uses that existing FoldingSet for VT lists containing a single VT if that VT is not simple.

> The compile time tracker results look ok-ish to me, though I cannot really judge the numbers.

They might be ok. I was hoping for no red. I was surprised that non-simple VTs show up often enough to make a measurable difference. But maybe there's some other reason I don't understand.

TorstenIhben commented 1 week ago

Hi @topperc, we definitely want to have a look at your change with the test case that detected the memory leak. Currently we're migrating from LLVM 18 to LLVM 19 (we're progressing quite fast), but once that is done, we will try out your fix. Could you perhaps also try your fix on LLVM 19 and evaluate the compile time again? We saw that compile time changed a bit in 19, so if we're lucky, your fix will behave more nicely with respect to compile time :-)