llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.82k stars 11.46k forks source link

Intermittent crash in LLVM when used from OSL (Open Shading Language) in interactive use case (seemingly due to insufficient 32 bit relocation offsets?) #65641

Open ZapAndersson opened 1 year ago

ZapAndersson commented 1 year ago

I reported this bug first as an OSL bug here: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/issues/1712 but it seems the "bug" is in LLVM itself.

TL;DR - in a heavey interactive session when compiling OSL shaders while rendering high-resolution images, we sometimes, super-intermittently, seemingly randomly, end up crashing (= fatal error aborting) here:

image

Basically, whatever assumptions is being made of memory blocks being within 2Gb of each other, do not hold (probably cause other threads have been allocating multi-gigabyte chunks in the meantime).

This of course explains the intermittent behavior of the bug, and that it only happens while heavy things are happening concurrently in other threads (which has made it quite difficult to track down.)

Not understanding much of LLVM, but realizing that 32 bit offsets is highly likely not to be enough, as a test, I made an ugly hackaround and replaced the use of IMAGE_REL_AMD64_ADDR32NB relocations (which seems to be the 32 bit offsets) with IMAGE_REL_AMD64_ADDR64 (which looks to my eye as full 64 bit offsets?) and the problem goes away!!

But being an utter LLVM coding noob, I sincerely doubt this is the "right solution".

For the "long explanation", see the issue in the OSL GitHub linked above, including my "godawful hackaround".

What is the "right solution" here, or are we doing something fundamentally wrong somewhere?

ZapAndersson commented 1 year ago

This "godawful hack" does, indeed, make the problem go away. But does it make ANY sense?

image

efriedma-quic commented 1 year ago

Messing with the relocations is just going to corrupt your code. You need to fix the code generation to choose instructions that have operands with the right width.

I think JIT compilation tries to default to the large code model, which should avoid relocations like this, but maybe you're somehow setting the wrong code model, or maybe there's some code in LLVM that isn't checking the current code model correctly.

llvmbot commented 1 year ago

@llvm/issue-subscribers-orcjit

efriedma-quic commented 1 year ago

Looked briefly for what might be generating that particular relocation, and mostly what comes up is stuff related to exception-handling. So maybe not fixable by just making the fields wider; I think those datastructures actually require a 32-bit offset. So maybe you need to rearrange memory so the relevant bits are actually within 2GB.

Or maybe if you don't need exception-handling, you can disable it.

lhames commented 1 year ago

@ZapAndersson your hack replaces a relocation that writes a 32-bit value with one that writes a 64-bit value, so it'll be trashing the following 4 bytes everywhere. If IMAGE_REL_AMD64_ADDR32NB is only used in exception tables and you don't need those you might get away with that, but it's definitely dangerous and not a viable long-term solution.

I think the canonical Right Answer here is for LLVM to use JITLink (the new JIT linker) by default for COFF/x86-64, and for OSL to use a slab-based allocator that reserves a 2Gb chunk of address space up front. If it's a long-running shader session OSL would need to either free unused JIT'd code to stay under the 2Gb limit, or distribute JIT'd code amongst multiple JITDylibs so that each stays under its respective 2Gb limit.

Disabling or stripping out exception tables (if that's the only place this relocation comes up) is also an option, but seems more brittle to me.

@sunho Do you happen to remember where IMAGE_REL_AMD64_ADDR32NB gets used?

lhames commented 1 year ago

@ZapAndersson We should also check out the large code model option as @efriedma-quic suggested. If it's not the default then switching to it may be an easy fix for now.

sunho commented 12 months ago

Note that IMAGE_REL_AMD64_ADDR32NB relocation is only used for seh frame (i.e. for stack walking) and debugging. If you don't use the windows exception or debug info, you can just disable those relocations on the RuntimeDyLD side as hack. (only IMAGE_REL_AMD64_ADDR32NB though, disabling other relocation will indeed break the code) If you want to be correct about these, you do need to use JITLink which does guarantee the offsets to be within 32 bit. I think hacking on codegen side is incorrect even as a hack since It will modify more bytes than intended. Just comment out the line you are seeing in the first screenshot.

ZapAndersson commented 12 months ago

Thank you guys!

Yeah I need to be clear, I have absolutely no idea what I am doing here ;)

So can I have a bit of an "explain like I'm five" explanation of how to

a) "disable exceptions" or b) "choose a large code model"

To me, though, it doesn't seem like the code itself is large (it kinda isn't) but when other code in other threads are allocating huge chunks of memory (because it's rendering Godzilla stomping a skyscraper with gazillions of polygons) we happen to get allocations that are far apart?

But I have no idea whatsoever how LLVM's memory handling works, there's the cryptic (to me) comment in the code about the "The MemoryManager can make sure this is always true by forcing the memory layout to be: CodeSection < ReadOnlySection < ReadWriteSection", which sadly is greek to me.

To be clear, I didn't write this code, I know pretty much less than zero about this code, I'm just running into this bug, been stabbing at it with a glowy stick for a workaround, and need help from anyone who actually knows sometihing, so any gem of wisdom is HIGHLY APPRECIATED, and THANK YOU ALL for any insight you can give me.

ZapAndersson commented 12 months ago

Just comment out the line you are seeing in the first screenshot.

So are you saying that if we con't need debugging or stack walking (as far as I know, we don't) this message is pointless and won't actually break anything, so I can just comment out the fatal error itself and be on my merry way, or what are you actually telling me? (Total LLVM Noob warning, sorry).

Again, thank you for all the answers, they are just a hair over my LLVM understanding (which is zero)

ZapAndersson commented 12 months ago

I think I figured out where (inside OSL) the code model is picked, there is currently a commented out line that says https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1442

//engine_builder.setCodeModel(llvm::CodeModel::Default);

...and I see no other place in OSL where it references the "CodeModel". I will poke at this with a blunt stick and see if it changes anything.

lhames commented 12 months ago

Yeah I need to be clear, I have absolutely no idea what I am doing here ;)

No worries. You're into a pretty obscure area of the stack here. Many LLVM engineers would be asking the same questions if they hit this. :)

I just tried using the large code model*, but it didn't seem to help:

% cat hw.c
int printf(const char * restrict format, ...);

int main() {
  printf("hello, world!\n");
  return 0;
}
% clang --target=x86_64-pc-windows-msvc -mcmodel=large -c -o hw.coff.o hw.c
% llvm-objdump -r hw.coff.o                                                      

hw.coff.o:      file format coff-x86-64

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                     VALUE
000000000000000e IMAGE_REL_AMD64_ADDR64   ??_C@_0P@LGCPODBN@hello?0?5world?$CB?6?$AA@
0000000000000018 IMAGE_REL_AMD64_ADDR64   printf

RELOCATION RECORDS FOR [.pdata]:
OFFSET           TYPE                     VALUE
0000000000000000 IMAGE_REL_AMD64_ADDR32NB .text
0000000000000004 IMAGE_REL_AMD64_ADDR32NB .text
0000000000000008 IMAGE_REL_AMD64_ADDR32NB .xdata

So are you saying that if we con't need debugging or stack walking (as far as I know, we don't) this message is pointless and won't actually break anything, so I can just comment out the fatal error itself and be on my merry way, or what are you actually telling me? (Total LLVM Noob warning, sorry).

Yep -- if IMAGE_REL_AMD64_ADDR32NB is only being used to fix up metadata that you don't actually need, which @sunho's comment seems to confirm, then you can just comment out that fatal error. Your unused metadata will be wrong, but it's unused so no big deal.

In general we do want this metadata to be usable, so from our perspective the proper general case fix is two-part: (1) LLVM uses JITLink for COFF/x86-64, and (2) Clients use the MapperJITLinkMemoryManager and reserve 2Gb up front. OSL should aim for this in the future, but the LLVM change has to happen first.

efriedma-quic commented 12 months ago

Large code model isn't a thing on Windows: the COFF spec has 4-byte fixed-width fields that restrict the size of the resulting image. So the exception-handling metadata also has 4-byte fixed-width fields; see https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64 .

lhames commented 12 months ago

@efriedma-quic Thanks -- it's great to get confirmation of that. Do you think it'd be worth adding a warning / error when -mcmodel=large is used with a COFF target?

Either way this supports the long term plan I outlined above to move to JITLink and require slab allocation.

ZapAndersson commented 12 months ago

No worries. You're into a pretty obscure area of the stack here. Many LLVM engineers would be asking the same questions if they hit this. :)

Thank you, that make me feel at least slightly less like an idiot..... slightly .... :)

So tell me more about these memory managers, you can see the way OSL sets this up in this function https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1430 and it is just doing a new MemoryManager(m_llvm_jitmm))); which I presume is just instantiating some default memory manager(!?!?)

Should this code look... different... somehow?

I appreciate all the handholding, I truly do. Cavamen with a blunt stick here....

/Z

lhames commented 12 months ago

So tell me more about these memory managers, you can see the way OSL sets this up in this function https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1430 and it is just doing a new MemoryManager(m_llvm_jitmm))); which I presume is just instantiating some default memory manager(!?!?)

Sure thing. In both RuntimeDyld (the older JIT linker that MCJIT uses), and JITLink (the new JIT linker) memory is allocated via abstract memory manager interfaces. The interfaces differ between the two, but at a high level they're both responsible for reserving memory for JIT'd code and data, transporting the contents of linker memory into the address spaces that they reserved (this can include transporting it across process boundaries), applying memory protections, and eventually freeing the memory when it's no longer needed.

Looks like OSL has a custom memory manager that forwards to an LLVM manager under the hood, but with a twist:

/// MemoryManager - Create a shell that passes on requests
/// to a real LLVMMemoryManager underneath, but can be retained after the
/// dummy is destroyed.  Also, we don't pass along any deallocations.
class LLVM_Util::MemoryManager final : public LLVMMemoryManager 

Should this code look... different... somehow?

This code will need to change in the future when OSL switches to JITLink (which OSL will want to do at some point because we're going to deprecate MCJIT in the not too distant future).

It's not clear to me whether OSL will need a custom allocator, or whether the new off the shelf allocators will work. It depends on why OSL wants to retain that memory. Do you know anyone who might know why it's doing this?

I appreciate all the handholding, I truly do. Cavamen with a blunt stick here....

No worries at all!

ZapAndersson commented 12 months ago

With my limited understanding, peeking at this memory manager code, it seems that "all" the OSL memory manager is doing is wrapping an existing llvm::SectionMemoryManager object and holding on to it, such that many threads can use the same memory manager, and once all the threads are done, the memory manager can go out of scope and clean up. https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L119C9-L119C35

Or, well, that's my approximation of an understanding of what it is supposedly doing.

Is this... bad?

ZapAndersson commented 11 months ago

Okay, here's what I've figured out so far.

OSL uses a memory manager, that is held by rendering threads per-thread-info stuff. And this memory manager is kept around until the last rendering thread dies.

Except... we use TBB for rendering. TBB actually has a set of worker threads that are always in flight. So those threads never die.

So the memory manager ends up being kept around forever.

That wouldn't be a big deal, in the normal case. Except I also see this:

image

Soo, if the wrapper doesn't pass on deallocations, it will end up allocating forever, eventually breaking the 2Gb barrier and triggering this fatal error problem....!? Maybe?

lhames commented 11 months ago

Soo, if the wrapper doesn't pass on deallocations, it will end up allocating forever, eventually breaking the 2Gb barrier and triggering this fatal error problem....!? Maybe?

It's hard to tell without looking at the failure case (you could add an allocation counter to the allocator to test your theory). It seems like a very good explanation though.

The other possibility is that you get unlucky with an OS mmap / VirtualAlloc call before you ever hit the 2Gb limit. In theory you could make two requests for 1 page each and the OS could choose to hand you back pages that are more than 2Gb apart.

Either way a switch to ORC + JITLink + SlabAllocation will fix this: It supports thread-safe deallocation of each object file / module added to the JIT.

If you want to consider a switch the next question is what platforms OSL needs to run on (OS + architecture). I can let you know whether ORC supports them all, and suggest next steps for moving over.

ZapAndersson commented 11 months ago

In my particular case (for 3ds Max) it's Windows x64 but other apps of ours (like the Arnold renderer) are Linux x64 and Mac (64 bit) as well.

I'll bring all this up at the OSL Technical Steering Comittee meeting that is later today, se if we can figure out what is best.

lhames commented 11 months ago

Sounds good. JITLink is already mature for x86-64 on Linux, macOS and Windows, and for arm64 on Linux and macOS. My first impression is that it's worth testing out a port.

ZapAndersson commented 11 months ago

We also use some black voodoo to turn it into PTX for running on NVidia GPU's (way above my paygrade).

I think the potentially bigger issue is what LLVM versions we can accept. Since OSL is something integrated into other apps, often OSL needs to match the LLVM version used with one used by the APP in some cases. (Or so I'm told, I could be wrong)

lhames commented 11 months ago

Ahh. Yeah -- that might be a bigger problem. Please let me know what you find out.

ZapAndersson commented 11 months ago

Want to join in on the OSL Technical Steering Comittee meeting, we'll be talking about this there? It's in about 50 minutes from me posting this message, on Zoom. :)

No pressure :) but you would be talking directly to the people who actually understand something, instead of me :)

Reminder: OSL TSC meeting ( every other week )

When: Thursday, September 14, 2023 2:00pm to 3:00pm (UTC-07:00) America/Los Angeles

Where: https://zoom.us/j/100511909

Organizer: Chris Kulla ckulla@gmail.com

View Event

Description: Every other week meeting of the OSL TSC.

lhames commented 11 months ago

Hi Zap,

Any idea when the JIT part would be discussed? I'm flat out at the moment and can't spare an hour, but could probably drop in for 20 mins.

lhames commented 11 months ago

Actually I've got a 2:30 meeting anyway -- So I'll come for the first 30. :)

ZapAndersson commented 11 months ago

Oh I expect us to start with that

lhames commented 11 months ago

@ZapAndersson Looks like I need to be an authorized participant.

Could you ask them to invite me? I'm lhames@gmail.com.

ZapAndersson commented 11 months ago

thank you so much for this!!

lhames commented 11 months ago

No worries. :)

ThiagoIze commented 11 months ago

I don't think this is related to the crash, but why do we keep saying 2GB when the code is comparing against a UINT32_MAX -- isn't that 4GB? Is the code wrong (should be INT32_MAX) or the comment?

lhames commented 11 months ago

@ThiagoIze 2Gb came from the comment in the original code, but I think you're right -- Microsoft's COFF document at https://learn.microsoft.com/en-us/windows/win32/debug/pe-format just defines IMAGE_REL_AMD64_ADDR32NB as "The 32-bit address without an image base (RVA)" -- no mention of it being signed. The limit is probably 4Gb.

lhames commented 11 months ago

We see similar overflows in signed 32-bit relocations -- that may have led to the typo in the original code.