void-linux / void-packages

The Void source packages collection
https://voidlinux.org
Other
2.6k stars 2.16k forks source link

Mesa's OpenCL implementation Clover crashes on `llvm::TargetInstrInfo::ReplaceTailWithBranchTo()` #43921

Open Mek101 opened 1 year ago

Mek101 commented 1 year ago

Is this a new report?

Yes

System Info

Void 6.2.15_1 x86_64 AuthenticAMD notuptodate rrFFFF

Package(s) Affected

mesa-22.3.5_2 mesa-opencl-22.3.5_2 libllvm15-15.0.7_1

Does a report exist for this bug with the project's home (upstream) and/or another distro?

Nope

Expected behaviour

The system should run the Tensorflow model with OpenCL 1.1

Actual behaviour

Running some models causes the OpenCL implementation (Mesa's Clover) to call llvm::TargetInstrInfo::ReplaceTailWithBranchTo() from /usr/lib/libLLVM-15.so, which itself is provided by the libllvm15 package. This function call causes a segmentation fault (verified via gdb).

Steps to reproduce

In my particular case I'm using an old Radeon 8670D, a Terascale 3 iGPU with the radeon kernel driver.

Squeezenet: https://tfhub.dev/tensorflow/lite-model/squeezenet/1/default/1

lorn10 commented 1 year ago

Hi!

Just for the sake of completeness it should be added here that this topic may have some direct or indirect relations with an open TeraScale 3 LLVM bug: https://github.com/llvm/llvm-project/issues/54942.

And also here a proposed solution could be the AMDGPU/R600: Special case addrspacecast lowering for null LLVM fix. :wink:

Based on the available information, the proposed LLVM fix should be applied to all Radeon GPUs with "flat address spaces" limitations which would mean TeraScale 2/3 and even GCN1.

Edit: More information can be found also in the TeraScale 2 LLVM bug report here: https://github.com/llvm/llvm-project/issues/55679

Mek101 commented 1 year ago

The fix seems to be related to llvm 16 and 17 however. Does it need to be backported?

lorn10 commented 1 year ago

Yes, in theory that "flat address spaces" LLVM fix should be back-ported (because it seems quite important especially for older Radeon hardware) but most likely it won't be back-ported.

Until now, this important LLVM fix has not even been tested extensively. I must admit, I was so far simply not enough skilled in building an own LLVM plus Mesa. Will see when I find the motivation and time for that little adventure. :wink:

Mek101 commented 1 year ago

The question is if the file was substantially changed from 15 to 16. If it wasn't, I think one should be able to derive a patch from the commit and apply it during the package build by putting it in a patches/ subdirectory

lorn10 commented 1 year ago

Yeah, maybe someone should ask the original author arsenm if that would be possible. :+1:

As told, I am at the moment not ready to dive into this matter. :wink:

Mek101 commented 1 year ago

It's fine. I managed to make a patch (only 60 lines). I'll try fixing it for xbps-src and build the whole thing later

lorn10 commented 1 year ago

Great!

And it looks that regarding the r600 driver + clover only Mesa MR https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7357 would be relevant to expose OpenCL 1.2 support. (The other two MRs are radeonsi related.)

In the last comment of Dave Airlie (in Mesa MR 7357) it is noted that that this MR includes as the last piece an improved "AMD LLVM backend patch". Everything other landed in separate commits. Whatever, that Mesa MR 7357 has been stalled for 2 years so it looks that at least a re-base would be needed to get it working again in 2023. :wink:

Long story short, every testing result is of interest especially in regard to rusticl. So please report any relevant result back in Mesa Tracker Rusticl on r600. Thanks!

lorn10 commented 1 year ago

Final addition, - when you get the chance please also try clpeak with this R600 LLVM fix and your Radeon HD 8670D.

Maybe it works on TeraScale 3 also without any patching but on TeraScale 2 based GPUs (like Juniper, Turks, Cayman) that little bench tool is broken since literally ever. (Mesa bug 586 has more information about it.)

Mek101 commented 1 year ago

Patched with https://github.com/llvm/llvm-project/commit/17069608940d22cd6266afb948443b11793f0a57

Still segfaults :disappointed:

Mek101 commented 1 year ago

According to gdb the segfault comes from a getNop(), however considering that it also gave me that function before I installed the debug package for the original libllvm15.so, it's likely that the function segfaulting is still llvm::TargetInstrInfo::ReplaceTailWithBranchTo()

Mek101 commented 1 year ago

Here is the patch for reference. It compiles fine. AMDGPU-R600-Special-case-addrspacecast-lowering-for-null.patch.txt

Mek101 commented 1 year ago

And it looks that regarding the r600 driver + clover only Mesa MR https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7357 would be relevant to expose OpenCL 1.2 support. (The other two MRs are radeonsi related.)

In the last comment of Dave Airlie (in Mesa MR 7357) it is noted that that this MR includes as the last piece an improved "AMD LLVM backend patch".

It seems that the only changes in the commit are those for a printf support?

Mek101 commented 1 year ago

Final addition, - when you get the chance please also try clpeak with this R600 LLVM fix and your Radeon HD 8670D.

Maybe it works on TeraScale 3 also without any patching but on TeraScale 2 based GPUs (like Juniper, Turks, Cayman) that little bench tool is broken since literally ever. (Mesa bug 586 has more information about it.)

I can't find it in the repos

lorn10 commented 1 year ago

It seems that the only changes in the commit are those for a printf support?

Yes, that's exactly the last thing (on Radeon GPUs) which is/was missing to advertise full OpenCL 1.2 support in clover. :+1:

I can't find it in the repos

Okay, then it is not avaiable in the void Linux repos. On Debian/Ubuntu it is present so I have only to enter:

sudo apt install clpeak or sudo snap install clpeak

Whatever, - thanks for testing! And maybe you open a new bug report on the llvm-project issue tracker about the llvm::TargetInstrInfo::ReplaceTailWithBranchTo() error. :wink: