Closed foutrelis closed 2 years ago
(For the record, it also crashes compilation of darktable opencl kernels)
Many thanks for the detailed report! This is indeed LLVM 14 specific (also reproducible with earlier LLVM 14 tags). Apparently, the root cause lies in erroneous SPIR-V generation by OpenCL Clang - kernel functions get duplicated within the SPIR-V module (not the case with pre-LLVM 14).
During SPIR-V -> LLVM translation, the duplicate functions themselves are ignored by the mapping mechanism, however the metadata analysis is not so clever, so !{opencl.kernels}
metadata recieves 2 similar entries for each function.
Meanwhile, IGC's kernel argument metadata analysis assumes that each entry refers to a unique function, and, while mapping IGC::FunctionMetadata*
onto llvm::Function*
entities, unknowingly "doubles" each kernel-specific list of argument information. Finally, LLVM's Function::arg_iterator
gets advanced onto invalid memory, since the iteration mechanism assumes that the size of IGC::FunctionMetadata*
's vectors (count
) is less or equal to llvm::Function*
's argument count.
In fact, even if it wasn't for out-of-bounds access crashes, we'd still be having a bug with "extra" argument metadata being applied to irrelevant function arguments. This would've probably been harder to catch.
Speaking of IGC functionality, we could implement a simple workaround either by guarding against this "doubling" of argument information list, or by revoking the llvm::Argument *
iteration mechanism to not depend on IGC::FunctionMetadata
's collection size. However, first I'll try to determine the OCL Clang-level root cause of incorrect SPIR-V generation - after all, it's also specific to the LLVM 14-compatible version of opencl-clang
.
After debugging the OCL Clang compilation (LLVM -> SPIR-V conversion, to be more exact), I've come to think we're essentially dealing with the consequences of https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/85815e725ce5bdc970b812b4bbff73d4b2a44046.
So it turns out that the actual duplication of kernel argument metadata entries was happening in SPIR-V -> LLVM translation, induced by the difference between the Khronos Translator and our internal SPIR-V Reader copy in IGC/AdaptorOCL/SPIRV
(in essence, by the generation of !opencl.kernels
module metadata). For upstream translation cases, the "duplicate" metadata entries stemming from entry point kernel wrappers completely override the metadata generated for the actual kernel functions, however in our case, root nodes for all kernels (including the de facto duplicates) get inserted into !opencl.kernels
.
I'm currently trying to exclude the possibility for such duplication on the upstream level with https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/1526 - once correctly applied onto IGC's SPIR-V consumer, these changes resolve the issue.
@AGindinson Since you’ve closed that PR, what’s the plan now? This is blocking llvm upgrade in Arch, so if a temporary work-around could be implemented in IGC while the deeper solution is being investigated, that would help us a lot. ;)
@AGindinson Since you’ve closed that PR, what’s the plan now? This is blocking llvm upgrade in Arch, so if a temporary work-around could be implemented in IGC while the deeper solution is being investigated, that would help us a lot. ;)
@ArchangeGabriel Sorry for holding off the update on the matter. I've already prepared an IGC-level workaround and will link the commit once it's merged.
No worry! It’s a good thing that you’re trying to fix things properly, but it’s great too that there is a coming workaround. :)
@AGindinson Can confirm it fixes this issue. :) There is another one still crashing compute-runtime test suite for which @foutrelis is going to open a new issue soon once he will have the backtrace, but for darktable the current fix seems enough (also cc @frantisekz for https://bugzilla.redhat.com/show_bug.cgi?id=2075944) and will allow us to move on with llvm14 –we lived with tests disabled in compute-runtime because of #204 for ~8 months, so not a new situation.
Thanks for ping @ArchangeGabriel ; already on the way to the repositories ( https://bodhi.fedoraproject.org/updates/FEDORA-2022-c3e3ae48a9 ) !
@ArchangeGabriel @foutrelis @frantisekz FYI, there have been further changes in the handling of kernel metadata. My initial fix for this issue resulted in SPIR-V execution mode information loss with LLVM 14 (e.g. breaking intel_reqd_work_group_size
attribute support). The approach has been reworked within https://github.com/intel/intel-graphics-compiler/commit/6a13fa903f380e17378286a7cd43995b0ae162ad - in case LLVM 14 yields runtime failures for some of the tests, retesting with this new commit may give better results.
@AGindinson Thanks for highlighting subsequent relevant fixes. We definitely want to include them in our IGC package which is built against LLVM 14.
in case LLVM 14 yields runtime failures for some of the tests, retesting with this new commit may give better results
I can still repro #250 (for what it's worth 🐭️).
Correction: ocloc doesn't crash when building compute-runtime with current IGC master. I have updated #250 accordingly. The single test failure I'm now seeing may or may not be related to this (closed) issue.
One of the
ocloc
commands that crashes:Running it under gdb shows this backtrace:
Looking around in gdb it appears that
hasByValAttr
is called on an invalid instance:The index value of 5 comes from the
count
variable near the start ofCOpenCLKernel::CreateKernelArgInfo
:Therefore, my conclusion is that the code is trying to access the 6th element of
entry->Arguments
but the latter only has 5 elements. As to why this is happening, perhaps someone more familiar with the code in question (and LLVM in general) might be able to figure out why.I'm using these software versions: