intel / intel-graphics-compiler

[LLVM 14] ocloc crash when building compute-runtime #245

Closed foutrelis closed 2 years ago

foutrelis commented 2 years ago

One of the ocloc commands that crashes:

cd /../compute-runtime-22.23.23405/opencl/test/unit_test/test_files \
&& export LD_LIBRARY_PATH=/../build/bin \
&& /../build/bin/ocloc -q -file \
   /../compute-runtime-22.23.23405/opencl/test/unit_test/test_files/kernel_num_args.cl \
   -device bdw -64 -revision_id 0 -out_dir /../build/bin/Gen8core/0/test_files/x64/

Running it under gdb shows this backtrace:

(gdb) bt
#0  0x00007ffff149c527 in llvm::Argument::hasByValAttr (this=0x555555ce51e8)
    at /usr/src/debug/llvm-14.0.6.src/lib/IR/Function.cpp:109
#1  0x00007fffe8ffa9d8 in IGC::COpenCLKernel::CreateKernelArgInfo (
    this=this@entry=0x5555566ec340)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/OpenCLKernelCodeGen.cpp:377
#2  0x00007fffe8ffd904 in IGC::COpenCLKernel::AllocatePayload (this=0x5555566ec340)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/OpenCLKernelCodeGen.cpp:2022
#3  0x00007fffe924d815 in IGC::EmitPass::runOnFunction (this=0x555555689430, F=...)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/EmitVISAPass.cpp:1137
#4  0x00007ffff14fb5e0 in llvm::FPPassManager::runOnFunction (this=0x5555565ec2d0, F=...)
    at /usr/src/debug/llvm-14.0.6.src/lib/IR/LegacyPassManager.cpp:1434
#5  0x00007ffff14fb724 in llvm::FPPassManager::runOnModule (this=0x5555565ec2d0, M=...)
    at /usr/src/debug/llvm-14.0.6.src/lib/IR/LegacyPassManager.cpp:1480
#6  0x00007ffff14fcfbb in (anonymous namespace)::MPPassManager::runOnModule (M=..., 
    this=0x555555a689f0) at /usr/src/debug/llvm-14.0.6.src/lib/IR/LegacyPassManager.cpp:1549
#7  llvm::legacy::PassManagerImpl::run (this=<optimized out>, M=...)
    at /usr/src/debug/llvm-14.0.6.src/lib/IR/LegacyPassManager.cpp:539
#8  0x00007fffe902d826 in IGC::CodeGen<IGC::OpenCLProgramContext> (
    ctx=ctx@entry=0x7fffffffb360, kernels=...)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/ShaderCodeGen.cpp:1733
#9  0x00007fffe902def9 in IGC::CodeGen (ctx=ctx@entry=0x7fffffffb360, shaders=...)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/ShaderCodeGen.cpp:1831
#10 0x00007fffe8ff7c57 in IGC::CodeGen (ctx=ctx@entry=0x7fffffffb360)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/OpenCLKernelCodeGen.cpp:2472
#11 0x00007fffe8e7140c in TC::TranslateBuildSPMD (pInputArgs=pInputArgs@entry=0x7fffffffcbb0, 
    pOutputArgs=pOutputArgs@entry=0x7fffffffcb10, 
    inputDataFormatTemp=inputDataFormatTemp@entry=TC::TB_DATA_FORMAT_SPIR_V, IGCPlatform=..., 
    profilingTimerResolution=profilingTimerResolution@entry=80, inputShHash=...)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/AdaptorOCL/dllInterfaceCompute.cpp:1298
#12 0x00007fffe8f7a4d0 in IGC::VLD::TranslateBuildSPMDAndESIMD (
    pInputArgs=pInputArgs@entry=0x7fffffffcbb0, pOutputArgs=pOutputArgs@entry=0x7fffffffcb10, 
    inputDataFormatTemp=inputDataFormatTemp@entry=TC::TB_DATA_FORMAT_SPIR_V, IGCPlatform=..., 
    profilingTimerResolution=profilingTimerResolution@entry=80, inputShHash=..., 
    errorMessage="")
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/VISALinkerDriver/VLD.cpp:173
#13 0x00007fffe8e731cd in TC::TranslateBuild (pInputArgs=pInputArgs@entry=0x7fffffffcbb0, 
    pOutputArgs=pOutputArgs@entry=0x7fffffffcb10, inputDataFormatTemp=<optimized out>, 
    IGCPlatform=..., profilingTimerResolution=80)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/AdaptorOCL/dllInterfaceCompute.cpp:1482
#14 0x00007fffe8f619a5 in IGC::IgcOclTranslationCtx<0ul>::Impl::Translate (this=0x555555624ba0, 
    outVersion=<optimized out>, src=<optimized out>, 
    specConstantsIds=specConstantsIds@entry=0x0, 
    specConstantsValues=specConstantsValues@entry=0x0, options=options@entry=0x555555624b40, 
    internalOptions=<optimized out>, tracingOptions=<optimized out>, 
    tracingOptionsCount=<optimized out>, gtPinInput=<optimized out>)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/AdaptorOCL/ocl_igc_interface/impl/igc_ocl_translation_ctx_impl.h:336
#15 0x00007fffe8f62be3 in IGC::IgcOclTranslationCtx<1ul>::TranslateImpl (this=<optimized out>, 
    outVersion=<optimized out>, src=<optimized out>, options=0x555555624b40, 
    internalOptions=0x555555624b70, tracingOptions=0x0, tracingOptionsCount=0)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/AdaptorOCL/ocl_igc_interface/impl/igc_ocl_translation_ctx_impl.cpp:23
#16 0x00007ffff7f08939 in IGC::IgcOclTranslationCtx<1ul>::Translate<IGC::OclTranslationOutput<1ul> > (tracingOptionsCount=0, tracingOptions=0x0, internalOptions=0x555555624b70, 
    options=0x555555624b40, src=<optimized out>, this=0x555555624bd0)
    at /usr/include/igc/ocl_igc_interface/igc_ocl_translation_ctx.h:38
#17 NEO::OfflineCompiler::buildSourceCode (this=0x55555556d0a0)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/offline_compiler.cpp:260
#18 0x00007ffff7f0c359 in NEO::OfflineCompiler::build (this=0x55555556d0a0)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/offline_compiler.cpp:298
#19 0x00007ffff7f51435 in SafetyGuardLinux::call<int, NEO::OfflineCompiler, int (NEO::OfflineCompiler::*)()> (this=this@entry=0x7fffffffe0c0, object=object@entry=0x55555556d0a0, 
    method=<optimized out>, retValueOnCrash=retValueOnCrash@entry=-5152)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/utilities/linux/safety_guard_linux.h:62
#20 0x00007ffff7f51127 in buildWithSafetyGuard (compiler=compiler@entry=0x55555556d0a0)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/utilities/linux/safety_caller_linux.cpp:20
#21 0x00007ffff7ee2787 in oclocInvoke (numArgs=<optimized out>, argv=0x7fffffffe5e8, 
    numSources=0, dataSources=0x0, lenSources=0x0, nameSources=0x0, numInputHeaders=0, 
    dataInputHeaders=0x0, lenInputHeaders=0x0, nameInputHeaders=0x0, numOutputs=0x0, 
    dataOutputs=0x0, lenOutputs=0x0, nameOutputs=0x0)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/ocloc_api.cpp:176
#22 0x000055555555472b in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/src/debug/compute-runtime-22.23.23405/shared/offline_compiler/source/main.cpp:11

Looking around in gdb it appears that hasByValAttr is called on an invalid instance:

(gdb) p *this
$1 = {<llvm::Value> = {VTy = 0x183b1, UseList = 0x7ffff7e5d3b0, SubclassID = 112 'p', 
    HasValueHandle = 0 '\000', SubclassOptionalData = 23 '\027', SubclassData = 21896, 
    NumUserOperands = 21845, IsUsedByMD = 0, HasName = 0, HasMetadata = 0, HasHungOffUses = 0, 
    HasDescriptor = 0, static MaxAlignmentExponent = 32, static MaximumAlignment = 4294967296}, 
  Parent = 0x555556692f20, ArgNo = 1434988144}

(gdb) up
#1  0x00007fffe8ffa9d8 in IGC::COpenCLKernel::CreateKernelArgInfo (
    this=this@entry=0x5555566ec340)
    at /usr/src/debug/intel-graphics-compiler-igc-1.0.11378/IGC/Compiler/CISACodeGen/OpenCLKernelCodeGen.cpp:377

(gdb) p entry->NumArgs
$2 = 5

(gdb) p *(entry->Arguments + 5)
$3 = {<llvm::Value> = {VTy = 0x183b1, UseList = 0x7ffff7e5d3b0, SubclassID = 112 'p', 
    HasValueHandle = 0 '\000', SubclassOptionalData = 23 '\027', SubclassData = 21896, 
    NumUserOperands = 21845, IsUsedByMD = 0, HasName = 0, HasMetadata = 0, HasHungOffUses = 0, 
    HasDescriptor = 0, static MaxAlignmentExponent = 32, static MaximumAlignment = 4294967296}, 
  Parent = 0x555556692f20, ArgNo = 1434988144}

The index value of 5 comes from the count variable near the start of COpenCLKernel::CreateKernelArgInfo:

(gdb) p m_Context->getModuleMetaData()->FuncMD[entry]->m_OpenCLArgAccessQualifiers.size()
$4 = 6

Therefore, my conclusion is that the code is trying to access the 6th element of entry->Arguments but the latter only has 5 elements. As to why this is happening, perhaps someone more familiar with the code in question (and LLVM in general) might be able to figure out why.

I'm using these software versions:

llvm 14.0.6
clang 14.0.6
intel-graphics-compiler 1.0.11378 (also tested current master)
intel-opencl-clang 14.0.0 (commit 8c2aaa2)
spirv-llvm-translator 14.0.0 (commit a16f3db3, also tested current tip of llvm_release_140)

ArchangeGabriel commented 2 years ago

(For the record, it also crashes compilation of darktable opencl kernels)

AGindinson commented 2 years ago

Many thanks for the detailed report! This is indeed LLVM 14 specific (also reproducible with earlier LLVM 14 tags). Apparently, the root cause lies in erroneous SPIR-V generation by OpenCL Clang - kernel functions get duplicated within the SPIR-V module (not the case with pre-LLVM 14).

During SPIR-V -> LLVM translation, the duplicate functions themselves are ignored by the mapping mechanism; however, the metadata analysis is not so clever, so the !{opencl.kernels} metadata receives 2 similar entries for each function. Meanwhile, IGC's kernel argument metadata analysis assumes that each entry refers to a unique function and, while mapping IGC::FunctionMetadata* onto llvm::Function* entities, unknowingly "doubles" each kernel-specific list of argument information. Finally, LLVM's Function::arg_iterator gets advanced onto invalid memory, since the iteration mechanism assumes that the size of IGC::FunctionMetadata*'s vectors (count) is less than or equal to llvm::Function*'s argument count.

In fact, even if it wasn't for out-of-bounds access crashes, we'd still be having a bug with "extra" argument metadata being applied to irrelevant function arguments. This would've probably been harder to catch.
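To make the failure mode described above concrete, here is a minimal, self-contained C++ sketch. It is not IGC or LLVM code: `Function`, `FunctionMetadata`, `KernelEntry`, `collectNaively` and `wouldOverrun` are hypothetical stand-ins, and the only assumption is that argument qualifiers are appended once per kernel metadata entry.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel function; only the argument count matters here.
struct Function {
    std::string name;
    std::size_t numArgs;
};

// Hypothetical stand-in for the per-function argument metadata.
struct FunctionMetadata {
    std::vector<std::string> argAccessQualifiers;
};

// One kernel-list entry: the function it refers to plus its per-argument qualifiers.
using KernelEntry = std::pair<const Function*, std::vector<std::string>>;

// Naive collection: appends the qualifiers of every entry that refers to `f`,
// assuming each entry names a distinct function. Duplicate entries therefore
// make the resulting list longer than the function's argument list.
FunctionMetadata collectNaively(const Function& f,
                                const std::vector<KernelEntry>& entries) {
    FunctionMetadata md;
    for (const KernelEntry& entry : entries) {
        if (entry.first == &f) {
            md.argAccessQualifiers.insert(md.argAccessQualifiers.end(),
                                          entry.second.begin(),
                                          entry.second.end());
        }
    }
    return md;
}

// True when a loop driven by the metadata size would step past the last argument.
bool wouldOverrun(const Function& f, const FunctionMetadata& md) {
    return md.argAccessQualifiers.size() > f.numArgs;
}
```

With two identical entries for a two-argument kernel, the collected list has four elements, so `wouldOverrun` reports the mismatch that a metadata-driven argument loop would otherwise turn into an out-of-bounds access.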

Speaking of IGC functionality, we could implement a simple workaround either by guarding against this "doubling" of the argument information list, or by reworking the llvm::Argument* iteration mechanism so that it does not depend on the size of IGC::FunctionMetadata's collections. However, first I'll try to determine the OCL Clang-level root cause of incorrect SPIR-V generation - after all, it's also specific to the LLVM 14-compatible version of opencl-clang.
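The two workaround ideas mentioned here could look roughly like the following sketch. This is an illustration under stated assumptions, not the change that was eventually merged: `dropDuplicateEntries` and `qualifiersPerArgument` are hypothetical helpers, and `const void*` is merely a placeholder for a function handle.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Option 1 (guard against doubling): keep only the first metadata entry seen
// for each function handle, preserving the original order.
std::vector<const void*> dropDuplicateEntries(const std::vector<const void*>& entries) {
    std::unordered_set<const void*> seen;
    std::vector<const void*> unique;
    for (const void* fn : entries) {
        if (seen.insert(fn).second) {  // insert() reports whether the key was new
            unique.push_back(fn);
        }
    }
    return unique;
}

// Option 2 (argument-driven iteration): the function's argument count bounds
// the loop, so surplus metadata can never push the index out of range.
std::vector<std::string> qualifiersPerArgument(std::size_t numArgs,
                                               const std::vector<std::string>& qualifiers) {
    std::vector<std::string> result(numArgs);  // arguments without metadata stay empty
    const std::size_t limit = std::min(numArgs, qualifiers.size());
    for (std::size_t i = 0; i < limit; ++i) {
        result[i] = qualifiers[i];
    }
    return result;
}
```

Option 2 only prevents the crash; surplus entries are silently ignored, so it does not by itself guarantee that the remaining metadata is the right one for each argument.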

AGindinson commented 2 years ago

After debugging the OCL Clang compilation (LLVM -> SPIR-V conversion, to be more exact), I've come to think we're essentially dealing with the consequences of https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/85815e725ce5bdc970b812b4bbff73d4b2a44046.

AGindinson commented 2 years ago

So it turns out that the actual duplication of kernel argument metadata entries was happening in SPIR-V -> LLVM translation, induced by the difference between the Khronos Translator and our internal SPIR-V Reader copy in IGC/AdaptorOCL/SPIRV (in essence, by the generation of !opencl.kernels module metadata). For upstream translation cases, the "duplicate" metadata entries stemming from entry point kernel wrappers completely override the metadata generated for the actual kernel functions, however in our case, root nodes for all kernels (including the de facto duplicates) get inserted into !opencl.kernels.
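The difference between the two behaviours described above can be modelled with a small C++ sketch. It is only a toy model, not code from either translator: `Node`, `collectOverriding` and `collectAppending` are hypothetical names, and keying on the kernel name is an assumption made purely for illustration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A (kernel name, metadata payload) pair; both fields are illustrative strings.
using Node = std::pair<std::string, std::string>;

// "Override" behaviour: one slot per kernel name, so a later node for the
// same kernel replaces the earlier one.
std::map<std::string, std::string> collectOverriding(const std::vector<Node>& nodes) {
    std::map<std::string, std::string> byName;
    for (const Node& n : nodes) {
        byName[n.first] = n.second;
    }
    return byName;
}

// "Append" behaviour: every node becomes its own root entry, duplicates included.
std::vector<Node> collectAppending(const std::vector<Node>& nodes) {
    std::vector<Node> roots;
    for (const Node& n : nodes) {
        roots.push_back(n);
    }
    return roots;
}
```

Given one node for the actual kernel function and one for its entry point wrapper, the overriding variant ends up with a single entry while the appending variant ends up with two, which is the kind of duplicate list a downstream consumer would then have to cope with.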

I'm currently trying to exclude the possibility for such duplication on the upstream level with https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/1526 - once correctly applied onto IGC's SPIR-V consumer, these changes resolve the issue.

ArchangeGabriel commented 2 years ago

@AGindinson Since you’ve closed that PR, what’s the plan now? This is blocking llvm upgrade in Arch, so if a temporary work-around could be implemented in IGC while the deeper solution is being investigated, that would help us a lot. ;)

AGindinson commented 2 years ago

@ArchangeGabriel Sorry for holding off the update on the matter. I've already prepared an IGC-level workaround and will link the commit once it's merged.

ArchangeGabriel commented 2 years ago

No worries! It’s a good thing that you’re trying to fix things properly, but it’s great too that a workaround is coming. :)

ArchangeGabriel commented 2 years ago

@AGindinson Can confirm it fixes this issue. :) There is another one still crashing the compute-runtime test suite, for which @foutrelis is going to open a new issue once he has the backtrace, but for darktable the current fix seems to be enough (also cc @frantisekz for https://bugzilla.redhat.com/show_bug.cgi?id=2075944) and will allow us to move on with LLVM 14. We lived with tests disabled in compute-runtime because of #204 for ~8 months, so this is not a new situation.

frantisekz commented 2 years ago

Thanks for the ping, @ArchangeGabriel; it's already on the way to the repositories (https://bodhi.fedoraproject.org/updates/FEDORA-2022-c3e3ae48a9)!

AGindinson commented 2 years ago

@ArchangeGabriel @foutrelis @frantisekz FYI, there have been further changes in the handling of kernel metadata. My initial fix for this issue resulted in SPIR-V execution mode information loss with LLVM 14 (e.g. breaking intel_reqd_work_group_size attribute support). The approach has been reworked within https://github.com/intel/intel-graphics-compiler/commit/6a13fa903f380e17378286a7cd43995b0ae162ad - in case LLVM 14 yields runtime failures for some of the tests, retesting with this new commit may give better results.

foutrelis commented 2 years ago

@AGindinson Thanks for highlighting subsequent relevant fixes. We definitely want to include them in our IGC package which is built against LLVM 14.

> in case LLVM 14 yields runtime failures for some of the tests, retesting with this new commit may give better results

I can still repro #250 (for what it's worth 🐭️).

foutrelis commented 2 years ago

Correction: ocloc doesn't crash when building compute-runtime with current IGC master. I have updated #250 accordingly. The single test failure I'm now seeing may or may not be related to this (closed) issue.