Closed sdh4 closed 3 years ago
I can confirm that (after working around a few POCL bugs) the code executes on POCL without tripping valgrind. That means it is probably not a memory corruption problem from the surrounding code.
Separately, the errors appear to come in groups of 8 work items, on boundaries divisible by 8. I may be able to provide a test case, but I'm not sure how easy it will be to shrink it down to something simple.
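One way to check the grouping pattern is to bucket the failing work-item indices into aligned blocks of 8 and see whether each affected block fails completely. The following is only a minimal sketch: the function names `group_failures` and `fully_failed_blocks` and the sample indices are hypothetical, not taken from the actual run.

```python
from collections import defaultdict


def group_failures(failing_ids, block=8):
    """Bucket failing global work-item IDs into aligned blocks of `block` items.

    Returns a dict mapping each block's start index to the sorted list of
    failing IDs that fall inside that block.
    """
    groups = defaultdict(list)
    for gid in sorted(set(failing_ids)):
        groups[(gid // block) * block].append(gid)
    return dict(groups)


def fully_failed_blocks(failing_ids, block=8):
    """Return the start indices of blocks in which every work item failed."""
    groups = group_failures(failing_ids, block)
    return [start for start, ids in groups.items() if len(ids) == block]


if __name__ == "__main__":
    # Hypothetical failing IDs, for illustration only.
    sample = list(range(16, 24)) + list(range(40, 48)) + [51]
    print(group_failures(sample))
    print(fully_failed_blocks(sample))
```

If most failures fill whole aligned blocks, that would be consistent with something happening per group of 8 lanes rather than per work item, although that interpretation is speculative.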
This issue looks like a problem on the IGC side, so I'm transferring it to the IGC project: https://github.com/intel/intel-graphics-compiler
I can confirm that this problem still occurs under Fedora 33 (intel-opencl-20.47.18513-1.fc33.x86_64; intel-igc-opencl-1.0.5585-1.fc33.x86_64; llvm-11.0.0-1.fc33.x86_64; clang-11.0.0-2.fc33.x86_64). Hardware is HD Graphics 620 (rev 02) (via lspci).
Any suggestions on environment variables/compilation flags to troubleshoot? Perhaps disabling certain forms of optimization?
I did eventually track this down. It seems to have originated from a library version mismatch: old, manually compiled libraries were being dynamically linked/loaded in place of the correct ones from the RPMs.
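For anyone hitting something similar, a quick way to see which copies of the libraries a process actually loaded on Linux is to read its `/proc/<pid>/maps`. The sketch below is an illustration under assumptions: the helper name `loaded_libraries` is hypothetical, and the keyword list (`opencl`, `igc`, `igdrcl`) is a guess at the relevant library names that may need adjusting for a given install.

```python
import os


def loaded_libraries(maps_text, keywords=("opencl", "igc", "igdrcl")):
    """Return sorted paths of mapped files whose basename contains a keyword.

    `maps_text` is the content of a /proc/<pid>/maps file. The default
    keywords are an assumption about which library names matter.
    """
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        if not path.startswith("/"):
            continue  # pseudo-entries such as [heap] or [stack]
        name = os.path.basename(path).lower()
        if any(k in name for k in keywords):
            found.add(path)
    return sorted(found)


if __name__ == "__main__":
    # Linux only. This inspects the current Python process; for a real
    # application, read /proc/<pid>/maps of that process instead.
    with open("/proc/self/maps") as fh:
        for path in loaded_libraries(fh.read()):
            print(path)
```

Paths outside the directories owned by the RPMs (for example under `/usr/local/lib`) would be the ones to look at first.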
I am running into a problem where a store operation in the generated code does not seem to take effect, leading to incorrect output.
Interestingly, using POCL (which is presumably based on the same clang/llvm) gives correct output.
This is on Fedora 32, with intel-opencl-20.47.18513-1.fc32.x86_64 from the copr repository, clang-10.0.1-3, and llvm-10.0.1-4.
I've surrounded the problematic line with prints that illustrate the problem. In this case b[2] is being scaled. It reads correctly when accessed as b[pivots[row]], where pivots[row] is 2, but does not read correctly when accessed as b[2]. Later accesses to the same memory read back the original value, which suggests that the updated value was never stored. Here is the code:
and here is the output running on NEO:
Output running on the CPU via POCL:
This looks like it could be quite tricky to troubleshoot. It's always possible that there is some nearby bug in my code causing undefined behavior. Is this worth trying to track down more deeply?