Open llvmbot opened 3 years ago
It seems likely that the change https://reviews.llvm.org/D85604 (which is part of the proposal Matt mentioned) will fix this -- at least, if __activemask is marked convergent.
In general, modeling this correctly requires this proposal: http://lists.llvm.org/pipermail/llvm-dev/2020-August/144165.html
Extended Description
CUDA has an intrinsic called __activemask() that populates a 32-bit variable with a bitmask indicating which threads are executing the current instruction. This is sensitive to branching behavior; in CUDA’s SIMT model, when threads diverge at a branch, one side of the branch will be executed (with other threads being masked off), and then the other branch path will be executed. If 32 threads execute the following code, the first thread should print a bitmask containing only that thread, and the others should print a bitmask containing all other threads:

Correct output (compiled using nvcc -O3):
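A minimal sketch of a kernel that fits this description, with the mask values the description implies noted in comments. The kernel name, launch configuration, and print format are assumptions for illustration, not the reporter's original listing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel name. Each side of the branch calls __activemask()
// separately, so each call should see only the lanes on its own path.
__global__ void print_active_mask() {
    unsigned mask;
    if (threadIdx.x == 0) {
        // Only lane 0 takes this path; the description implies 0x00000001.
        mask = __activemask();
    } else {
        // Lanes 1..31 take this path; the description implies 0xfffffffe.
        mask = __activemask();
    }
    printf("thread %2u: mask = 0x%08x\n", threadIdx.x, mask);
}

int main() {
    // One block of 32 threads, i.e. a single warp.
    print_active_mask<<<1, 32>>>();
    // Wait for the kernel so that device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```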
Incorrect output (compiled using clang-10 -O1):
I am compiling using the following invocation:
Before running SimplifyCFGPass, the IR has an __activemask() call separately in each branch:
After SimplifyCFGPass, activemask has been hoisted to execute before any branch:
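In source terms, the effect of this hoisting is roughly equivalent to the rewrite sketched below. This is an illustration of the transformation, not compiler output, and the kernel name is hypothetical. Because the call now runs before any divergence, every lane would presumably report the same mask instead of a per-path one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical source-level equivalent of the kernel after hoisting.
__global__ void print_active_mask_hoisted() {
    // The single remaining call executes before the branch, while all
    // launched lanes are still active together.
    unsigned mask = __activemask();
    // Both branch bodies became identical once the call was hoisted,
    // so the branch on threadIdx.x == 0 no longer affects the mask.
    printf("thread %2u: mask = 0x%08x\n", threadIdx.x, mask);
}

int main() {
    // One block of 32 threads, i.e. a single warp.
    print_active_mask_hoisted<<<1, 32>>>();
    // Wait for the kernel so that device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```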
The same behavior happens when trying to use inline assembly instead of a function call. There seems to be no way to indicate to the compiler that we do not want the activemask instruction to be reordered around branches, and specifying memory and control code clobbers does not prevent this behavior:
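A sketch of what such an inline-assembly variant might look like when built with clang. The helper and kernel names are hypothetical, and the clobber list assumes that "control code" refers to the GCC-style "cc" clobber. The activemask.b32 instruction is the PTX counterpart of __activemask().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper wrapping the PTX activemask instruction.
__device__ __forceinline__ unsigned active_mask_asm() {
    unsigned mask;
    // volatile plus "memory" and "cc" clobbers (the "cc" reading is an
    // assumption); per the report, none of this prevents the hoisting.
    asm volatile("activemask.b32 %0;" : "=r"(mask) : : "memory", "cc");
    return mask;
}

__global__ void print_active_mask_asm() {
    unsigned mask;
    if (threadIdx.x == 0) {
        mask = active_mask_asm();  // intended: only lane 0 active
    } else {
        mask = active_mask_asm();  // intended: lanes 1..31 active
    }
    printf("thread %2u: mask = 0x%08x\n", threadIdx.x, mask);
}

int main() {
    // One block of 32 threads, i.e. a single warp.
    print_active_mask_asm<<<1, 32>>>();
    // Wait for the kernel so that device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```

Since the two asm statements are textually identical and have the same operand types, the optimizer can treat them as the same operation on both paths, which is consistent with the merging described above.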
This hoisting optimization is safe for all CPU instructions, but isn’t necessarily safe for the SIMT model of execution, and it seems that there is no way to denote instructions or function calls that should not be hoisted. Maybe this behavior could be disabled when compiling for PTX, or a new attribute could be added to mark branch-dependent code.
Our current workaround is to replace __activemask() with a macro that uses an opaquely defined structure, whose name depends on the current source line number, to ensure that the types of each inline assembly block are unique, and thus not subject to merging.
However, this is not optimal, particularly if used in a short function that gets inlined into other branchy code.
This problem is related to bug #35249. Apologies if this is considered a duplicate, but we decided to file a new bug since that ticket was primarily focused on a different (resolved) issue, and we are adding new information.