Another `cmov_int16` proposal

This proposal does not conditionally load a value, which could be less future proof. Indeed, having only one "virtual branch" loading a value is what could cause a compiler to create a branch, as it can avoid a load from memory given a condition - which is usually considered as a profitable optimization.

It's not only a speculation as this is what seemed to have originally happened with clang 18. Let's indeed look at clang's optimization pipeline using compiler explorer's very nice viewer on the original poly_frommsg function (https://godbolt.org/z/eYrM3scbe):

before the X86 cmov conversion pass, we actually have "virtual" cmov instructions:
after the X86 cmov conversion pass, we have the introduction of branches:

Let's look at what this pass is doing: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L10-L31. I think we hit all the conditions in that case:

we have cmovs in a loop
it is considered as profitable to do the transformation, because I think it manages to save a mov to a register (see the comments in checkForProfitableCmovCandidates: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L421-L443 that give more details). It would be interesting to debug that analysis to really understand what is considered more profitable here, as the generated assembly doesn't look very efficient to me:

What we can note though is that if it figures out one of the "branch" of the cmov has a slower latency than the other, it will generate a branch. This is why I think the original patch is asking for troubles, and that we should avoid exposing such patterns.

As to when this has been introduced in Clang, for what I've checked clang 15 is already doing that optimization, so it's been "a while". Checking LLVM's git log could help answer that question more precisely if that's of interest.

This patch also defines the cmov_int16 function as noinline. Indeed, Link-Time Optimization (LTO) could inline that function: the compiler could then realize the select argument is just one bit, which could then generate a select from the LLVM IR point of view, and we could then have the x86 cmov conversion pass be tempted to generate a branch again.

As an idea for the future, we could verify codegen in a test by verifying that the code generated by the compiler does not contain any branch. This could be done by using LLVM's FileCheck utility - which is basically an enhanced grep that can reason with context. The test would basically check that in the general textual assembly, we don't have any test or branch instructions (depending on the targeted architecture). We could also check that on the LLVM IR, but: 1) that would only work for clang-based compilers 2) LLVM's codegen is the one generating a branch here (as stated above), so we wouldn't catch the issue.

This test could be run only if that tool is available in the environment, so that we don't make the build system more complex for everyone. I can make a first tentative if that's something of interest to the maintainers!

pq-crystals / kyber

Another `cmov_int16` proposal #78