In the code at the Godbolt link, clang generates four ldrb instructions to load W[0]. In comparison, GCC 12.1 generates a single ldr for the same test case.
The reason for the unoptimized code is the simplification of the arguments of fshl. The LLVM IR of the test case is in this Godbolt link. The IR of interest is:
Instead, if we change the first argument of fshl to %or10, the optimal assembly is generated.
Test case
LLVM IR for the test case
Assembly generated by armv8-a clang for the test case
Assembly generated after making both arguments of fshl the same
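For readers less familiar with fshl, the following is a minimal C sketch of the funnel-shift-left semantics as described in the LLVM Language Reference. It is not taken from the test case: the names fshl32 and rotl32 are hypothetical and used only for illustration. The point it shows is that when both value operands of fshl are the same, the operation is a plain rotate, which is presumably why making both arguments %or10 lets the backend produce the better code.

```c
#include <stdint.h>

/* Sketch of llvm.fshl.i32 semantics (illustrative helper, not from the test case):
   concatenate hi:lo into a 64-bit value, shift left by (shift mod 32),
   and return the upper 32 bits. */
static uint32_t fshl32(uint32_t hi, uint32_t lo, uint32_t shift)
{
    shift &= 31u;              /* the shift amount is taken modulo the bit width */
    if (shift == 0)
        return hi;             /* avoids an undefined 32-bit shift by 32 below */
    return (hi << shift) | (lo >> (32u - shift));
}

/* Rotate left: a funnel shift whose two value operands are the same. */
static uint32_t rotl32(uint32_t x, uint32_t shift)
{
    return fshl32(x, x, shift);
}
```

With these definitions, rotl32(0x12345678, 8) evaluates to 0x34567812, while fshl32(0xAABBCCDD, 0x11223344, 8) evaluates to 0xBBCCDD11. Only the first form is a rotate, because the bits shifted in come from the same value that is being shifted out.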