Unrolling loops leads to sub-optimal code generation in ARM Cortex A72

llvmbot commented 4 years ago


Bugzilla Link	46999
Version	10.0
OS	Linux
Attachments	string copy source code
Reporter	LLVM Bugzilla Contributor
CC	@efriedma-quic,@zygoloid

Extended Description

Summary: LLVM 10.0.0.1 unrolls two loops in a string copy function when the -mcpu=cortex-a72 (ARM) option is specified during compilation, and the unrolled code can perform unaligned memory accesses. When this option is not specified, the compiler does not unroll the loops and the generated code does not contain any unaligned accesses.

Build options and other details for reproduction:
LLVM: clang version 10.0.0.1
Arch: ARM cortex-a72
Optimization options used: -fno-builtin --target=arm64 -mcpu=cortex-a72 -ffixed-x18 -std=c11 -nostdlibinc -nostdinc++ -ftls-model=local-exec -fno-builtin -fno-strict-aliasing -mno-implicit-float -O2 -w
(These build options are specific to our workload and are required during compilation.)

Source code and code gen: The source is a basic strncpy function and is attached to this bug. The following is the assembly generated when compiled with the -mcpu=cortex-a72 option:

    .p2align        4
    .type   strncpy,@function
    strncpy:
        cbz     w2, .LBB0_10
        mov     x8, x0
    .LBB0_2:
        ldrb    w9, [x1]
        cbz     w9, .LBB0_4
        add     x1, x1, #1
        subs    w2, w2, #1
        strb    w9, [x8], #1
        b.ne    .LBB0_2
        b       .LBB0_10
    .LBB0_4:
        sub     w9, w2, #1
        tst     w2, #0x3
        b.eq    .LBB0_8
        mov     w10, wzr
        and     w11, w2, #0x3
    .LBB0_6:
        strb    wzr, [x8], #1
        add     w10, w10, #1
        cmp     w11, w10
        b.ne    .LBB0_6
        sub     w2, w2, w10
    .LBB0_8:
        cmp     w9, #3
        b.lo    .LBB0_10
    .LBB0_9:
        subs    w2, w2, #4
        str     wzr, [x8], #4
        b.ne    .LBB0_9
    .LBB0_10:
        ret
    .Lfunc_end0:
        .size   strncpy, .Lfunc_end0-strncpy

In the assembly sequence above, note the 4-byte store “str wzr, [x8], #4”, which could be to an unaligned memory location.
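
To make the alignment concern concrete, here is a rough C model of the unrolled zero-fill path (.LBB0_4 to .LBB0_9). It is only a sketch: the function name is hypothetical, and memset() of 4 bytes stands in for the word store. The point it illustrates is that the remainder loop is driven by the remaining count (n & 3), not by the destination address, so the alignment of the 4-byte stores depends entirely on where the destination pointer happens to be.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of the unrolled zero-fill (.LBB0_4 to .LBB0_9).
 * Returns how many 4-byte stores start at an address that is not
 * 4-byte aligned. The function name is hypothetical. */
unsigned zero_fill_unrolled(unsigned char *p, unsigned n)
{
    assert(p != NULL);
    unsigned unaligned_words = 0;
    unsigned rem = n & 3u;              /* and w11, w2, #0x3 */

    /* Remainder loop (.LBB0_6): n % 4 single-byte stores. */
    for (unsigned i = 0; i < rem; i++)
        *p++ = 0;
    n -= rem;

    /* Main loop (.LBB0_9): one 4-byte store per iteration. */
    while (n != 0) {
        if (((uintptr_t)p & 3u) != 0)
            unaligned_words++;
        memset(p, 0, 4);                /* stands in for: str wzr, [x8], #4 */
        p += 4;
        n -= 4;
    }
    return unaligned_words;
}
```

In this model, filling 8 bytes starting one byte past a 4-byte-aligned address performs two word stores, both unaligned, while filling 7 bytes from the same address performs one aligned word store after three byte stores.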

Without the -mcpu=cortex-a72 option, the compiler generates the following assembly sequence:

    .p2align        2
    .type   strncpy,@function
    strncpy:
        cbz     w2, .LBB0_5
        mov     x8, x0
    .LBB0_2:
        ldrb    w9, [x1]
        cbz     w9, .LBB0_4
        add     x1, x1, #1
        subs    w2, w2, #1
        strb    w9, [x8], #1
        b.ne    .LBB0_2
        b       .LBB0_5
    .LBB0_4:
        add     w9, w9, #1
        cmp     w2, w9
        strb    wzr, [x8], #1
        b.ne    .LBB0_4
    .LBB0_5:
        ret
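
For readers without the attachment, both listings are consistent with a C function of roughly the following shape. This is a reconstruction from the assembly, not the attached source: the name my_strncpy (chosen to avoid clashing with the libc symbol) and the unsigned int length (inferred from the use of w2) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the attached source; the name and the
 * unsigned int length are assumptions inferred from the assembly. */
char *my_strncpy(char *dst, const char *src, unsigned int n)
{
    assert(dst != NULL && src != NULL);
    char *d = dst;

    /* Copy loop (.LBB0_2): copy bytes until n runs out or a NUL is seen. */
    while (n != 0 && *src != '\0') {
        *d++ = *src++;
        n--;
    }

    /* Zero-fill loop (.LBB0_4): pad the rest of the buffer with NULs. */
    while (n != 0) {
        *d++ = '\0';
        n--;
    }
    return dst;
}
```

The second loop is the one that matters here: it is the zero-fill loop that gets unrolled into 4-byte stores when -mcpu=cortex-a72 is given.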

Observations: After some debugging of the unroll pass in LLVM, I noticed that the -mcpu=cortex-a72 option uses the model file for the ARM Cortex-A57, which in turn overrides some default (generic ARM) values related to the loop buffer. Without the cortex-a72 option, the getUnrollingPreferences() function in BasicTTIImpl.h sees the default value of “LoopMicroOpBufferSize” (which is 0) and returns control to LoopUnrollPass.cpp, where "UP.Runtime" defaults to false, so the loop does not get unrolled. With the -mcpu=cortex-a72 option, “LoopMicroOpBufferSize” is overridden to 16; as a result, the function eventually sets "UP.Runtime" to true and the loop gets unrolled.
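
The decision described above can be summarized with a small C model. This is a deliberately simplified sketch, not LLVM's code: the struct and function names are hypothetical, and only the "UP.Runtime" flag mentioned in this report is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the unrolling preferences;
 * this is not LLVM's type. */
struct unroll_prefs {
    bool runtime;   /* models UP.Runtime */
};

/* Models the behavior described for getUnrollingPreferences():
 * the caller passes in preferences that already hold the defaults. */
void model_get_unrolling_preferences(unsigned loop_micro_op_buffer_size,
                                     struct unroll_prefs *up)
{
    assert(up != NULL);

    /* Default case (LoopMicroOpBufferSize == 0): leave the caller's
     * defaults untouched, so runtime unrolling stays off. */
    if (loop_micro_op_buffer_size == 0)
        return;

    /* Non-zero size (16 with -mcpu=cortex-a72): enable runtime unrolling. */
    up->runtime = true;
}
```

Calling the model with 0 leaves runtime at its default of false, while calling it with 16 flips it to true, which matches the two builds compared in this report.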

Possible Solution(s): Disabling loop unrolling with the -mcpu=cortex-a72 option results in no unrolling and the assembly resembles that of the case without this option.

However, we would like to know whether the default “LoopMicroOpBufferSize” could be changed for cortex-a72 specifically, or whether there is any other workaround that can be made in the LLVM source instead of explicitly using flags or other options.
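
As a sketch of the "disable loop unrolling" option, Clang's loop pragma can turn unrolling off for a single loop in the source rather than for the whole translation unit. The helper name below is hypothetical, and I have not verified that this reproduces the exact assembly of the build without -mcpu=cortex-a72; other compilers may ignore the pragma with a warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the zero-fill part of the string copy, with
 * unrolling disabled for this one loop via Clang's loop pragma. */
void zero_fill_no_unroll(char *d, unsigned int n)
{
    assert(d != NULL || n == 0);

#pragma clang loop unroll(disable)
    while (n != 0) {
        *d++ = '\0';
        n--;
    }
}
```

This keeps the change local to the loop in question, at the cost of annotating the source.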

llvmbot commented 4 years ago

@Eli, thank you for the suggestions. What about code size?

efriedma-quic commented 4 years ago

If you're targeting some environment where unaligned access doesn't work for whatever reason (e.g. the cache is disabled), you can pass -mno-unaligned-access to clang.

Otherwise, I'm not sure why you think the transform is wrong; unaligned accesses are generally pretty fast on a cortex-a72.

llvm / llvm-project

Unrolling loops leads to sub-optimal code generation in ARM Cortex A72 #46343