The loop vectorizer can generate very inefficient IR for this code at `-Os`:
The induction variable has been vectorized with a VF of 16, but the body of the loop is done with lots of repeated scalar code. The code emitted at `-O2` and `-O3` is much smaller.
This affects at least `--target=aarch64-none-elf`, `--target=armv8a-none-eabihf` and `--target=x86_64-unknown-linux-gnu`.
This was found by a fuzzer (because a variation on it triggered an unrelated miscompilation in the ARM backend), not real-world code, but I think the code looks "normal" enough to consider this a missed optimisation bug.