memset with length 2^N where N=2..7 is vectorized even with -Oz enabled

llvm / llvm-project


Bugzilla Link	51854
Version	trunk
OS	Windows NT
CC	@topperc,@RKSimon,@phoebewang,@rotateright

Extended Description

Memset is vectorized with the -Oz and -Os flags when the length is equal to 2^N, where N=2..7. gcc, for example, does not do this. I guess it is okay to vectorize this code with -O3, but it shouldn't be done for -Oz.

Source:

    void func(int *P) {
      memset(P, 0, 128);
    }

Clang's output with Oz (trunk, https://godbolt.org/z/a6vjjxKhz):

    func(int, int):                         # @func(int, int)
        xorps   xmm0, xmm0
        movups  xmmword ptr [rdi + 112], xmm0
        movups  xmmword ptr [rdi + 96], xmm0
        movups  xmmword ptr [rdi + 80], xmm0
        movups  xmmword ptr [rdi + 64], xmm0
        movups  xmmword ptr [rdi + 48], xmm0
        movups  xmmword ptr [rdi + 32], xmm0
        movups  xmmword ptr [rdi + 16], xmm0
        movups  xmmword ptr [rdi], xmm0
        ret

If length > 128 with Oz/Os, then we generate this:

    func(int, int):                         # @func(int, int)
        mov     edx, 256
        xor     esi, esi
        jmp     memset@PLT                  # TAILCALL

For gcc with Os the output is the same for any length (see https://godbolt.org/z/1shqe319r):

    func(int*, int):
        mov     ecx, X      <-- X is the length
        xor     eax, eax
        rep stosd
        ret

So we expect that with the -Os and -Oz flags we don't vectorize, and instead generate the same code as for the case with length > 128.

The IR is the expected memset, so this is a backend/expansion problem.

We actually try pretty hard to get this right (although I don't think x86 ever tries overriding the generic expansion by using "rep stos").

We have these knobs:

    MaxStoresPerMemset = 16; // For @llvm.memset -> sequence of stores
    MaxStoresPerMemsetOptSize = 8;

So the 128-byte limit is not fixed: it is 8 stores of 16 bytes each. For example, if you compile with "-Os -mavx2", you should see the call expanded with bigger (32-byte) stores, so we go up to 256 bytes using vector stores.
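
To make the arithmetic behind that limit concrete, here is a minimal sketch in C. It is only a simplified model, not LLVM's actual expansion logic: the function names are hypothetical, and it assumes every store uses the widest available vector size (16 bytes for SSE, 32 bytes for AVX2), whereas the real lowering can mix store sizes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Value quoted above for size-optimized functions. */
enum { MAX_STORES_PER_MEMSET_OPT_SIZE = 8 };

/* Largest memset length (in bytes) that fits in max_stores stores of
   store_bytes each. Hypothetical helper, not an LLVM function. */
size_t inline_memset_limit(size_t max_stores, size_t store_bytes) {
    return max_stores * store_bytes;
}

/* Simplified model: expand inline when the length needs no more than
   max_stores stores of the widest available size. */
bool expands_inline(size_t len, size_t max_stores, size_t store_bytes) {
    size_t stores_needed = (len + store_bytes - 1) / store_bytes; /* round up */
    return stores_needed <= max_stores;
}
```

Under these assumptions, inline_memset_limit(MAX_STORES_PER_MEMSET_OPT_SIZE, 16) gives the 128-byte threshold seen with SSE stores, and passing 32 as the store width gives the 256-byte threshold described for "-Os -mavx2".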

We do not seem to differentiate between -Os and -Oz, so that's a potential enhancement.
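
One way to picture that enhancement is a third, tighter limit used only for -Oz. The sketch below is purely illustrative C, not LLVM code: the enum, the function name, and especially the value 4 for the minimum-size case are assumptions made up for this example; only 16 and 8 come from the knobs quoted above.

```c
#include <stddef.h>

/* Optimization goals as modeled by this sketch (hypothetical names). */
typedef enum { OPT_SPEED, OPT_SIZE, OPT_MIN_SIZE } opt_goal;

/* Pick the store-count limit for expanding memset inline. */
size_t max_stores_per_memset(opt_goal goal) {
    switch (goal) {
    case OPT_SPEED:    return 16; /* MaxStoresPerMemset */
    case OPT_SIZE:     return 8;  /* MaxStoresPerMemsetOptSize (-Os) */
    case OPT_MIN_SIZE: return 4;  /* hypothetical tighter limit for -Oz */
    }
    return 16; /* fallback for out-of-range values */
}
```

With a separate -Oz limit like this, a 128-byte memset using 16-byte stores would need 8 stores, exceed the hypothetical limit of 4, and fall back to the library call, while -Os behavior would stay as it is today.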

There may be different platform expectations about what that does though - I made a minor fix here a long time ago: https://reviews.llvm.org/D11568

memset with length 2^N where N=2..7 is vectorized even with -Oz enabled #51196