llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

memset with length 2^N where N=2..7 is vectorized even with -Oz enabled #51196

Open dseredkin opened 3 years ago

dseredkin commented 3 years ago
Bugzilla Link 51854
Version trunk
OS Windows NT
CC @topperc,@RKSimon,@phoebewang,@rotateright

Extended Description

memset is expanded into vector stores under -Oz and -Os when the length is 2^N, where N=2..7. gcc, for example, does not do this. Vectorizing this code is probably fine at -O3, but it shouldn't be done at -Oz.

Source:

    void func(int *P) {
      memset(P, 0, 128);
    }

Clang's output with Oz (trunk, https://godbolt.org/z/a6vjjxKhz):

    func(int, int):                         # @func(int, int)
            xorps   xmm0, xmm0
            movups  xmmword ptr [rdi + 112], xmm0
            movups  xmmword ptr [rdi + 96], xmm0
            movups  xmmword ptr [rdi + 80], xmm0
            movups  xmmword ptr [rdi + 64], xmm0
            movups  xmmword ptr [rdi + 48], xmm0
            movups  xmmword ptr [rdi + 32], xmm0
            movups  xmmword ptr [rdi + 16], xmm0
            movups  xmmword ptr [rdi], xmm0
            ret

If length > 128 with Oz/Os, then we generate this:

    func(int, int):                         # @func(int, int)
            mov     edx, 256
            xor     esi, esi
            jmp     memset@PLT              # TAILCALL

For gcc with Os the output is the same for any length (see https://godbolt.org/z/1shqe319r):

    func(int*, int):
            mov     ecx, X                  <-- X is the length
            xor     eax, eax
            rep stosd
            ret

So with the Os and Oz flags we expect no vectorization, and the same code as for the case with length > 128.

rotateright commented 3 years ago

The IR is the expected memset, so this is a backend/expansion problem.

We actually try pretty hard to get this right (although I don't think x86 ever tries overriding the generic expansion by using "rep stos").

We have these knobs:

    MaxStoresPerMemset = 16; // For @llvm.memset -> sequence of stores
    MaxStoresPerMemsetOptSize = 8;

So the 128-byte limit is not fixed. For example, if you compile with "-Os -mavx2", you should see the call expanded with bigger stores, so we go up to 256 bytes using vector stores.
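As a rough illustration of why the cutoff moves with the store width, the arithmetic can be sketched in C. This is a simplified model, not the backend's actual lowering logic: it assumes every store has the same width (16 bytes for SSE, 32 bytes for AVX2), and the helper names `max_inline_memset_bytes` and `expands_inline` are hypothetical. Only the two limit values are taken from the knobs quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limits quoted in this thread; everything else is illustrative. */
enum {
    MAX_STORES_PER_MEMSET = 16,
    MAX_STORES_PER_MEMSET_OPT_SIZE = 8
};

/* Hypothetical helper: largest memset length that fits in the store
   budget when every store is store_width bytes wide. */
static size_t max_inline_memset_bytes(size_t max_stores, size_t store_width)
{
    return max_stores * store_width;
}

/* Hypothetical helper: true if a memset of len bytes needs no more than
   max_stores stores of store_width bytes (a partial tail counts as one
   extra store in this simplified model). */
static bool expands_inline(size_t len, size_t max_stores, size_t store_width)
{
    assert(store_width > 0);
    size_t stores_needed = (len + store_width - 1) / store_width;
    return stores_needed <= max_stores;
}
```

Under these assumptions, `max_inline_memset_bytes(MAX_STORES_PER_MEMSET_OPT_SIZE, 16)` gives 128 and the same call with a width of 32 gives 256, which matches the limits described above. A 256-byte memset does not fit the optsize budget with 16-byte stores but does with 32-byte stores.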

We do not seem to differentiate between -Os and -Oz, so that's a potential enhancement.

There may be different platform expectations about what that does though - I made a minor fix here a long time ago: https://reviews.llvm.org/D11568