Open dseredkin opened 3 years ago
The IR is the expected memset, so this is a backend/expansion problem.
We actually try pretty hard to get this right (although I don't think x86 ever tries overriding the generic expansion by using "rep stos").
We have these knobs:
    MaxStoresPerMemset = 16; // For @llvm.memset -> sequence of stores
    MaxStoresPerMemsetOptSize = 8;
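So the 128-byte limit is not fixed: it is the store-count limit (8 when optimizing for size) times the width of the widest store (16 bytes with SSE). For example, if you compile with "-Os -mavx2", you should see the call expanded with bigger stores, so we go up to 256 bytes using vector stores.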
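To make the arithmetic concrete, here is a minimal sketch of how the store budget translates into a byte threshold. It is a simplified model, not the backend's actual logic: the helper name `max_inline_memset_bytes` is hypothetical, and it assumes every store uses the widest available width.

```c
#include <assert.h>

/* Values quoted from the knobs above. */
enum {
    MAX_STORES_PER_MEMSET = 16,
    MAX_STORES_PER_MEMSET_OPT_SIZE = 8
};

/* Hypothetical helper (not an LLVM API): the largest memset length that
 * fits in the store budget, assuming every store is store_bytes wide. */
static unsigned max_inline_memset_bytes(unsigned max_stores,
                                        unsigned store_bytes) {
    assert(store_bytes > 0);
    return max_stores * store_bytes;
}
```

Under this model, `max_inline_memset_bytes(MAX_STORES_PER_MEMSET_OPT_SIZE, 16)` gives 128 (SSE, 16-byte stores) and `max_inline_memset_bytes(MAX_STORES_PER_MEMSET_OPT_SIZE, 32)` gives 256 (AVX2, 32-byte stores), matching the two limits mentioned above.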
We do not seem to differentiate between -Os and -Oz, so that's a potential enhancement.
There may be different platform expectations about what that does though - I made a minor fix here a long time ago: https://reviews.llvm.org/D11568
Extended Description
Memset is vectorized with -Oz and -Os when the length is equal to 2^N, where N=2..7. There is no such behaviour in gcc, for example. I guess it is okay to vectorize this code with -O3, but for -Oz this shouldn't be done.
Source: void func(int *P) { memset(P, 0, 128); }
Clang's output with Oz (trunk, https://godbolt.org/z/a6vjjxKhz):
    func(int, int):                         # @func(int, int)
            xorps   xmm0, xmm0
            movups  xmmword ptr [rdi + 112], xmm0
            movups  xmmword ptr [rdi + 96], xmm0
            movups  xmmword ptr [rdi + 80], xmm0
            movups  xmmword ptr [rdi + 64], xmm0
            movups  xmmword ptr [rdi + 48], xmm0
            movups  xmmword ptr [rdi + 32], xmm0
            movups  xmmword ptr [rdi + 16], xmm0
            movups  xmmword ptr [rdi], xmm0
            ret
If length > 128 with Oz/Os, then we generate this:
    func(int, int):                         # @func(int, int)
            mov     edx, 256
            xor     esi, esi
            jmp     memset@PLT              # TAILCALL
For gcc with Os the output is the same for any length (see https://godbolt.org/z/1shqe319r):
    func(int*, int):
            mov     ecx, X                  <-- X is the length
            xor     eax, eax
            rep stosd
            ret
So we expect that with -Os and -Oz we don't vectorize, and instead generate the same code as for the case with length > 128.
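For anyone who wants to compare both sides of the threshold in one file, a small reproducer could look like the sketch below. The function names are made up for illustration; the lengths 128 and 256 are the ones used in this report.

```c
#include <string.h>

/* Hypothetical names. Per the report, at -Os/-Oz the 128-byte case is
 * expanded into inline vector stores on x86-64. */
void clear_128(int *p) { memset(p, 0, 128); }

/* Per the report, the 256-byte case becomes a tail call to memset. */
void clear_256(int *p) { memset(p, 0, 256); }
```

Compiling this with `clang -Oz -S` and again with `clang -Oz -mavx2 -S` should show whether each call is expanded inline or left as a call to memset.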