[clang] Poor performance because gather and scatter operations are emitted by the compiler targeting AVX512 or at level 3 of optimization

wolfpld commented 5 months ago

Using -march=native and/or -O3 compilation flags can result in a significantly (more than 2x) slower executable.

The code I am seeing problems with is https://github.com/wolfpld/etcpak, specifically https://github.com/wolfpld/etcpak/blob/master/bcdec.h.

Reproducing the results is a bit involved and requires a specific image file I cannot share, but any other image will suffice, provided it is large enough to produce long enough computation times. It may be necessary to use an image with an alpha channel, as different encoding modes (and code paths) are used when alpha is present.

To build the program and get the results, you need to do the following:

% meson setup build --buildtype=release
% cd build
% ninja
# any reasonably large image file will do instead of 16k.png, this only needs to be done once to get 16k.dds
% ./etcpak -c bc7 -h dds ~/16k.png 16k.dds
% ./etcpak -v -b 16k.dds 
Median decode time for 9 runs: 3483.195 ms (77.066 Mpx/s)

The 77 Mpx/s figure is the decode throughput, which I use as the measure of performance. The meson setup command above configures the compiler to use -O3 and -march=native. I am running on an i7-1185G7, which supports AVX512.

Building the program with meson setup build --buildtype=release --optimization=2, which lowers the optimization level to -O2, results in 2274.264 ms (118.032 Mpx/s), which is not what you would normally expect from lowering the optimization level.

Similarly, changing the -march=native parameter to -march=skylake (which requires modifying the meson.build file) results in 2776.365 ms (96.686 Mpx/s). The Skylake ISA doesn't support AVX512.

Interestingly, building with both -march=skylake and -O2 results in 1693.646 ms (158.496 Mpx/s). This is twice the speed of the -march=native + -O3 build.

The behavior is also reproducible on a Ryzen 7950X (Zen4, another AVX512-enabled uarch), where -march=native + -O3 results in 170 Mpx/s and -march=skylake + -O3 results in 190 Mpx/s.
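
As a sanity check on the i7-1185G7 figures above, the reported throughputs can be recomputed from the median decode times. This is only a sketch: the image dimensions are never stated in the thread, so the 16384 x 16384 size below is an assumption suggested by the `16k.png` file name, and `mpx_per_s`, `speedup` and `print_summary` are hypothetical helper names, not part of etcpak.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in megapixels per second for an image decoded in `ms` milliseconds. */
static double mpx_per_s(double width, double height, double ms)
{
    assert(ms > 0.0);
    return (width * height) / (ms * 1000.0);
}

/* Ratio of two median decode times; a value above 1 means `fast_ms` is faster. */
static double speedup(double slow_ms, double fast_ms)
{
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}

/* Prints throughput and speedup relative to the -O3 -march=native build. */
static void print_summary(void)
{
    /* ASSUMPTION: square 16384 x 16384 image; not stated in the report. */
    const double side = 16384.0;
    /* Median decode times quoted in the report (i7-1185G7). */
    const struct { const char *flags; double ms; } runs[] = {
        { "-O3 -march=native",  3483.195 },
        { "-O2 -march=native",  2274.264 },
        { "-O3 -march=skylake", 2776.365 },
        { "-O2 -march=skylake", 1693.646 },
    };
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-20s %8.3f Mpx/s  x%.2f\n", runs[i].flags,
               mpx_per_s(side, side, runs[i].ms),
               speedup(runs[0].ms, runs[i].ms));
}
```

Calling `print_summary()` from a `main` prints one line per configuration. Under the stated size assumption the recomputed throughputs agree with the reported ones, and the last row gives the roughly 2x ratio mentioned above.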

In the case of -march=native + -O3, the first problematic place in the code is lines 1171-1173. The compiler emits a lot of gather and scatter instructions, which are microcoded and have high latency.

![image](https://github.com/llvm/llvm-project/assets/600573/854923d9-3dd0-4b7c-b98e-6dc2dc928989)

Here are example measurements for one of the gather instructions:

![image](https://github.com/llvm/llvm-project/assets/600573/ee82f8ed-3759-4cb4-912d-476aff9bf58b)

Another problematic place is at line 1225, where a series of gather operations is emitted (four gathers are emitted, but I only show one here).

![image](https://github.com/llvm/llvm-project/assets/600573/ee8ff4c5-f6ea-4cd1-8214-1b7c0dd2c799)

With -march=native and -O2, the first place mentioned above still emits gathers and scatters, and takes a larger percentage of the runtime.

![image](https://github.com/llvm/llvm-project/assets/600573/47a54eee-dd54-4f6f-b730-e46451933981)

The second code fragment is now emitted as a series of scalar operations, which causes it to practically disappear from the list of hot spots.

![image](https://github.com/llvm/llvm-project/assets/600573/43311f16-b8fb-4c46-96bd-d07b519ce6de)

The -march=skylake + -O3 configuration produces a number of AVX2 operations (since AVX512 is not supported by Skylake) that have virtually no impact on execution speed.

![image](https://github.com/llvm/llvm-project/assets/600573/fbe79deb-1036-46b0-8675-8d7e2d03e9b6)

The second location dominates again due to a series of gather operations (oops, AVX2 has gathers and scatters too!).

![image](https://github.com/llvm/llvm-project/assets/600573/4af5a056-953a-4be8-9855-dbecd0667018)

This gather instruction is again heavily microcoded and has high latency.

![image](https://github.com/llvm/llvm-project/assets/600573/a3196ca5-948e-4c4d-98af-d66709cee2db)

With -march=skylake and -O2, gather operations are no longer emitted, and the hotspots move to scalar computation elsewhere in the code, as expected.

Checked with clang version 17.0.6. I have also observed similar behavior with clang version 19.0.0git (https://github.com/llvm/llvm-project.git 35886dc63a2d024e20c10d2e1cb3f5fa5d9f72cc).

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-x86

Author: Bartosz Taudul (wolfpld)


dtcxzyw commented 5 months ago

cc @RKSimon

RKSimon commented 5 months ago

Looking at this now, but I'm going to have to think about where to start tbh - gathers/scatters are still a mess in both the cost tables and the scheduler models (the znver4 model has no entries at all, so they are modelled as simple loads/stores).

wolfpld commented 3 weeks ago

A bit simpler repro case:

#include <stdint.h>
#include <string.h>

void FixOrder( char* data, size_t blocks )
{
    do
    {
        uint32_t tmp;
        memcpy( &tmp, data+4, 4 );
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555;
        uint32_t t1 = tmp & 0xAAAAAAAA;
        tmp = ( ( t0 << 1 ) | ( t1 >> 1 ) ) ^ t1;
        memcpy( data+4, &tmp, 4 );
        data += 8;
    }
    while( --blocks );
}
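
To make the access pattern of this repro explicit: only bytes 4..7 of each 8-byte block are read and written, so both the loads and the stores have a stride of two 32-bit words. The sketch below restates the loop with an index and adds a small self-check. The names `fix_word`, `fix_order_ref`, `low_words_equal` and `fix_order_self_check` are made up for illustration, and unlike `FixOrder` the reference version does nothing when `blocks` is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per-word transform, copied from the loop body of the repro. */
static uint32_t fix_word(uint32_t tmp)
{
    tmp = ~tmp;
    uint32_t t0 = tmp & 0x55555555u;
    uint32_t t1 = tmp & 0xAAAAAAAAu;
    return ((t0 << 1) | (t1 >> 1)) ^ t1;
}

/* Indexed restatement of FixOrder: touches bytes 4..7 of every 8-byte block.
   Unlike the do/while original, this is a no-op for blocks == 0. */
static void fix_order_ref(unsigned char *data, size_t blocks)
{
    for (size_t i = 0; i < blocks; i++) {
        uint32_t tmp;
        memcpy(&tmp, data + i * 8 + 4, 4);
        tmp = fix_word(tmp);
        memcpy(data + i * 8 + 4, &tmp, 4);
    }
}

/* Returns 1 if bytes 0..3 of every 8-byte block are identical in a and b. */
static int low_words_equal(const unsigned char *a, const unsigned char *b,
                           size_t blocks)
{
    for (size_t i = 0; i < blocks; i++)
        if (memcmp(a + i * 8, b + i * 8, 4) != 0)
            return 0;
    return 1;
}

/* Checks that the first half of each block is untouched and the second
   half holds the transformed word. */
static void fix_order_self_check(void)
{
    unsigned char buf[32], orig[32];
    for (int i = 0; i < 32; i++)
        buf[i] = (unsigned char)(i * 7 + 1);
    memcpy(orig, buf, sizeof buf);

    fix_order_ref(buf, 4);
    assert(low_words_equal(buf, orig, 4));

    for (size_t i = 0; i < 4; i++) {
        uint32_t before, after;
        memcpy(&before, orig + i * 8 + 4, 4);
        memcpy(&after, buf + i * 8 + 4, 4);
        assert(after == fix_word(before));
    }
}
```

Because every other 32-bit word must be left alone, a vectorized version has to either pick out the odd words and write them back with strided stores, or load and store whole blocks while preserving the even words.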

The following assembly is generated with -O3 -march=znver4:

.LBB0_7:
  vmovdqu64 zmm3, zmmword ptr [r9 + 4]
  kxnorw k1, k0, k0
  add r8, -16
  vpermt2d zmm3, zmm0, zmmword ptr [r9 + 68]
  vpandnd zmm4, zmm3, zmm1
  vpternlogq zmm3, zmm3, zmm3, 15
  vpaddd zmm3, zmm3, zmm3
  vpsrld zmm5, zmm4, 1
  vpandd zmm3, zmm3, zmm1
  vpternlogd zmm5, zmm4, zmm3, 54
  vpscatterdd zmmword ptr [r9 + zmm2] {k1}, zmm5
  sub r9, -128
  cmp rax, r8
  jne .LBB0_7
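
In the loop above, the two 64-byte loads are combined with `vpermt2d` to extract the words at offset 4 of each block, and the results are written back with a single `vpscatterdd`. For comparison purposes, one way to get a scalar baseline from the same source without changing the compiler flags is Clang's documented loop hint pragma. The sketch below is an assumption-laden experiment, not something proposed in this thread: `FixOrderNoVec` is a hypothetical name, the added `assert` is not in the original repro, and I have not measured whether this variant is faster or verified that no other pass vectorizes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same loop body as the repro. The only changes are the precondition check
   (compiled out when NDEBUG is defined) and the loop hint. */
void FixOrderNoVec( char* data, size_t blocks )
{
    /* The do/while form would wrap around for blocks == 0. */
    assert( blocks > 0 );

    /* Ask Clang's loop vectorizer to skip this loop; other compilers
       never see the pragma because of the guard. */
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#endif
    do
    {
        uint32_t tmp;
        memcpy( &tmp, data+4, 4 );
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555;
        uint32_t t1 = tmp & 0xAAAAAAAA;
        tmp = ( ( t0 << 1 ) | ( t1 >> 1 ) ) ^ t1;
        memcpy( data+4, &tmp, 4 );
        data += 8;
    }
    while( --blocks );
}
```

Compiling both functions with the same `-O3 -march=znver4` flags and comparing the generated assembly and timings would show how much of the difference is attributable to the scatter-based vectorization.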

[clang] Poor performance because gather and scatter operations are emitted by the compiler targeting AVX512 or at level 3 of optimization #87640