loongson-community / discussions

Cross-community issue tracker & discussions

[GCC14] Coremark performance regression by 15% on 3A6000 #23

Closed jiegec closed 5 months ago

jiegec commented 10 months ago

Observed a CoreMark v1.01 performance regression of about 15% between GCC 13.2.1 (20231014) and GCC 14.0.0 (20231203, experimental):

GCC 13.2.1:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 14287
Total time (secs): 14.287000
Iterations/Sec   : 20998.110170
Iterations       : 300000
Compiler version : GCC13.2.1 20231014
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 20998.110170 / GCC13.2.1 20231014 -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

GCC 14.0.0 snapshot (make run CC="gcc-14"):

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16387
Total time (secs): 16.387000
Iterations/Sec   : 18307.194728
Iterations       : 300000
Compiler version : GCC14.0.0 20231203 (experimental)
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 18307.194728 / GCC14.0.0 20231203 (experimental) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

Tested on Gentoo Linux. Binutils: 2.41.0 p2. Glibc: 2.38-r7 with LSX/LASX patches.

In case anyone wants to reproduce the results, download the executables from coremark.tar.gz.
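
For reference, the headline figure can be recomputed from the two runs above. The following is a minimal sketch in C; the function names are made up for illustration and are not part of CoreMark. Measured by total ticks, the GCC 14 build takes roughly 14.7% longer, while measured by Iterations/Sec the score drops by roughly 12.8%, so the "15%" corresponds to the run-time increase.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of CoreMark. */

/* Extra run time of the new build relative to the old one, in percent. */
double runtime_increase_pct(double old_ticks, double new_ticks) {
    assert(old_ticks > 0.0);
    return (new_ticks / old_ticks - 1.0) * 100.0;
}

/* Score drop in percent for the same iteration count.
 * The score (Iterations/Sec) is inversely proportional to run time. */
double score_drop_pct(double old_ticks, double new_ticks) {
    assert(new_ticks > 0.0);
    return (1.0 - old_ticks / new_ticks) * 100.0;
}
```

With the tick counts reported above, runtime_increase_pct(14287, 16387) gives about 14.7 and score_drop_pct(14287, 16387) gives about 12.8.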

xry111 commented 10 months ago

A part of it is https://gcc.gnu.org/PR112919.

jiegec commented 10 months ago

Clang 17 is even worse 😢

$ ./coremark-clang17.exe 0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16841
Total time (secs): 16.841000
Iterations/Sec   : 17813.669022
Iterations       : 300000
Compiler version : Clang 17.0.6
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 17813.669022 / Clang 17.0.6 -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
$ ./coremark-gcc14.exe  0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16387
Total time (secs): 16.387000
Iterations/Sec   : 18307.194728
Iterations       : 300000
Compiler version : GCC14.0.0 20231203 (experimental)
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 18307.194728 / GCC14.0.0 20231203 (experimental) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
$ ./coremark-gcc13.exe  0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 14150
Total time (secs): 14.150000
Iterations/Sec   : 21201.413428
Iterations       : 300000
Compiler version : GCC13.2.1 20231014
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 21201.413428 / GCC13.2.1 20231014 -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

Let me do some experiments.

xry111 commented 10 months ago

A part of it is https://gcc.gnu.org/PR112919.

The other part(s) of the regression seem to be caused by generic code. I've made a Git branch with all LoongArch target code changes since the 13.1 release stripped from trunk, and I can only get 19602.

FWIW there are some GCC 14 performance regression bugs open in the GCC Bugzilla.

jiegec commented 10 months ago

The crcu8 function in coremark:

ee_u16 crcu8(ee_u8 data, ee_u16 crc) {
  ee_u8 i = 0, x16 = 0, carry = 0;

  for (i = 0; i < 8; i++) {
    x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
    data >>= 1;

    if (x16 == 1) {
      crc ^= 0x4002;
      carry = 1;
    } else
      carry = 0;
    crc >>= 1;
    if (carry)
      crc |= 0x8000;
    else
      crc &= 0x7fff;
  }
  return crc;
}

Compiles to the following assembly using GCC 13:

000000000000097c <_Z5crcu8ht>:
 97c:   15ffff4f    lu12i.w         $t3, -6
 980:   0015008e    move            $t2, $a0
 984:   0280200d    li.w            $t1, 8
 988:   001500a4    move            $a0, $a1
 98c:   038005ef    ori             $t3, $t3, 0x1
 990:   0015b88c    xor             $t0, $a0, $t2
 994:   0340058c    andi            $t0, $t0, 0x1
 998:   0011300c    sub.w           $t0, $zero, $t0
 99c:   00448484    srli.w          $a0, $a0, 0x1
 9a0:   0014bd8c    and             $t0, $t0, $t3
 9a4:   02bffdad    addi.w          $t1, $t1, -1
 9a8:   0015918c    xor             $t0, $t0, $a0
 9ac:   006781ad    bstrpick.w      $t1, $t1, 0x7, 0x0
 9b0:   004505ce    srli.d          $t2, $t2, 0x1
 9b4:   006f8184    bstrpick.w      $a0, $t0, 0xf, 0x0
 9b8:   47ffd9bf    bnez            $t1, -40    # 990 <_Z5crcu8ht+0x14>
 9bc:   4c000020    ret         

Using GCC 14:

0000000000000a00 <_Z5crcu8ht>:
 a00:   15ffff4f    lu12i.w         $t3, -6
 a04:   0015008e    move            $t2, $a0
 a08:   0280200d    li.w            $t1, 8
 a0c:   038005ef    ori             $t3, $t3, 0x1
 a10:   001500a4    move            $a0, $a1
 a14:   03400000    nop         
 a18:   03400000    nop         
 a1c:   03400000    nop         
 a20:   0015b88c    xor             $t0, $a0, $t2
 a24:   0340058c    andi            $t0, $t0, 0x1
 a28:   001c3d8c    mul.w           $t0, $t0, $t3
 a2c:   00448484    srli.w          $a0, $a0, 0x1
 a30:   02bffdad    addi.w          $t1, $t1, -1
 a34:   006781ad    bstrpick.w      $t1, $t1, 0x7, 0x0
 a38:   004505ce    srli.d          $t2, $t2, 0x1
 a3c:   0015918c    xor             $t0, $t0, $a0
 a40:   006f8184    bstrpick.w      $a0, $t0, 0xf, 0x0
 a44:   47ffddbf    bnez            $t1, -36    # a20 <_Z5crcu8ht+0x20>
 a48:   4c000020    ret         

Comparing the two assemblies:

  1. Extra nops are generated by GCC 14 due to the larger code alignment, but that is fine because they are outside the loop.
  2. GCC 14 generates a slow mul.w instead of the sub + and sequence:
# GCC 13:
 998:   0011300c    sub.w           $t0, $zero, $t0
 9a0:   0014bd8c    and             $t0, $t0, $t3

# GCC 14:
 a28:   001c3d8c    mul.w           $t0, $t0, $t3

$t3 equals 0xffffffffffffa001 (-24575). GCC merges crc ^= 0x4002, crc >>= 1, crc |= 0x8000 into crc >>= 1, crc ^= 0xa001 (0xa001 = 40961). Because x16 is either 0 or 1, this can be further rewritten as crc ^= x16 * 40961, which is the form that ends up as mul.w.
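
To make the merge concrete, here is a minimal sketch of a single loop iteration in three equivalent forms: the original source, the negate-and-mask shape that matches the GCC 13 output, and the multiply shape that matches the GCC 14 output. The step_* and count_step_mismatches names are made up for this illustration, and the typedefs are assumed to match CoreMark's.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  ee_u8;   /* assumed to match CoreMark's typedefs */
typedef uint16_t ee_u16;

/* One loop iteration as written in CoreMark's crcu8; x16 is 0 or 1. */
ee_u16 step_original(ee_u8 x16, ee_u16 crc) {
    ee_u8 carry = 0;
    assert(x16 <= 1);
    if (x16 == 1) {
        crc ^= 0x4002;
        carry = 1;
    }
    crc >>= 1;
    if (carry)
        crc |= 0x8000;
    else
        crc &= 0x7fff;
    return crc;
}

/* GCC 13 shape: negate the 0/1 bit into an all-zeros or all-ones mask
 * (sub.w), AND it with the constant, then XOR into the shifted crc. */
ee_u16 step_mask(ee_u8 x16, ee_u16 crc) {
    return (ee_u16)((crc >> 1) ^ (-(unsigned)x16 & 0xa001u));
}

/* GCC 14 shape: multiply the 0/1 bit by the constant (mul.w), then XOR. */
ee_u16 step_mul(ee_u8 x16, ee_u16 crc) {
    return (ee_u16)((crc >> 1) ^ ((unsigned)x16 * 0xa001u));
}

/* Exhaustively compare the three forms; returns the number of mismatches. */
unsigned long count_step_mismatches(void) {
    unsigned long bad = 0;
    for (unsigned x16 = 0; x16 <= 1; x16++) {
        for (unsigned long crc = 0; crc <= 0xffff; crc++) {
            ee_u16 want = step_original((ee_u8)x16, (ee_u16)crc);
            if (step_mask((ee_u8)x16, (ee_u16)crc) != want) bad++;
            if (step_mul((ee_u8)x16, (ee_u16)crc) != want) bad++;
        }
    }
    return bad;
}
```

The forms agree because crc >> 1 always has bit 15 clear, so OR-ing in 0x8000 is the same as XOR-ing it, and (crc ^ 0x4002) >> 1 equals (crc >> 1) ^ 0x2001; together that gives (crc >> 1) ^ 0xa001. Calling count_step_mismatches() should return 0.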

I guess the problem lies in the synthesis of multiplication. I have also observed that num * 3 is not optimized into alsl.
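
For the num * 3 case, the multiplication is equivalent to one shift plus one add. As I understand the LoongArch ISA, that is the shape a single alsl instruction computes, with alsl.w rd, rj, rk, sa yielding (rj << sa) + rk. Below is a small sketch with hypothetical function names, using unsigned arithmetic so that wraparound is well defined:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: num * 3 written as shift-and-add,
 * i.e. (num << 1) + num, which maps onto one alsl-style operation. */
uint32_t mul3_shift_add(uint32_t num) {
    return (num << 1) + num;
}

/* Returns 1 if the shift-and-add form agrees with plain multiplication. */
int mul3_forms_agree(uint32_t num) {
    return mul3_shift_add(num) == num * 3u;
}

/* Spot-check a few values, including ones that wrap around. */
void mul3_self_check(void) {
    assert(mul3_forms_agree(0u));
    assert(mul3_forms_agree(7u));
    assert(mul3_forms_agree(0x80000000u));
    assert(mul3_forms_agree(0xffffffffu));
}
```

Whether the compiler picks the shift-and-add form or a real multiply depends on the instruction costs the backend reports.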

Performance in crcu8: GCC 13 is 37% faster than GCC 14.

xry111 commented 10 months ago

Submitted as https://gcc.gnu.org/PR112935.

jiegec commented 10 months ago

Manually rewriting the crcu8 function to simplify the xor, and removing the extra alignment, recovers the lost performance:

ee_u16
crcu8(ee_u8 data, ee_u16 crc)
{
    ee_u8 i = 0, x16 = 0, carry = 0;

    for (i = 0; i < 8; i++)
    {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;

        crc >>= 1;
        if (x16 == 1)
        {
            crc ^= 0xa001;
        }
    }
    return crc;
}

Performance numbers:

  1. GCC 13: 21175
  2. GCC 14: 19678
  3. GCC 14 with -falign-labels=1: 21349

jiegec commented 10 months ago

With the "LoongArch: Fix instruction costs" and gimple_zero_one_valued_p patches applied:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 15199
Total time (secs): 15.199000
Iterations/Sec   : 19738.140667
Iterations       : 300000
Compiler version : GCC14.0.0 20231210 (experimental)
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 19738.140667 / GCC14.0.0 20231210 (experimental) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

Additionally, pass -falign-labels=1:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 14071
Total time (secs): 14.071000
Iterations/Sec   : 21320.446308
Iterations       : 300000
Compiler version : GCC14.0.0 20231210 (experimental)
Compiler flags   : -O2 -falign-labels=1 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 21320.446308 / GCC14.0.0 20231210 (experimental) -O2 -falign-labels=1 -DPERFORMANCE_RUN=1  -lrt / Heap

pinskia commented 10 months ago

A note on Clang/LLVM vs. GCC performance: at one point, GCC did more jump threading when it came to the state machine code. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54742 for that. I am not 100% sure whether that is still the case, though.

xry111 commented 6 months ago

With https://gcc.gnu.org/g:8f0ff6b998748f3581e0f06e3108193866b1209d applied, I get 23706 points with the default -mtune and 23713 points with -mtune=la664. This is at 2.7 GHz, so assuming the score scales linearly with clock frequency, it should be approximately 21950 at 2.5 GHz.
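
The frequency adjustment is a plain proportional rescale. A minimal sketch, assuming the score really is proportional to clock frequency (the function name is made up for illustration):

```c
#include <assert.h>

/* Hypothetical helper: rescale a benchmark score from one clock frequency
 * to another, assuming the score is proportional to frequency. */
double scale_score(double score, double from_ghz, double to_ghz) {
    assert(from_ghz > 0.0);
    return score * to_ghz / from_ghz;
}
```

scale_score(23706, 2.7, 2.5) comes out at about 21950, matching the estimate above.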

xry111 commented 5 months ago

@xen0n closing this as resolved?

xen0n commented 5 months ago

> @xen0n closing this as resolved?

Feel free! (I've confirmed the results on my 3A6000 box earlier this month and the difference with your results adjusted to 2.5GHz is negligible.)