Closed jiegec closed 5 months ago
Clang 17 is even worse 😢
$ ./coremark-clang17.exe 0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16841
Total time (secs): 16.841000
Iterations/Sec : 17813.669022
Iterations : 300000
Compiler version : Clang 17.0.6
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 17813.669022 / Clang 17.0.6 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
$ ./coremark-gcc14.exe 0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16387
Total time (secs): 16.387000
Iterations/Sec : 18307.194728
Iterations : 300000
Compiler version : GCC14.0.0 20231203 (experimental)
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 18307.194728 / GCC14.0.0 20231203 (experimental) -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
$ ./coremark-gcc13.exe 0x0 0x0 0x66 0 7 1 2000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 14150
Total time (secs): 14.150000
Iterations/Sec : 21201.413428
Iterations : 300000
Compiler version : GCC13.2.1 20231014
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 21201.413428 / GCC13.2.1 20231014 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
Let me do some experiments.
A part of it is https://gcc.gnu.org/PR112919.
The other part(s) of the regression seem to be caused by generic code. I made a Git branch of trunk with all LoongArch target code changes since the 13.1 release stripped, and I still only get 19602.
FWIW there are some GCC 14 performance regression bugs open in the GCC Bugzilla.
The crcu8 function in coremark:
ee_u16 crcu8(ee_u8 data, ee_u16 crc) {
    ee_u8 i = 0, x16 = 0, carry = 0;
    for (i = 0; i < 8; i++) {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;
        if (x16 == 1) {
            crc ^= 0x4002;
            carry = 1;
        } else
            carry = 0;
        crc >>= 1;
        if (carry)
            crc |= 0x8000;
        else
            crc &= 0x7fff;
    }
    return crc;
}
Compiles to the following assembly using GCC 13:
000000000000097c <_Z5crcu8ht>:
97c: 15ffff4f lu12i.w $t3, -6
980: 0015008e move $t2, $a0
984: 0280200d li.w $t1, 8
988: 001500a4 move $a0, $a1
98c: 038005ef ori $t3, $t3, 0x1
990: 0015b88c xor $t0, $a0, $t2
994: 0340058c andi $t0, $t0, 0x1
998: 0011300c sub.w $t0, $zero, $t0
99c: 00448484 srli.w $a0, $a0, 0x1
9a0: 0014bd8c and $t0, $t0, $t3
9a4: 02bffdad addi.w $t1, $t1, -1
9a8: 0015918c xor $t0, $t0, $a0
9ac: 006781ad bstrpick.w $t1, $t1, 0x7, 0x0
9b0: 004505ce srli.d $t2, $t2, 0x1
9b4: 006f8184 bstrpick.w $a0, $t0, 0xf, 0x0
9b8: 47ffd9bf bnez $t1, -40 # 990 <_Z5crcu8ht+0x14>
9bc: 4c000020 ret
Using GCC 14:
0000000000000a00 <_Z5crcu8ht>:
a00: 15ffff4f lu12i.w $t3, -6
a04: 0015008e move $t2, $a0
a08: 0280200d li.w $t1, 8
a0c: 038005ef ori $t3, $t3, 0x1
a10: 001500a4 move $a0, $a1
a14: 03400000 nop
a18: 03400000 nop
a1c: 03400000 nop
a20: 0015b88c xor $t0, $a0, $t2
a24: 0340058c andi $t0, $t0, 0x1
a28: 001c3d8c mul.w $t0, $t0, $t3
a2c: 00448484 srli.w $a0, $a0, 0x1
a30: 02bffdad addi.w $t1, $t1, -1
a34: 006781ad bstrpick.w $t1, $t1, 0x7, 0x0
a38: 004505ce srli.d $t2, $t2, 0x1
a3c: 0015918c xor $t0, $t0, $a0
a40: 006f8184 bstrpick.w $a0, $t0, 0xf, 0x0
a44: 47ffddbf bnez $t1, -36 # a20 <_Z5crcu8ht+0x20>
a48: 4c000020 ret
Comparing the two assemblies:
# GCC 13:
998: 0011300c sub.w $t0, $zero, $t0
9a0: 0014bd8c and $t0, $t0, $t3
# GCC 14:
a28: 001c3d8c mul.w $t0, $t0, $t3
$t3 equals 0xffffffffffffa001 (-24575). GCC merges crc ^= 0x4002, crc >>= 1, crc |= 0x8000 into crc >>= 1, crc ^= 0xa001 (0xa001 = 40961). It can be further optimized into crc ^= x16 * 40961.
I guess the problem lies in the synthesis of multiplication. I have also observed that num * 3 is not optimized into alsl.
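The two instruction sequences compute the same value because x16 is always 0 or 1. A minimal sketch of that identity follows; the helper names select_by_mask, select_by_mul and check_forms_agree are made up for illustration and are not from CoreMark or GCC.

```c
#include <assert.h>
#include <stdint.h>

/* GCC 13 sequence: negate the 0/1 value to get an all-zeros or all-ones
   mask, then AND it with the constant (sub.w + and). */
static uint32_t select_by_mask(uint32_t bit, uint32_t k) {
    return (0u - bit) & k;
}

/* GCC 14 sequence: multiply the 0/1 value by the constant (mul.w). */
static uint32_t select_by_mul(uint32_t bit, uint32_t k) {
    return bit * k;
}

/* Asserts that both forms agree for the only two values x16 can take. */
static void check_forms_agree(uint32_t k) {
    for (uint32_t bit = 0; bit <= 1; bit++)
        assert(select_by_mask(bit, k) == select_by_mul(bit, k));
}
```

Calling check_forms_agree(0xffffa001u) exercises the low 32 bits of the constant held in $t3. The results are identical, so any speed difference comes from the cost of mul.w against the sub.w + and pair, not from the values computed.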
Performance in crcu8: GCC 13 is 37% faster than GCC 14.
Submitted as https://gcc.gnu.org/PR112935.
Manually rewriting the crcu8 function to simplify the xor and removing the extra alignment recovers the lost performance:
ee_u16
crcu8(ee_u8 data, ee_u16 crc)
{
    ee_u8 i = 0, x16 = 0, carry = 0;
    for (i = 0; i < 8; i++)
    {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;
        crc >>= 1;
        if (x16 == 1)
        {
            crc ^= 0xa001;
        }
    }
    return crc;
}
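For anyone who wants to confirm that the hand-simplified version is a drop-in replacement, here is a rough sketch of an exhaustive comparison over all 2^24 (data, crc) inputs. The typedefs assume the usual 8-bit and 16-bit widths for ee_u8 and ee_u16. The names crcu8_original, crcu8_simplified, count_mismatches and check_equivalence are only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: same widths as CoreMark's usual ee_u8/ee_u16 typedefs. */
typedef uint8_t  ee_u8;
typedef uint16_t ee_u16;

/* Original formulation, as quoted earlier in this thread. */
static ee_u16 crcu8_original(ee_u8 data, ee_u16 crc) {
    ee_u8 i, x16, carry;
    for (i = 0; i < 8; i++) {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;
        if (x16 == 1) {
            crc ^= 0x4002;
            carry = 1;
        } else
            carry = 0;
        crc >>= 1;
        if (carry)
            crc |= 0x8000;
        else
            crc &= 0x7fff;
    }
    return crc;
}

/* Hand-simplified formulation: shift first, then one conditional xor. */
static ee_u16 crcu8_simplified(ee_u8 data, ee_u16 crc) {
    ee_u8 i, x16;
    for (i = 0; i < 8; i++) {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;
        crc >>= 1;
        if (x16 == 1)
            crc ^= 0xa001;
    }
    return crc;
}

/* Counts the inputs on which the two versions disagree. */
static unsigned long count_mismatches(void) {
    unsigned long mismatches = 0;
    for (unsigned data = 0; data < 256; data++)
        for (unsigned crc = 0; crc < 65536; crc++)
            if (crcu8_original((ee_u8)data, (ee_u16)crc) !=
                crcu8_simplified((ee_u8)data, (ee_u16)crc))
                mismatches++;
    return mismatches;
}

/* Aborts if any input produces different results. */
static void check_equivalence(void) {
    assert(count_mismatches() == 0);
}
```

The check only shows that the rewrite is semantically equivalent. It says nothing about speed, which still has to be measured with the benchmark itself.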
Performance numbers:
With the "LoongArch: Fix instruction costs" and gimple_zero_one_valued_p patches applied:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 15199
Total time (secs): 15.199000
Iterations/Sec : 19738.140667
Iterations : 300000
Compiler version : GCC14.0.0 20231210 (experimental)
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 19738.140667 / GCC14.0.0 20231210 (experimental) -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
Additionally, pass -falign-labels=1:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 14071
Total time (secs): 14.071000
Iterations/Sec : 21320.446308
Iterations : 300000
Compiler version : GCC14.0.0 20231210 (experimental)
Compiler flags : -O2 -falign-labels=1 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 21320.446308 / GCC14.0.0 20231210 (experimental) -O2 -falign-labels=1 -DPERFORMANCE_RUN=1 -lrt / Heap
A note on Clang/LLVM vs. GCC performance: at one point, GCC did more jump threading on the state machine code; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54742. I am not 100% sure that is still the case, though.
With https://gcc.gnu.org/g:8f0ff6b998748f3581e0f06e3108193866b1209d applied, I get 23706 points with the default -mtune and 23713 points with -mtune=la664. It's at 2.7 GHz so assuming a linear correlation the mark should be approximately 21950 at 2.5 GHz.
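The frequency adjustment above is a plain proportional rescale. A small sketch, assuming the score really does scale linearly with clock frequency (which ignores memory-bound effects); scale_score is a hypothetical helper name.

```c
#include <assert.h>

/* Scales a benchmark score linearly from one clock frequency to another.
   Assumption: the score is proportional to the clock frequency. */
static double scale_score(double score, double from_ghz, double to_ghz) {
    assert(from_ghz > 0.0);
    return score * to_ghz / from_ghz;
}
```

For example, scale_score(23706.0, 2.7, 2.5) reproduces the roughly 21950 figure quoted above.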
@xen0n closing this as resolved?
Feel free! (I've confirmed the results on my 3A6000 box earlier this month and the difference with your results adjusted to 2.5GHz is negligible.)
I observed a 15% CoreMark v1.01 performance regression between GCC 13.2.1 (20231014) and the experimental GCC 14.0.0 (20231203):
GCC 13.2.1:
GCC 14.0.0 snapshot (make run CC="gcc-14"):
Tested on Gentoo Linux. Binutils: 2.41.0 p2. Glibc: 2.38-r7 with LSX/LASX patches.
In case anyone wants to reproduce the results, download the executables from coremark.tar.gz.