FFT and Sparse matmult in Scimark2 are slower with `-O2` & `-O3` than `-O1`

llvmbot commented 5 years ago


Bugzilla Link	43576
Version	9.0
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@davidbolvansky,@DougGregor,@RKSimon,@zygoloid

Extended Description

FFT and Sparse matmult in Scimark2 (https://math.nist.gov/scimark2/download_c.html) are slower with -O2 and -O3 than with -O1.

gcc-9.2.0 does not show this regression: both kernels are faster at -O2 and -O3 than at -O1.

All experiments were run on CentOS 7.6.1810 (Core).

╔════════════════════════════╤═════════════╤════════════════╤═══════════╗
║ make CC=clang-9/gcc \      │             │                │           ║
║ CFLAGS="-Ox -march=native" │  Mflops in  │    Comment     │ Mflops in ║
║                            │ clang-9.0.0 │                │ gcc-9.2.0 ║
║ ./scimark2                 │             │                │           ║
╠══════╤═════════════════════╪═════════════╪════════════════╪═══════════╣
║ -O0  │ Composite Score     │ 480.30      │                │ 435.11    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ FFT                 │ 356.03      │                │ 326.45    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ SOR                 │ 772.92      │                │ 768.71    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ MonteCarlo          │ 77.57       │                │ 92.10     ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ Sparse matmult      │ 459.64      │                │ 419.38    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ LU                  │ 735.37      │                │ 568.94    ║
╟──────┼─────────────────────┼─────────────┼────────────────┼───────────╢
║ -O1  │ Composite Score     │ 1494.63     │                │ 1451.81   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ FFT                 │ 1430.49     │                │ 1400.51   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ SOR                 │ 1117.54     │                │ 856.00    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ MonteCarlo          │ 439.39      │                │ 523.79    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ Sparse matmult      │ 2179.07     │                │ 2188.18   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ LU                  │ 2306.67     │                │ 2290.57   ║
╟──────┼─────────────────────┼─────────────┼────────────────┼───────────╢
║ -O2  │ Composite Score     │ 1743.93     │                │ 1700.95   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ FFT                 │ **1300.62** │ Slower than O1 │ 1618.03   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ SOR                 │ 1123.46     │                │ 1067.22   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ MonteCarlo          │ 440.73      │                │ 584.45    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ Sparse matmult      │ **1771.17** │ Slower than O1 │ 2446.48   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ LU                  │ 4083.67     │                │ 2788.59   ║
╟──────┼─────────────────────┼─────────────┼────────────────┼───────────╢
║ -O3  │ Composite Score     │ 1786.63     │                │ 2376.34   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ FFT                 │ **1304.30** │ Slower than O1 │ 1700.11   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ SOR                 │ 1128.26     │                │ 1540.28   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ MonteCarlo          │ 439.83      │                │ 587.74    ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ Sparse matmult      │ **1896.98** │ Slower than O1 │ 2443.62   ║
║      ├─────────────────────┼─────────────┼────────────────┼───────────╢
║      │ LU                  │ 4163.75     │                │ 5609.94   ║
╚══════╧═════════════════════╧═════════════╧════════════════╧═══════════╝
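
To put numbers on the regression, the clang and gcc figures for the two affected kernels can be compared against their `-O1` baselines. The sketch below only redoes the arithmetic from the table; the helper name `pct_vs_o1` is made up for illustration.

```python
# Mflops copied from the table above (all builds use -march=native).
MFLOPS = {
    "clang-9.0.0": {
        "FFT":            {"-O1": 1430.49, "-O2": 1300.62, "-O3": 1304.30},
        "Sparse matmult": {"-O1": 2179.07, "-O2": 1771.17, "-O3": 1896.98},
    },
    "gcc-9.2.0": {
        "FFT":            {"-O1": 1400.51, "-O2": 1618.03, "-O3": 1700.11},
        "Sparse matmult": {"-O1": 2188.18, "-O2": 2446.48, "-O3": 2443.62},
    },
}


def pct_vs_o1(compiler, kernel, level):
    """Percent change in Mflops at `level` relative to -O1 (negative = slower)."""
    scores = MFLOPS[compiler][kernel]
    return (scores[level] / scores["-O1"] - 1.0) * 100.0


if __name__ == "__main__":
    for compiler, kernels in MFLOPS.items():
        for kernel in kernels:
            for level in ("-O2", "-O3"):
                change = pct_vs_o1(compiler, kernel, level)
                print(f"{compiler:12} {kernel:15} {level}: {change:+6.1f}%")
```

By this measure clang loses roughly 9% on FFT and 13 to 19% on Sparse matmult when going from -O1 to -O2 or -O3, while gcc gains on both kernels.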

davidbolvansky commented 5 years ago

Looks like there are multiple issues here.

For LU, the loop unroller regresses benchmark performance a lot. Maybe you can reduce LU a bit and find the hot loop that is unrolled too much? -> New bug report.

The -Ox -march=native results again show regressions. Probably an SLP vectorizer and cost model issue?

Can you try it with -Ox -march=native -fno-unroll-loops -fno-slp-vectorize ?

Loop interchange (-enable-loopinterchange) seems to improve things a bit, which is nice.
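
One way to try the suggested flags across optimization levels is a small driver that rebuilds and reruns the benchmark per configuration. This is only a sketch. It assumes the Scimark2 makefile accepts `CC` and `CFLAGS` overrides as in the report's table header, that it has a `clean` target, and that the binary is `./scimark2`. The function names are hypothetical.

```python
import itertools
import subprocess

OPT_LEVELS = ["-O1", "-O2", "-O3"]
# Flags added on top of each optimization level; the second entry is the
# combination suggested in this comment (clang-specific spelling).
EXTRA_FLAGS = [
    "-march=native",
    "-march=native -fno-unroll-loops -fno-slp-vectorize",
]


def build_commands(cc="clang-9"):
    """Return (cflags, [clean_cmd, build_cmd, run_cmd]) for every configuration."""
    configs = []
    for opt, extra in itertools.product(OPT_LEVELS, EXTRA_FLAGS):
        cflags = f"{opt} {extra}"
        configs.append((cflags, [
            ["make", "clean"],                         # assumed makefile target
            ["make", f"CC={cc}", f"CFLAGS={cflags}"],  # as in the report's table header
            ["./scimark2"],
        ]))
    return configs


def run_all(cc="clang-9"):
    """Build and run each configuration; return raw benchmark output per CFLAGS."""
    results = {}
    for cflags, commands in build_commands(cc):
        for cmd in commands[:-1]:
            subprocess.run(cmd, check=True)
        out = subprocess.run(commands[-1], check=True,
                             capture_output=True, text=True)
        results[cflags] = out.stdout
    return results
```

Calling `run_all()` from the Scimark2 source directory would give one block of benchmark output per flag set, which can then be compared by hand or parsed.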

llvmbot commented 5 years ago

Thank you David.

Here is the result:

CFLAGS:

╔═══╦═════════════════════════════════════╗
║ 1 ║ -Ox -march=native                   ║
╟───╫─────────────────────────────────────╢
║ 2 ║ -Ox -march=native -fno-unroll-loops ║
╟───╫─────────────────────────────────────╢
║ 3 ║ -Ox                                 ║
╟───╫─────────────────────────────────────╢
║ 4 ║ -Ox -fno-unroll-loops               ║
╟───╫─────────────────────────────────────╢
║ 5 ║ -Ox -mllvm -enable-loopinterchange  ║
╚═══╩═════════════════════════════════════╝

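Mflops (columns are the CFLAGS sets 1-5 above; the rows in each -Ox group are Composite Score, FFT, SOR, MonteCarlo, Sparse matmult, LU, in the same order as the first table):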

╔═════╤══════╤═════════════╤═════════════╤═════════════╤═════════════╗
║     │ 1    │ 2           │ 3           │ 4           │ 5           ║
╠═════╪══════╪═════════════╪═════════════╪═════════════╪═════════════╣
║ -O1 │ 1494 │    1494     │    1498     │    1499     │ 1491        ║
║     ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║     │ 1430 │    1428     │    1414     │    1410     │ 1389        ║
║     ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║     │ 1117 │    1118     │    1117     │    1117     │ 1117        ║
║     ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║     │ 439  │     439     │     440     │     440     │ 438         ║
║     ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║     │ 2179 │    2180     │    2212     │    2217     │ 2204        ║
║     ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║     │ 2306 │    2304     │    2304     │    2310     │ 2309        ║
╟─────┼──────┼──────┬──────┼──────┬──────┼──────┬──────┼──────┬──────╢
║ -O2 │ 1743 │ 1962 │      │ 1834 │      │ 1748 │      │ 1827 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1300 │ 1304 │ SLOW │ 1466 │      │ 1318 │ SLOW │ 1465 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1123 │ 1117 │      │ 1125 │      │ 1113 │      │ 1124 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 440  │ 440  │      │ 440  │      │ 438  │      │ 440  │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1771 │ 2041 │      │ 2210 │      │ 2185 │      │ 2196 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 4083 │ 4909 │      │ 3928 │      │ 3683 │      │ 3912 │      ║
╟─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ -O3 │ 1786 │ 1964 │      │ 1820 │      │ 1731 │      │ 1818 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1304 │ 1298 │ SLOW │ 1455 │      │ 1417 │      │ 1451 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1128 │ 1117 │      │ 1126 │      │ 1114 │      │ 1127 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 439  │ 439  │      │ 437  │      │ 439  │      │ 442  │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 1896 │ 2040 │      │ 2205 │      │ 2152 │      │ 2212 │      ║
║     ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║     │ 4163 │ 4928 │      │ 3875 │ SLOW │ 3533 │ SLOW │ 3858 │ SLOW ║
╚═════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╝

There are clearly trade-offs among these options, so maybe no single selection of options is best?
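
The trade-offs can be made explicit by flagging every cell that falls below its -O1 counterpart in the same column. The sketch below does this for the FFT and Sparse matmult rows only. It assumes the unlabeled rows in each group follow the order of the first table (Composite Score, FFT, SOR, MonteCarlo, Sparse matmult, LU), which the column 1 values match. The 2% noise threshold is an arbitrary choice, not something derived from the measurements.

```python
# Mflops per CFLAGS set 1..5, copied from the table above.
# Row identity is an assumption: rows are taken to follow the first table's order.
RESULTS = {
    "FFT": {
        "-O1": [1430, 1428, 1414, 1410, 1389],
        "-O2": [1300, 1304, 1466, 1318, 1465],
        "-O3": [1304, 1298, 1455, 1417, 1451],
    },
    "Sparse matmult": {
        "-O1": [2179, 2180, 2212, 2217, 2204],
        "-O2": [1771, 2041, 2210, 2185, 2196],
        "-O3": [1896, 2040, 2205, 2152, 2212],
    },
}


def slower_than_o1(results, threshold=0.02):
    """Return (kernel, level, cflags_set) triples more than `threshold` below -O1."""
    flagged = []
    for kernel, by_level in results.items():
        baseline = by_level["-O1"]
        for level in ("-O2", "-O3"):
            pairs = zip(by_level[level], baseline)
            for cflags_set, (score, base) in enumerate(pairs, start=1):
                if score < base * (1.0 - threshold):
                    flagged.append((kernel, level, cflags_set))
    return flagged


if __name__ == "__main__":
    for kernel, level, cflags_set in slower_than_o1(RESULTS):
        print(f"{kernel} at {level} with CFLAGS set {cflags_set} is slower than -O1")
```

Under that threshold, sets 1 and 2 (both with -march=native) regress for both kernels at -O2 and -O3. Set 4 is also flagged for FFT at -O2 and for Sparse matmult at -O3, while sets 3 and 5 are never flagged.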

davidbolvansky commented 5 years ago

Can you measure without -march=native? I found some cases (SPEC) where -march=native regresses performance (SLP vectorizer...).

Or try to measure with -fno-unroll-loops; sometimes LLVM unrolls a lot :/

gcc with -O3 is very good (maybe because they enable -floop-interchange with -O3?).

Try measuring clang again with clang -O3 -mllvm -enable-loopinterchange.

General question: Should bots somehow track the perf of test-suite benchmarks compiled with -march=native and without -march=native?

llvmbot commented 4 months ago

@llvm/issue-subscribers-backend-x86

Author: None (llvmbot)


llvm / llvm-project

FFT and Sparse matmult in Scimark2 are slower with `-O2` & `-O3` than `-O1` #42921
