ryao commented 1 year ago

Idea

We should try using superoptimization on key assembly routines to make them faster.

Background

The term optimization originated in mathematics where it means the perfect result, but achieving perfection in computer science is often infeasible, so compiler optimization focuses on achieving a better result. In 1987, a paper was published that coined the phrase superoptimization to refer to the original mathematical meaning. It achieved the impressive result, of optimizing the following C code to 4 instructions without branching:

int signum(x) {
    if (x > 0) return 1;
    else if (x <  0) return -1;
    else return 0;
}

add.l d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1

In x86 assembly, this can be further reduced to 3 instructions:

cwd
neg ax
adc dx, dx

https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/superoptimizer/ https://research.swtch.com/superopt

In principle, superoptimization can produce assembly output that is well beyond the state of the art from production compilers, and can match or even exceed the output produced by hand.

The downside of superoptimization is that the run time is exponential. This made the first superoptimizer only useful for short functions of say ~10 instructions or less. However, modern work has increased the lengths of routines that can be superoptimized.

I first heard about superoptimization in a graduate class on compilers at Stony Brook university. There was not much said on it, other than that some industries will apply superoptimization to software where the compiler will take more than a month to run. Specifics, such as how large the codebases that took more than a month to compile with a superoptimizer were, or what software was used to perform superoptimization, were not given.

Available superoptimizers relevant to us

I was only able to find 3 superoptimizers that apply to the architectures we support:

STOKE

There is an interesting paper from Stanford that describes a super optimizer for x86-64:

https://cs.stanford.edu/people/eschkufz/docs/asplos_13.pdf

The stanford paper’s code is available on GitHub:

https://github.com/StanfordPL/stoke

However, it requires GCC 4.9 to compile, which is no longer available on Gentoo, so I have not been able to try it.

Souper (for LLVM)

There is also the Souper optimizer for LLVM bytecode that works on multiple architectures:

https://github.com/google/souper

Unfortunately, that generality causes souper to fail to produce the kind of results that target specific superoptimizers produce. I have already tested it on #14234 for the native, byteswap and fini functions (minus the FP/SIMD register restore). I had hoped for an improved the mixing matrix calculation, but it failed to produce an improvement. I suspect that it not being target specific could be the reason, since the clever results made by superoptimizers often rely on ISA specifics. Another possible reason is that perhaps the assembly output (for the unrolled mixing matrix calculation) is already optimal at the level of LLVM bytecode. That would explain why a few unnecessary permutations appear to be used in the output on the byteswap case on ppc64. There are also some register spills on ppc64 that are not present on both aarch64 and x86-64.

Note that its build_deps.sh script failed until I patched it on my machine:

diff --git a/build_deps.sh b/build_deps.sh
index f823284..c791a0d 100755
--- a/build_deps.sh
+++ b/build_deps.sh
@@ -59,10 +59,10 @@ mkdir -p $alivedir $alive_builddir
 git clone $alive_repo $alivedir --branch $alive_commit

 if [ -n "`which ninja`" ] ; then
-  (cd $alive_builddir && cmake ../alive2 -DZ3_LIBRARIES=$z3_installdir/lib/$Z3_SHAREDLIB -DZ3_INCLUDE_DIR=$z3_installdir/include -DCMAKE_BUILD_TYPE=$llvm_build_type -GNinja)
+  (cd $alive_builddir && cmake ../alive2 -DZ3_LIBRARIES=$z3_installdir/lib64/$Z3_SHAREDLIB -DZ3_INCLUDE_DIR=$z3_installdir/include -DCMAKE_BUILD_TYPE=$llvm_build_type -GNinja)
   ninja -C $alive_builddir
 else
-  (cd $alive_builddir && cmake ../alive2 -DZ3_LIBRARIES=$z3_installdir/lib/$Z3_SHAREDLIB -DZ3_INCLUDE_DIR=$z3_installdir/include -DCMAKE_BUILD_TYPE=$llvm_build_type)
+  (cd $alive_builddir && cmake ../alive2 -DZ3_LIBRARIES=$z3_installdir/lib64/$Z3_SHAREDLIB -DZ3_INCLUDE_DIR=$z3_installdir/include -DCMAKE_BUILD_TYPE=$llvm_build_type)
   make -C $alive_builddir -j $ncpus
 fi

GNU Superopt

Another superoptimizer is the GNU superoptimizer, which supports a range of architectures:

https://savannah.gnu.org/projects/superopt/

Unfortunately, it appears to only support the ISAs in use 25 years ago, which predate SIMD ISA extensions (and even x86-64), so it is useless for improving our assembly routines, unless we extend it, which is a rabbit hole that I do not recommend.

ryao commented 1 year ago

Here is a feasibility study from 2014:

https://www.embecosm.com/appnotes/ean15/ean15.pdf

It mentions both STOKE and GNU Superopt. STOKE is based on machine learning and probably should not be called a superoptimizer. That explains its failure to super optimize a test function in the STOKE paper (which I skimmed).

lin72h commented 1 year ago

it's an eye opener. thanks

hanno-becker commented 7 months ago

If the core routines are already written in assembly, you can also consider https://github.com/slothy-optimizer/slothy for further micro-optimization, rather than peephole optimization.

openzfs / zfs

Superoptimization of key assembly routines #14260