simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License

automated analysis of compiler results #349

Open mr-c opened 4 years ago

mr-c commented 4 years ago

https://gitter.im/simd-everywhere/community?at=5edb4999f0b8a2053ad9298d

@nemequ wrote

One day I'd like to write a script which automatically generates trivial wrappers (like the one in that godbolt link), feeds them into llvm-mca, and generates a report of the differences.

If we can get some disassembly of $FAST_COMPILER outperforming $SLOW_COMPILER, we can add an intrinsics-based version which should make $SLOW_COMPILER just as fast as $FAST_COMPILER.

We can also report the differences to the $SLOW_COMPILER developers so they can fix them in a future release of their compiler.

nemequ commented 4 years ago

My comments about this last night were kind of off-the-cuff, but this is actually something I've been thinking about for a while. A few more details about how I see this working:

Obviously I'm looking at this from SIMDe's perspective, but I think the tool would actually be useful for a huge number of projects. Basically any performance-critical code could be monitored using this system.

The way I see it there are four parts to this. The only part that is somewhat specific to SIMDe would be part 1, but that really wouldn't require too much code. I could probably be convinced to do that part. Other than that, this wouldn't really require any expertise in assembly, SIMD, or really even C.

Part 1

The first part would be to write some code to automatically generate trivial wrapper functions that just pass the arguments through to the real function. For example, for _mm_add_epi32, the wrapper would look like

#include "path/to/simde/simde/x86/sse2.h"

simde__m128i mca_wrapper_func(simde__m128i a, simde__m128i b) {
  return simde_mm_add_epi32(a, b);
}

This wouldn't be too hard to do using the XML data from the Intel Intrinsics Guide for x86. For NEON we can scrape ARM's documentation. The only somewhat tricky part is how to handle immediate-mode parameters, but I think at least initially we can just output 1, so for example the wrapper for simde_mm_slli_epi32 could look like:

#include "path/to/simde/simde/x86/sse2.h"

simde__m128i mca_wrapper_func(simde__m128i a) {
  return simde_mm_slli_epi32(a, 1);
}
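A minimal sketch of the generator in Python, assuming the intrinsic descriptions have already been extracted into simple records. The `Param` and `Intrinsic` classes, the `immediate` flag, and `generate_wrapper` are hypothetical names for illustration; parsing the Intel XML or ARM's documentation into these records is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Param:
    ctype: str                # C type, e.g. "simde__m128i"
    name: str                 # parameter name, e.g. "a"
    immediate: bool = False   # True if the argument must be a compile-time constant


@dataclass
class Intrinsic:
    name: str                 # SIMDe function name, e.g. "simde_mm_add_epi32"
    header: str               # header to include (hypothetical field)
    rettype: str              # C return type
    params: list = field(default_factory=list)


def generate_wrapper(intr, imm_value=1):
    """Return C source for a trivial pass-through wrapper around `intr`."""
    # Immediates are not exposed as wrapper parameters; a constant is passed instead.
    decl = ", ".join(f"{p.ctype} {p.name}" for p in intr.params if not p.immediate)
    args = ", ".join(str(imm_value) if p.immediate else p.name for p in intr.params)
    # A void function must not "return" the call expression in strict C.
    call = f"{intr.name}({args});"
    body = call if intr.rettype == "void" else f"return {call}"
    return (
        f'#include "{intr.header}"\n'
        "\n"
        f"{intr.rettype} mca_wrapper_func({decl or 'void'}) {{\n"
        f"  {body}\n"
        "}\n"
    )


if __name__ == "__main__":
    slli = Intrinsic(
        name="simde_mm_slli_epi32",
        header="path/to/simde/simde/x86/sse2.h",
        rettype="simde__m128i",
        params=[Param("simde__m128i", "a"), Param("int", "imm8", immediate=True)],
    )
    print(generate_wrapper(slli))
```

Always passing 1 for immediates keeps the generator simple, at the cost of only exercising one code path of functions whose implementation depends on the immediate value.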

Part 2

The next step would be to feed this into llvm-mca, and parse the output. Should be pretty straightforward.
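A rough sketch of this step in Python, under a few assumptions: the compiler accepts `-O3 -S -o -` to print assembly to stdout, `llvm-mca` is on the PATH, and only the `Name: number` lines of its summary block (such as `Total Cycles`, `IPC`, and `Block RThroughput`) are of interest. The function names and the default `mcpu` value are made up for illustration, and a real run would also need the right `-march`/target flags for each configuration.

```python
import re
import subprocess

# Matches summary lines such as "Total Cycles:      104" or "IPC:   1.92".
SUMMARY_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*?):\s+([0-9]+(?:\.[0-9]+)?)\s*$")


def parse_mca_summary(text):
    """Collect the numeric 'Name: value' summary lines from llvm-mca output."""
    summary = {}
    for line in text.splitlines():
        m = SUMMARY_LINE.match(line)
        if m:
            # Keep the first occurrence only, in case of multiple code regions.
            summary.setdefault(m.group(1), float(m.group(2)))
    return summary


def run_mca(c_file, cc="clang", mcpu="skylake"):
    """Compile `c_file` to assembly and analyze it with llvm-mca (assumed setup)."""
    asm = subprocess.run(
        [cc, "-O3", "-S", "-o", "-", c_file],
        check=True, capture_output=True, text=True,
    ).stdout
    report = subprocess.run(
        ["llvm-mca", f"-mcpu={mcpu}"],
        input=asm, check=True, capture_output=True, text=True,
    ).stdout
    return parse_mca_summary(report)
```

Keeping the parser separate from the subprocess calls means it can be tested against saved reports without having the toolchain installed.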

I'd like to be able to run this part on CI. It would probably be too expensive to run on every commit unless someone is willing to donate more resources, but a weekly cron job, or a run triggered by commits to a specific branch, might work.

It might also be worth supporting tools other than llvm-mca at this point. Running uarch-bench might be interesting, for example. For larger functions, integration with a microbenchmarking framework like google-benchmark or Nonius could also be useful.

Part 3

Next would be to take the output from part 2 and put it into some sort of database. It should be able to handle a large number of functions in multiple different configurations (mostly different compilers and target architectures). It would be nice if we could keep historical data (like for every git revision).

I don't think this would be particularly difficult, but it would likely require some sort of database that we'd have to host somewhere. Committing everything to a git repository might be workable, but it would make part 4 a lot slower.
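One possible shape for the storage layer, sketched with Python's built-in `sqlite3` module. The table layout, column names, and helper functions are assumptions for illustration, not a decided design; a hosted database would need a different connection layer but could use a similar schema.

```python
import sqlite3

# One row per (revision, function, compiler, target, metric) measurement.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    git_rev  TEXT NOT NULL,
    function TEXT NOT NULL,
    compiler TEXT NOT NULL,
    target   TEXT NOT NULL,
    metric   TEXT NOT NULL,
    value    REAL NOT NULL,
    PRIMARY KEY (git_rev, function, compiler, target, metric)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the results database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def record(db, git_rev, function, compiler, target, summary):
    """Store every metric in `summary` (a name -> number dict) for one configuration."""
    rows = [(git_rev, function, compiler, target, k, v) for k, v in summary.items()]
    with db:  # commit on success, roll back on error
        db.executemany("INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)


def history(db, function, compiler, target, metric):
    """Return (git_rev, value) pairs in insertion order, used as a proxy for commit order."""
    cur = db.execute(
        "SELECT git_rev, value FROM results "
        "WHERE function = ? AND compiler = ? AND target = ? AND metric = ? "
        "ORDER BY rowid",
        (function, compiler, target, metric),
    )
    return cur.fetchall()
```

Storing one metric per row keeps the schema stable if other tools with different metrics are added later, at the cost of slightly more verbose queries.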

Part 4

Once we have the data, there are lots of interesting ways to use it. Potential reports could include: