vorner / slipstream

Nudging the compiler to auto-vectorize things
Apache License 2.0

Migrate to multiversion 0.7, better matmul example #13

Closed: HadrienG2 closed this 6 months ago

HadrienG2 commented 1 year ago

So, in preparation for #8 I wanted to move to multiversion v0.7 because its API is so much simpler. But one problem was the matmul example, which made use of advanced multiversion v0.6 tricks that are not possible with v0.7.

Since that example was not a very good SIMD matmul anyway, I thought I would rewrite it into something that's AFAIK optimal modulo alignment issues. But maybe that's now a little too complex for example code. Tell me what you think!

HadrienG2 commented 1 year ago

Ping @vorner ?

vorner commented 1 year ago

Sorry. I have it open in one of my browser tabs and am trying to get around to it. From a quick glance, I'd have preferred something more lightweight as an example… but I'll give it a proper read soon.

HadrienG2 commented 1 year ago

I was wondering if it wouldn't make sense to have a dot product example, which is much simpler to optimize and which you already have as a test. The matmul example could then link back to it with an introductory comment like: "(Efficient) SIMD matrix multiplication is quite a bit harder than SIMD dot product; if you're just getting started with SIMD or this library, you may want to check out the dot product example first."

The reason why I think it's important to do matmul right even if it makes the code more complicated is that this library is fundamentally about performance optimization, and is targeted at people familiar with the craft. So if we don't show something close to the 32x speedup theoretically expected from AVX + ILP + FMA, the library may be a harder sell to people who know the hardware specs and what to expect from them.

vorner commented 1 year ago

> The reason why I think it's important to do matmul right even if it makes the code more complicated

Point taken.

OK, the plan with the dot product and the introductory example should probably work.

HadrienG2 commented 1 year ago

I've started working on a dot product example, and there I discovered the reason for some suspicious perf numbers I was getting with matmul: the compiler optimizer does not manage to optimize `slice.vectorize()` as well as the standard `slice.chunks_exact(V::LANES).map(V::new)` equivalent.

I'll see if I can fix this...

HadrienG2 commented 1 year ago

Maybe let's wait a bit, as I have some cool things in the pipeline. For example, I rolled out a small crate that, together with the new vectorizer impl, gives me much nicer loop unrolling syntax for ILP: https://github.com/HadrienG2/slipstream/blob/78190438daae1c68873060e6ac1d745d09d1cb03/examples/dot_product.rs#L93

HadrienG2 commented 6 months ago

Unfortunately, the vectorize refactor turned out to be too big for my resources. I guess I should close this, unless you're interested in the subset provided by this PR?

vorner commented 6 months ago

Well, I haven't really touched this repository in years, so I guess I don't really have the time for it either.

HadrienG2 commented 6 months ago

Closing then ;)