tcbrindle / flux

A C++20 library for sequence-orientated programming
https://tristanbrindle.com/flux/
Boost Software License 1.0
441 stars 28 forks source link

flux::iota prevents auto-vectorization. #173

Closed SylvanBrocard closed 4 months ago

SylvanBrocard commented 4 months ago

The CPPNorth talk got me excited with the filter analysis thing, but it seems even in flux, the use of filter prevents auto-vectorization. https://flux.godbolt.org/z/qjr85c4aY Here we can see that the imperative code vectorizes normally but not the flux code. There is the same problem with the stdlib range view adaptors.

Below is a comparison with Rust to show that this kind of things should be possible. My understanding of the issue is that filter being non-shape preserving is what prevents vectorization, but I thought the filter consumer analysis would help with that.

I believe there might be a way to make it work by having the filter adaptor return some optionals instead of changing the shape of the sequence. Thoughts?

tcbrindle commented 4 months ago

Hi @SylvanBrocard, thanks for the bug report! This is definitely an interesting one.

Looking at your code, the first thing that occurs to me is that the pipeline you've used (iota -> filter -> map -> sum) should be using Flux's internal iteration code path, and so the shape-changing aspect of filter shouldn't be relevant -- we just end up calling the same predicate on every element of the iota sequence anyway. I'm pretty sure that the equivalent Rust pipeline is also using internal iteration.

Indeed, if we change the data source of flux_sum to be a vector instead of flux::iota, we can see that we do now get auto-vectorisation (at least with Clang, but since that uses the same optimiser as rustc it's a more apples-to-apples comparison anyway). This strongly suggests to me that the culprit in this case is iota, rather than filter.

Going back to the original code, turning on Clang's missed vectorisation reporting hints that there's something about iota's end check that it doesn't like (as a wild guess, it can't calculate the trip count?).

To investigate, I hacked together a very quick and dirty replacement for flux::iota with a specialisation of for_each_while (Flux's internal iteration customisation point). It turns out that for some reason, the Clang auto-vectoriser really wants the end cursor to be saved into a local variable -- despite the fact that it's very likely that everything is getting inlined in flux_sum at O3, so I would have though it would be able to determine that self.end can't get modified anywhere... Anyway, after making this change, we can see that we do indeed get auto-vectorisation: https://flux.godbolt.org/z/vPsGd9qWb

So the solution in this case is for iota to gain a specialisation of for_each_while which explicitly makes a copy of the end cursor in its implementation. Fortunately, that should be pretty simple to do.

tcbrindle commented 4 months ago

...and having said all that, I do agree that an alternative version of filter which yields optionals would be an interesting adaptor in its own right (I remember @brycelelbach talking about it on an episode of ADSP once). So we could definitely look at adding that in addition to the above fix.

SylvanBrocard commented 4 months ago

(I remember @brycelelbach talking about it on an episode of ADSP once).

Thank you, that's exactly what I was thinking about but couldn't remember the name of the podcast (it's episode 124).

tcbrindle commented 4 months ago

After merging #181, Clang now generates 100% identical code to Rust for your example: https://flux.godbolt.org/z/8PWx8fsG6

Rather than adding a for_each_while specialisation just for iota as mentioned in the previous comment, I actually went for a more general solution of providing a generic for_each_while specialisation for all multipass, bounded sequences. Hopefully this means that more sequences can now benefit from auto-vectorisation as well.

Thanks very much for the bug report @SylvanBrocard, I'm happy to have been able to improve this!