tcbrindle / flux

A C++20 library for sequence-orientated programming
https://tristanbrindle.com/flux/
Boost Software License 1.0

Using Flux to adapt a circular buffer: bug #133

Closed: pfeatherstone closed this issue 9 months ago

pfeatherstone commented 9 months ago

I am trying to use Flux to simplify dot products with circular buffers. I have an implementation that manually uses the STL on the two sub-ranges. This is really fast, but you have to do some manual book-keeping, and if you care about more than one STL algorithm (inner_product in this case) you have to reimplement everything yourself on the two sub-ranges.
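
Roughly, the manual version looks like this (a simplified sketch with illustrative member names; the Compiler Explorer link below has the real code):

#include <cstddef>
#include <numeric>
#include <vector>

template <typename T>
struct circular_buffer {
    std::vector<T> buf;    // storage, always full
    std::size_t    offset; // index of the logically-first element

    // Dot product against a plain vector of the same length
    T dot_product(std::vector<T> const& other) const {
        // The logical sequence wraps around, so split it into two
        // contiguous runs and call std::inner_product once per run
        auto const n1 = buf.size() - offset;
        T sum = std::inner_product(buf.begin() + offset, buf.end(),
                                   other.begin(), T{});
        return std::inner_product(buf.begin(), buf.begin() + offset,
                                  other.begin() + n1, sum);
    }
};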

I want to use Flux to define a sequence adaptor, and since a Flux sequence defines iterators, I can in theory use that with std::inner_product to compute my dot products.

However, I'm having trouble adapting the container. Here is a Compiler Explorer link that better explains my intention: https://godbolt.org/z/3PKeT5Wee. Thank you!

tcbrindle commented 9 months ago

Hi @pfeatherstone, thanks for your interest in Flux! The good news is that there isn't actually a Flux bug here :)

As a C++98 algorithm, std::inner_product expects a (begin, end) iterator pair as its first two arguments. However, as mentioned in the documentation, for a non-bounded_sequence Flux instead returns a C++20 sentinel from its end() function. (If you think about it, this makes sense: if Flux doesn't know where the end of the sequence is, it can't produce an iterator to that position in constant time.)

There are several ways we can fix this:
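
One option is to give the adaptor's sequence_traits a last() function: that makes it a bounded_sequence, so end() can return a real iterator that std::inner_product will accept. A sketch, reusing the circular_buffer from above (the shape follows the sequence_traits examples in the Flux docs; details may vary between versions):

#include <cstddef>
#include <flux.hpp>

template <typename T>
struct flux::sequence_traits<circular_buffer<T>> {
    // The cursor is a plain index into the logical sequence
    static constexpr auto first(circular_buffer<T> const&) -> std::size_t {
        return 0;
    }

    static constexpr auto is_last(circular_buffer<T> const& self, std::size_t cur) -> bool {
        return cur == self.buf.size();
    }

    static constexpr auto read_at(circular_buffer<T> const& self, std::size_t cur) -> T const& {
        return self.buf[(cur + self.offset) % self.buf.size()];
    }

    static constexpr auto inc(circular_buffer<T> const&, std::size_t& cur) -> void {
        ++cur;
    }

    // The extra customisation point: with last() defined, the type models
    // bounded_sequence, so end() can return a real iterator rather than a
    // sentinel, and std::inner_product accepts the (begin, end) pair
    static constexpr auto last(circular_buffer<T> const& self) -> std::size_t {
        return self.buf.size();
    }
};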

Unfortunately, as you can see, the runtime of the Flux version is not as good as the custom circular_buffer::dot_product() implementation. From what I can tell, this is because of the way the custom version uses its special knowledge of the circular_buffer internals. It makes two calls to std::inner_product, each of which uses a simple linear data access pattern. Because of this simple access pattern, the compiler is able to auto-vectorise both calls to inner_product, resulting in excellent run time.

For the Flux version, there is only a single call to std::inner_product over the whole sequence, but the access pattern (using v.buf[(cur + v.offset) % v.size()]) seems to be too complicated for the auto-vectoriser to see through, so we don't get any SIMD and things run slower.

This is just a consequence of the custom circular_buffer::dot_product() having extra knowledge about the implementation and the compiler being able to optimise using that knowledge. If you were to add a custom circular_buffer::iterator implementation and made a single call to std::inner_product with it, you'd very likely get the same runtime as the version using Flux.

I hope this helps! I'm going to close this as it's not a Flux bug, but feel free to comment further if you have any questions, or file other issues if you come across any other problems :)

pfeatherstone commented 9 months ago

Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.

Out of interest, do you intend to add numeric algorithms to flux, like reductions such as inner_product?

Also, is there a way to adapt a container as the concatenation of two contiguous containers?

And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences, so that they can make use of optimisations similar to what I did in dot_product?

pfeatherstone commented 9 months ago

Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adapters.

tcbrindle commented 9 months ago

> Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.

> Out of interest, do you intend to add numeric algorithms to flux, like reductions such as inner_product?

We already have quite a few such algorithms, although we don't have an inner_product with the same default operations as the std version. However, you can use flux::zip_fold() with a lambda:

std::vector<int> vec1 = get_vector1();
std::vector<int> vec2 = get_vector2();

// Fold over both sequences in lock-step, accumulating sum + (x * y)
auto fn = [](int sum, int x, int y) { return sum + (x * y); };

auto result = flux::zip_fold(fn, 0, vec1, vec2);
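
With the default operations this computes the same value as std::inner_product(vec1.begin(), vec1.end(), vec2.begin(), 0), but it works on arbitrary Flux sequences rather than iterator pairs.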

> Also, is there a way to adapt a container as the concatenation of two contiguous containers?

Yes. The flux::chain adaptor takes an arbitrary number of sequences and "glues" them together as one logical sequence. For example:

std::vector vec1{4, 5, 6};
std::vector vec2{1, 2, 3};

auto chained = flux::chain(std::move(vec2), std::move(vec1));

chained.write_to(std::cout); // prints [1, 2, 3, 4, 5, 6]

In general, the chain adaptor uses the highest common sequence category of all of its parts (except that it can never be contiguous, for obvious reasons). In this case, the chained sequence is random-access.
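
For instance, random-access operations work straight across the join (a small illustrative snippet, continuing from the code above):

// Advance a cursor 4 places from the start, crossing the seam
// between the two vectors
auto cur = flux::next(chained, flux::first(chained), 4);
std::cout << flux::read_at(chained, cur); // prints 5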

> And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences, so that they can make use of optimisations similar to what I did in dot_product?

For the (majority of) algorithms which take a single sequence and iterate over it in order from start to finish, the chain adaptor already does this. For example, something like

// using vec1 and vec2 from above
auto total = flux::chain(flux::ref(vec1), flux::ref(vec2)).sum();

will effectively call the sum() algorithm on vec1 followed by vec2 and then add the results together. This leads to excellent code generation in a lot of cases. (The key to this is specialising the for_each_while customisation point to allow internal iteration.)
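
In simplified form, the technique looks something like this (an illustration of internal iteration, not flux's actual implementation):

#include <span>

// A chain-like adaptor can implement for_each_while as one tight loop
// per part; each loop has a simple linear access pattern, so each can
// be auto-vectorised independently
template <typename T, typename Pred>
void chain_for_each_while(std::span<T> part1, std::span<T> part2, Pred pred)
{
    for (auto& x : part1) { if (!pred(x)) return; } // contiguous loop 1
    for (auto& x : part2) { if (!pred(x)) return; } // contiguous loop 2
}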

For algorithms like dot_product that operate on multiple sequences, however, things are much, much trickier to implement generically: we don't know which is the "special" sequence that we want to dispatch to, and what happens if both sequences are "special"?

I haven't been able to come up with a good way of doing this for multiple sequences, though it would certainly be nice if it turns out to be possible. In the meantime, for something like dot_product I think your approach of using a custom member function is probably going to offer the best performance.

> Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adapters.

Thank you very much, this was a major goal when I was designing the library so I'm delighted to hear it was successful!

pfeatherstone commented 9 months ago

Thank you again. I added your zip_fold version and a version using range-v3, and compared them with clang. Yeah, I think a custom algorithm is best. Sad. However, I think Flux is faster than ranges: https://godbolt.org/z/MKx69n7br

Edit: Running this on my machine I get:

STL      dot 838300, computed in 17.433369 ms
STL     rdot 671650, computed in 17.683371 ms
Flux     dot 838300, computed in 47.490080 ms
Flux    rdot 671650, computed in 47.387517 ms
Flux     dot 838300, computed in 50.761227 ms
RangeV3  dot 838300, computed in 48.475297 ms
RangeV3 rdot 671650, computed in 93.008346 ms

So maybe flux and ranges have similar performance.

tcbrindle commented 9 months ago

I have good news!

I had a look at your link and saw that you changed the implementation of read_at() for circular_buffer to try to simplify the access pattern. This gave me an idea: by removing the conditional branch and using a multiplication instead, it looks like the compiler is able to recognise what we want to do and vectorise the Flux version too, giving a run-time which is on par with the custom dot_product.
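
Roughly, the change was along these lines (a sketch of the idea):

// Branchless wrap-around: (idx >= size) is 0 or 1, so the multiply
// subtracts size exactly when the index passes the end of the buffer
static constexpr auto read_at(auto const& self, std::size_t cur)
{
    std::size_t idx = cur + self.offset;
    idx -= self.buf.size() * (idx >= self.buf.size());
    return self.buf[idx];
}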

EDIT: Sadly not, I made a mistake and it doesn't actually work :(

pfeatherstone commented 9 months ago

@tcbrindle Awesome! Though it looks like you need GCC 12 onward; GCC 11 didn't seem to optimize as well. Compilers are magic!