mccolljr / segvec

SegVec data structure for Rust. Similar to Vec, but allocates memory in chunks of increasing size.

LinearPOT? #35

Open tower120 opened 1 year ago

tower120 commented 1 year ago

Having a Linear power-of-two mem_config would allow using cheap bit shifts instead of divisions and modulos. That said, I tested the code from Linear as-is with a PoT capacity, and it seems the compiler is able to optimize it into bit operations in an optimized build anyway...

P.S. Maybe other mem_config modes could benefit from using a PoT size as well.
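For illustration, a minimal standalone sketch (not segvec's actual code) of the index math in question. With a power-of-two segment size the div/mod form and the shift/mask form are equivalent, and optimized builds lower the former to the latter when the divisor is a compile-time constant:

```rust
// Standalone sketch: index -> (segment, offset) for a fixed PoT segment size.
const FACTOR: usize = 1024; // power of two

fn locate_div(index: usize) -> (usize, usize) {
    // with FACTOR a compile-time PoT constant, release builds lower
    // this to the same shr/and as locate_shift below
    (index / FACTOR, index % FACTOR)
}

fn locate_shift(index: usize) -> (usize, usize) {
    (index >> FACTOR.trailing_zeros(), index & (FACTOR - 1))
}

fn main() {
    assert_eq!(locate_div(5000), locate_shift(5000)); // both (4, 904)
}
```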

cehteh commented 1 year ago

That's the reason why I kept the FACTOR a const generic: when you use a power of two there, the compiler is able to optimize this pretty well, since the values are known at compile time. Nothing more is needed to benefit from this optimization; maybe it should be better documented, though.
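A hedged usage sketch of that: it assumes segvec's Linear MemConfig and a Vec-like new/push/Index API as discussed in this thread; check the crate docs for the exact type parameters and signatures.

```rust
use segvec::{Linear, SegVec};

fn main() {
    // FACTOR = 1024 is a power of two and known at compile time, so the
    // constant div/mod in the index math can be lowered to shift/and.
    let mut v: SegVec<u32, Linear<1024>> = SegVec::new(); // assumed Vec-like API
    for i in 0..5000 {
        v.push(i);
    }
    assert_eq!(v[4321], 4321);
}
```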

tower120 commented 1 year ago

Fine by me. Though LinearPOT could still benefit from the very fact of compile-time PoT checks with ilog2.
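A sketch of what such a check could look like, as a hypothetical LinearPOT-style config rather than anything in the crate; is_power_of_two and ilog2 on usize are both const-evaluable, so a non-PoT FACTOR fails at compile time:

```rust
// Hypothetical PoT-only config: reject a non-PoT FACTOR during const
// evaluation and precompute the shift amount with ilog2.
struct LinearPot<const FACTOR: usize>;

impl<const FACTOR: usize> LinearPot<FACTOR> {
    const SHIFT: u32 = {
        assert!(FACTOR.is_power_of_two(), "FACTOR must be a power of two");
        FACTOR.ilog2()
    };

    fn locate(index: usize) -> (usize, usize) {
        (index >> Self::SHIFT, index & (FACTOR - 1))
    }
}

fn main() {
    assert_eq!(LinearPot::<1024>::locate(5000), (4, 904));
    // LinearPot::<1000>::locate(5000) would fail to compile (const assert)
}
```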

cehteh commented 1 year ago

> Fine by me. Though LinearPOT could still benefit from the very fact of compile-time PoT checks with ilog2.

I am pretty sure this optimizes when used with a constant, but I admit I haven't checked it, especially not on any arch other than x86_64. Did you check the generated asm in a --release build?

Give it a try, and send a PR if it improves the performance significantly.

Meanwhile, the MemConfig trait is deliberately exported to allow users of the crate to implement their own variant when the 3 shipped variants are not sufficient. I haven't done that myself, but I am curious how that works for you; when it does not, that's worth a fix.

tower120 commented 1 year ago

As I said before, I did try division by a PoT number in godbolt. It replaces divisions with bit shifts.

The thing is:

  • This is not guaranteed behavior.

  • You may accidentally pass a non-PoT number, and that would cause a HUGE performance hit for get()s. Either you will notice that immediately, or it will take hours to find the reason later.

cehteh commented 1 year ago

> This is not guaranteed behavior.

If we won't trust trivial optimizations, then we would all still be programming in assembler. While not guaranteed, it's pretty much a given and expected that this gets optimized; if not, that's rather something to address in the compiler. Note: there are some architectures where shifts are expensive.

> You may accidentally pass a non-PoT number, and that would cause a HUGE performance hit for get()s. Either you will notice that immediately, or it will take hours to find the reason later.

I doubt anyone will pass a non-PoT number 'accidentally', since it's a constant (documentation should make this clearer), and there are valid uses for non-PoT numbers. I don't know what Jacob thinks; his crate, his decision. Maybe send a PR that implements the XxxPoT MemConfig variants.

I am curious: did you profile the performance impact of using a non-PoT number? Since we are dividing/taking the modulo by a constant, even this should optimize decently for many numbers (some bigger primes would be hard). Surely it takes more cycles, but compared to the real work you do on the data it is likely still negligible. Dividing/modulo by a variable would be the costly thing I avoided there.
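To illustrate, a compiler-dependent sketch: with a constant non-PoT divisor LLVM typically emits a multiply-and-shift sequence instead of a division, while a runtime-variable divisor forces an actual division instruction. Easy to verify on godbolt:

```rust
// Constant non-PoT divisor: usually lowered to mul+shift, no div needed.
pub fn locate_const(index: usize) -> (usize, usize) {
    (index / 1000, index % 1000)
}

// Runtime-variable divisor: requires a real division instruction; this is
// the costly case that keeping FACTOR a const generic avoids.
pub fn locate_var(index: usize, factor: usize) -> (usize, usize) {
    (index / factor, index % factor)
}
```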

mccolljr commented 1 year ago

I think we are possibly mixing two concerns here:

  1. The concern with performance, where powers of two can indeed yield better results due to the simpler approach to modulo operations.

  2. The concern with memory usage, where optimizing the specific segment size for linear storage will yield the best results due to the tightly controlled memory footprint.

I also wonder if scenarios where the speed of individual element access is important are the right scenarios to use a data structure that is designed to spread storage around the heap. It seems like the potential penalty from cache misses will pose a similar problem even if you can avoid slow modulo operations.

Based on the discussion so far, I agree that the compiler is generally very good at optimizing math operations that involve a constant value, and I think that, in general, this can be relied upon. If there are specific scenarios where more fine-grained control is needed, I am always happy to review PRs and add features that people think are useful.

As an aside, I wonder if maybe the feature to target is something else, like the ability to expose the contiguous memory of a whole segment as a slice, so that one can avoid the cost of any extra math for indexing and can control when to incur the cache miss penalty (by choosing when to start accessing the data in the next segment).
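A sketch of that idea, using a plain Vec<Vec<T>> as a stand-in for the segmented storage rather than segvec's API: the per-element index math disappears, and the cache miss on a new segment is paid once per segment instead of being scattered across element accesses.

```rust
// Walk a segmented structure one contiguous slice at a time.
fn sum_segmented(segments: &[Vec<u64>]) -> u64 {
    let mut total = 0;
    for segment in segments {
        // within a segment the data is contiguous: plain slice iteration,
        // no per-element segment/offset computation
        total += segment.iter().sum::<u64>();
    }
    total
}

fn main() {
    let segments = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    assert_eq!(sum_segmented(&segments), 21);
}
```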

Thoughts?

tower120 commented 1 year ago

In SegVec you basically have a Vec<Vec<T>>, with element access equivalent to going through **T. Accessing elements in the same segment has performance surprisingly close to per-index access of Vec elements, due to the fact that the segment address is already in cache.

cehteh commented 1 year ago

SegmentedIter/SegmentedMutIter gives you &[T]'s. It is not indexable yet; that might be a worthy modification (needs more thought).

There is still my PR #20, which could be closed; I am lamenting there about the costs and benefits of caching segments.

https://github.com/mccolljr/segvec/pull/20#issuecomment-1587417364

For me the conclusion was that the experiment worked, but Linear with a reasonably big FACTOR config is always the better choice. When one really wants fine control, then starting at a lower level and using a SegmentedIter might be the way to go.

Another thing:

I could imagine a higher-level abstraction where one gives an actual SIZE instead of a FACTOR, say:

LinearBlock<T, const SIZE: usize, const N: usize = 1>

It would then do the (constant) math to calculate how many T's will fit into SIZE (or into a multiple of SIZE when SIZE < size_of::<T>() * N), and as well align/pad allocations to SIZE. This way you can get nicely cacheline/pagesize/whatever-aligned allocations.
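The size calculation could look roughly like this; a hypothetical sketch of the constant math described above, ignoring zero-sized types and the alignment/padding part:

```rust
use core::mem::size_of;

// How many T's fit into one SIZE-byte block; when N elements need more
// than SIZE bytes, round up to the next multiple of SIZE first.
// Assumes T is not zero-sized.
const fn elems_per_block<T>(size: usize, n: usize) -> usize {
    let needed = size_of::<T>() * n;
    if needed <= size {
        size / size_of::<T>()
    } else {
        // smallest multiple of `size` that holds `n` elements
        ((needed + size - 1) / size) * size / size_of::<T>()
    }
}

fn main() {
    // 4 KiB blocks of u32: 1024 elements per block
    assert_eq!(elems_per_block::<u32>(4096, 1), 1024);
}
```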

The math part is just some sugar you can already do manually. Aligning and padding allocations is something not yet possible, but the allocator may already do that behind our backs.

Considering my previous experiments, the benefits here may be too small or nonexistent (when the allocator already outsmarts us). More serious benchmarking in real scenarios would be needed.