mccolljr / segvec

SegVec data structure for rust. Similar to Vec, but allocates memory in chunks of increasing size.
MIT License
33 stars 4 forks source link

const generics, FACTOR, START_SIZE #16

Closed cehteh closed 1 year ago

cehteh commented 1 year ago

Would it be possible for future versions to make the factor const generic and as well add a 'START_SIZE' const generic:

START_SIZE is the minimum capacity of segements[0].

type SegVec<V> = SegVecDetail<V, 1,1>; // backward compatible except the 'factor' API's become deprecated/NOP

// needs better name
struct SegVecDetail<V, const FACTOR: usize, const START_SIZE: usize> {
    len: usize,
    capacity: usize,
    segments: inner::Segments<T, START_SIZE>,
}

Rationale for FACTOR: This shaves another word from the structure and poses some better optimization opportunities to the compiler since 'factor' becomes constant (it cant be changed at runtime anyway).

Rationale for START_SIZE: This is the initial size for the first segment thus as well the minimum capacity. Currently the very first elements do a lot small/single allocations, these costs could be avoided when the first segment can hold more than 1 element. For indexing to work correctly this must be a constant as well, thus a const-generic would be appropriate.

mccolljr commented 1 year ago

@cehteh I like these suggestions, I think they make sense. I think my initial implementation came before const generics were fully stabilized, so this seems like a worthwhile evolution of the data structure.

I have some time later this weekend for personal projects, so I'll take a look at an initial implementation at that time.

mccolljr commented 1 year ago

Looking at the code, it seems that with the current implementation, FACTOR and START_SIZE end up referring to the same value - the size of the first segment will always be FACTOR*1. Do you have a use case where you would need FACTOR and START_SIZE to differ?

cehteh commented 1 year ago

On 2023-06-03 15:29, Jacob Ryan McCollum wrote:

Looking at the code, it seems that with the current implementation, FACTOR and START_SIZE end up referring to the same value - the size of the first segment will always be FACTOR*1. Do you have a use case where you would need FACTOR and START_SIZE to differ?

Seems I misunderstood the docs then, I'll look at the code/details next days.

What I had in mind was that one can select the a initial size and the constant/linear/exponential factor independently.

Simple thought experiment/brainfart:

with start-size=1/factor=1 grows constant, segments have the following sizes (impratical of couse): [1, 1, 1, 1, ...] changing start_size to 32 would become [32, 32, 32, 32...]

so 'factor=1' allows for slow grow at the cost for a potentially huge sector table with small segments.

with factor=2 things change/ becoming exponential: [1, 2, 4, 8, ...] [32, 64, 128, 256, ..]

factor 4 would be [1, 4, 16, 64, ...] [32, 128, 512, 2048, ...]

Actually this grows pretty fast it would be only practical for very small factors. I am wondering if one could come up with different schemes that don't grow so aggressively (having a parameter for linear grow?). linear=4: [1,5,9,13, ..]

Note 1: eventually the first segment could be a (START_SIZE) array inlined on the stack (like smallvec does).

Note 2: the SmallVec for the segments is currently hardcoded to 3 elements. This could be parametized too.

Inlining arrays on the stack should be optional. It will be pretty costly when one moves bigger SegVecs around while the absence for double derefencing in multiple places would certainly improve performance when data is on the stack.

My current use case is in a parser where I only need append-only Vec's but I want to optimize the case where resizing wont copy all data around. Expressions there can be vary greatly in length from tiny one-liners to expressions (lisplike) spanning very many tokens.

-- Reply to this email directly or view it on GitHub: https://github.com/mccolljr/segvec/issues/16#issuecomment-1575232344 You are receiving this because you were mentioned.

Message ID: @.***>

mccolljr commented 1 year ago

Seems I misunderstood the docs then, I'll look at the code/details next days.

I can almost certainly improve the docs, they were something of a last-minute addition when I decided to publish this. If you open issues for any portions you found confusing, I will make changes to make things more clear.

I'll see if I can shed some light, though: the way factor works right now is like this:

Another way to phrase this is that each additional segment doubles the capacity of the SegVec. Concretely, if you have a SegVec with a factor of 2, your storage would look like this:

[
  [_, _]                   // segment 0: size 1 * 2 = 2
  [_, _]                   // segment 1: size (pow(2,1-1) * 2) = 2
  [_, _, _, _]             // segment 2: size (pow(2,2-1) * 2) = 4
  [_, _, _, _, _, _, _, _] // segment 3: size (pow(2,3-1) * 2) = 8
  /* 16, 32, etc... */
]

The intention behind this was to adapt the doubling re-allocation scheme that Vec uses (used? maybe this changed) but without the need to keep excess memory around if the collection changed sizes. I added the factor concept to enable using larger "base" buckets to allow users to tune the number of elements per segment. For now, if you're looking for fixed-size buckets, there is https://crates.io/crates/bucket_vec (and a few others whose names I forget).

I am wondering if one could come up with different schemes that don't grow so aggressively

I'll think on this, I like the idea of a unified data structure that can use an exponential scheme or a linear scheme depending on the needs of a particular collection.

Inlining arrays on the stack should be optional.

The usage of smallvec is gated behind the small-vec feature. If you use the default features, the inner structure will be Vec<Vec<T>>. If you use the small-vec feature flag, this becomes SmallVec<[Vec<T>; 3]>. The downside to this approach is that if you enable the small-vec feature, it uses SmallVec for all SegVec values in your program. I came up with a plan to make the segment storage generic via a new Storage trait so that it is possible to roll your own storage mechanism(s), but this is still in-progress.

cehteh commented 1 year ago

Some thoughts below, dunno if you want to go that far, but I wanted to write that down anyway.

On 2023-06-03 22:21, Jacob Ryan McCollum wrote:

Seems I misunderstood the docs then, I'll look at the code/details next
days.

I can almost certainly improve the docs, they were something of a last-minute addition when I decided to publish this. If you open issues for any portions you found confusing, I will make changes to make things more clear.

I'll see if I can shed some light, though: the way factor works right now is like this:

  • the segment at position 0 has a capacity of 1 * factor
  • the segment at any position i where i is > 0, has capacity pow(2, i-1) * factor

Another way to phrase this is that each additional segment doubles the capacity of the SegVec. Concretely, if you have a SegVec with a factor of 2, your storage would look like this:

[
 [_, _]                   // segment 0: size 1 * 2 = 2
 [_, _]                   // segment 1: size (pow(2,1-1) * 2) = 2
 [_, _, _, _]             // segment 2: size (pow(2,2-1) * 2) = 4
 [_, _, _, _, _, _, _, _] // segment 3: size (pow(2,3-1) * 2) = 8
 /* 16, 32, etc... */
]

The intention behind this was to adapt the doubling re-allocation scheme that Vec uses (used? maybe this changed) but without the need to keep excess memory around if the collection changed sizes. I added the factor concept to enable using larger "base" buckets to allow users to tune the number of elements per segment. For now, if you're looking for fixed-size buckets, there is https://crates.io/crates/bucket_vec (and a few others whose names I forget).

I am wondering if one could come up with different schemes that don't grow so aggressively

I'll think on this, I like the idea of a unified data structure that can use an exponential scheme or a linear scheme depending on the needs of a particular collection.

After some thinking I'd like the idea of having some generalized formula like:

new_segment_size = current_capacity * E

where: E is 0 (for linear growth) or 1 for exponential growth, bigger numbers would (likely) grow too fast.

FACTOR is the size of segment[0] (how about naming this BASE_SIZE or similar?) having everything multiple of this simplifies the math a lot

FACTOR * N adds linearly, may be 0 or 1 as well, but larger numbers for N work here too.

FACTOR * P adds proportionally, same notes as above

Exponential growth is likely not necessary for SegVec because you don't have realloc/copy costs. it only boils down to keeping the segments array small enough, optimize for cache hits.

With const generics the compiler should be able to optimize the math here pretty well removing parts of the formula that have no effect (0) or remove unnecessary multiplication (1).

Inlining arrays on the stack should be optional.

The usage of smallvec is gated behind the small-vec feature. If you use the default features, the inner structure will be Vec<Vec<T>>. If you use the small-vec feature flag, this becomes SmallVec<[Vec<T>; 3]>. The downside to this approach is that if you enable the small-vec feature, it uses SmallVec for all SegVec values in your program. I came up with a plan to make the segment storage generic via a new Storage trait so that it is possible to roll your own storage mechanism(s), but this is still in-progress.

I don't see any use of thinvec here, thinvec uses simple pointers to the allocation but the metadata is stored there, so there is no benefit in the SegVec case I think.

If you want to go to the extreme and permit unsafe code SegVec could boil down to:

[repr(align(64))] // optional, cacheline

struct SegVec<T, ..CONFIG_CONSTANTS..> { len: usize, segments: SmallVec<[*mut T; 5]>, }

// assert_eq!(size_of::<SegVec<()>>(), 64) // on 64 bit systems

using raw pointers to allocations here because these point to allocations of potentially uninitialized data.

Segment sizes and capacity can be calculated by arithmetic, the segment/offset to which a index belongs as well. Thus we do not need to store length and capacity of the segments (as nested Vec's currently do) this saves a lot memory in the segment table, up to 5 segments can be addressed from a single cacheline (When I count right). If configured well then the double-deref over the segments table will become really fast.

Disadvantages of this approach would be:

Second thought about inlining data: Nah, was some bad idea, I wont need it anyway, it violates the pointer stability guarantee when a SegVec becomes moved.

-- Reply to this email directly or view it on GitHub: https://github.com/mccolljr/segvec/issues/16#issuecomment-1575398218 You are receiving this because you were mentioned.

Message ID: @.***>

cehteh commented 1 year ago

Closing this, as things got implemented and merged now (meanwhile slightly different than the initial thoughts, but performing well)