martinus / unordered_dense

A fast & densely stored hashmap and hashset based on robin-hood backward shift deletion
MIT License

Implement segmented_vector as replacement for std::vector #54

Closed martinus closed 1 year ago

martinus commented 1 year ago

Is your feature request related to a problem? Please describe. std::vector resizes have high memory peaks, and lots of elements need to be moved around.

Describe the solution you'd like Implement a "segmented vector": the data structure is similar to std::deque, but a bit better optimized for random access.

Describe alternatives you've considered Always double the size of the segment, and use bitcount to find out which segment to use. This is more difficult and slower though.

Additional context Would be nice to use something like that for e.g. bitcoin.
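The fixed-size-segment idea from the feature request can be sketched roughly like this (a hypothetical illustration, not the library's implementation; the name `segmented_vector_sketch` and the use of inner `std::vector`s are my own simplifications):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: fixed-size segments, so growing appends one new segment and never
// moves existing elements; no reallocation peak, and references stay stable.
template <typename T, std::size_t SegmentShift = 7> // 128 elements per segment
struct segmented_vector_sketch {
    static constexpr std::size_t segment_size = std::size_t{1} << SegmentShift;
    static constexpr std::size_t mask = segment_size - 1;

    std::vector<std::vector<T>> segments;
    std::size_t count = 0;

    void push_back(T value) {
        if ((count & mask) == 0) {                 // current segment is full
            segments.emplace_back();
            segments.back().reserve(segment_size); // one fixed-size allocation
        }
        segments.back().push_back(std::move(value));
        ++count;
    }

    // random access: segment from the high bits, offset from the low bits
    T& operator[](std::size_t i) { return segments[i >> SegmentShift][i & mask]; }
};
```

Because each segment's buffer is reserved once and never reallocated, memory overhead is bounded by one segment rather than doubling the whole vector.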

Andersama commented 1 year ago

Funnily enough, I made a structure that sounds extremely similar only a couple of months ago, but let it live on the stack. I didn't get around to making it fully dynamic, so I wouldn't have thought to do this, but you could in theory not "double" the segment size, but just your allocation size. Then fill your pointer/indexing array by spreading the pointers out across the allocation. That way you can still use / % or >> & operators to traverse the array (although the index array would be larger than normal). The awkward part would then be shuffled off to when you need to clear the data structure, by only destructing every power-of-two-indexed pointer.
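My reading of this pointer-spreading idea, as a rough sketch (the name `spread_vector`, the slot size of 4, and all details here are my own guesses at what's described; it's deliberately minimal and not copy-safe):

```cpp
#include <cstddef>
#include <vector>

// Sketch: allocations double in size (4, 4, 8, 16, ...), but the pointer
// array is filled at a fixed granularity of 4 elements per slot. Each stored
// pointer is pre-adjusted so plain divide/mask indexing keeps working.
struct spread_vector {
    static constexpr std::size_t seg = 4; // elements per pointer slot
    std::vector<int*> slots;              // spread, pre-adjusted pointers
    std::size_t capacity = 0;

    void grow() {
        std::size_t alloc_size = capacity ? capacity : seg;
        int* block = new int[alloc_size]();
        // spread the pointer across every slot this allocation covers,
        // adjusted so slots[j] + (i % seg) lands on element i
        for (std::size_t j = capacity / seg; j < (capacity + alloc_size) / seg; ++j) {
            slots.push_back(block + (j * seg - capacity));
        }
        capacity += alloc_size;
    }

    int& operator[](std::size_t i) { return slots[i / seg][i % seg]; }

    ~spread_vector() {
        // only slot 0 and the power-of-two-indexed slots own an allocation,
        // and at those indices the stored pointer equals the allocation base
        for (std::size_t j = 0; j < slots.size(); j = j ? j * 2 : 1) {
            delete[] slots[j];
        }
    }
};
```

The index array grows linearly with capacity here, which matches the "larger than normal" caveat above.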

martinus commented 1 year ago

Hi @Andersama, that's an interesting idea. I need a larger array for the pointers though for that.

Another interesting idea is to just skip indices. E.g. for a maximum segment size of 128, first allocate 4 elements with indices 0, 1, 2, 3. Then allocate 8 elements, but they get the indices 128, 129, 130, ..., 135, and so on until the maximum segment size is reached. Indexing is then still data[idx >> 7][idx & 127]. The disadvantage is that several indices are unused, but for use as a base for the unordered_dense map that would be ok.
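The skip-index layout can be pinned down in a few lines (an illustration using the numbers above; the helper names are hypothetical):

```cpp
#include <cstddef>

// Segment k spans a full 128-index window starting at k * 128, but only its
// first min(4 << k, 128) indices are backed by memory. The lookup therefore
// stays data[idx >> 7][idx & 127] even while the early segments are small;
// the unused tail of each small segment's window is the wasted index space.
constexpr std::size_t max_segment = 128;

// first index of segment k's window
constexpr std::size_t segment_base(std::size_t k) { return k * max_segment; }

// elements actually allocated for segment k: 4, 8, 16, ..., capped at 128
constexpr std::size_t segment_capacity(std::size_t k) {
    std::size_t cap = std::size_t{4} << k;
    return cap < max_segment ? cap : max_segment;
}
```

Once the cap is reached every window is fully used, so the waste is a one-time cost in the first few segments.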

Andersama commented 1 year ago

What about std::countl_zero to index into the top array? Something like:

size_t ptr_idx = 64 - std::countl_zero(index);
size_t lower_mask = (1 << ptr_idx) - 1;
return data[ptr_idx][index & lower_mask];

I might have an off by one error in there but you get the idea.

Then your index array can be a fixed 64 ptrs wide.

I'm sure there's a better way of doing this*; this fails at indices ~{0} and ~{0} - 1


struct dual_index {
    size_t upper = {};
    size_t lower = {};
};

//capacities of each allocation are assumed to be powers of 2
constexpr dual_index split_index(size_t index) noexcept {
    dual_index result = {};
    // make a full 0xff...ff mask
    constexpr decltype(index) all_mask = ~decltype(index){0};
    constexpr size_t adjust = 1; // use 1 or higher; picks the starting size (the 2x growth pattern holds at 1 and up)
    size_t input = index + (1ULL << (adjust - 1)); // the top (1 << (adjust - 1)) indices will fail

    size_t zero_countl = std::countl_zero(input);
    // categorize the input by powers of 2; we flip the output range on its head
    // so that lower values -> lower indices
    size_t zero_countl_flipped = (sizeof(index) * 8 - zero_countl);

    // create a bitmask from our index, we want it to lop off the top msb from our input
    size_t mask = (all_mask >> zero_countl) >> 1;

    result.upper = zero_countl_flipped - adjust;
    result.lower = mask & input;

    return result;
}

Here's an alternate:

constexpr dual_index split_index(size_t index) noexcept {
    dual_index result = {};
    // make a full 0xff...ff mask
    constexpr decltype(index) all_mask = ~decltype(index){0};
    // make a mask discarding top bit
    constexpr decltype(index) discard_mask = all_mask >> 1;

    constexpr size_t adjust = 1; // use 1 or higher; picks the starting size (the 2x growth pattern holds at 1 and up)
    size_t input = index + (1ULL << (adjust - 1)); // the top (1 << (adjust - 1)) indices will fail

    size_t zero_countl = std::countl_zero(input);
    // categorize the input by powers of 2; we flip the output range on its head
    // so that lower values -> lower indices
    size_t zero_countl_flipped = (sizeof(index) * 8 - zero_countl);

    // create a bitmask from our index; we want it to lop off the top msb from our input
    size_t mask = discard_mask >> zero_countl;

    result.upper = zero_countl_flipped - adjust;
    result.lower = mask & input;

    return result;
}

To save one addition we can do this:

constexpr dual_index split_index_quick(size_t index) {
    dual_index result = {};
    // make a full 0xff...ff mask
    constexpr decltype(index) all_mask = ~decltype(index){0};
    // make a mask discarding top bit
    constexpr decltype(index) discard_mask = all_mask >> 1;

    constexpr size_t adjust = 1; // use 1 or higher; picks the starting size (the 2x growth pattern holds at 1 and up)
    constexpr size_t discarded = (1ULL << (adjust - 1));

    size_t input = index + discarded; // the top (1 << (adjust - 1)) indices will fail

    // categorize the input by powers of 2;
    // if we reserve our allocations and go in reverse order we can use this
    size_t zero_countl = std::countl_zero(input); // w/o any extra work

    // create a bitmask from our index; we want it to lop off the top msb from our input
    size_t mask = discard_mask >> zero_countl;

    result.upper = zero_countl;
    result.lower = mask & input;

    return result;
}

Here's the version recovering the discarded values*

constexpr dual_index split_index_safe(size_t index) noexcept {
    dual_index result = {};
    // make a full 0xff...ff mask
    constexpr decltype(index) all_mask = ~decltype(index){0};
    // make a mask discarding top bit
    constexpr decltype(index) discard_mask = all_mask >> 1;

    constexpr size_t adjust = 4; //use 1 or higher, picks the starting size (2x pattern of growth holds at 1 and up)
    constexpr size_t discarded = (1ULL << (adjust - 1));
    constexpr size_t fullcapacity = bitspread_left(discarded);

    size_t input = index + discarded; // the top (1 << (adjust - 1)) indices would fail, handled below

    size_t zero_countl = std::countl_zero(input);
    // categorize the input by powers of 2; we flip the output range on its head
    // so that lower values -> lower indices
    size_t zero_countl_flipped = (sizeof(index) * 8 - zero_countl);

    // create a bitmask from our index, we want it to lop off the top msb from our input
    size_t mask = discard_mask >> zero_countl;

    if (input > index) { // should be an overflow flag
        result.upper = zero_countl_flipped - adjust;
        result.lower = mask & input;
    } else {
        result.upper = ((sizeof(index) * 8) + 1) - adjust;
        result.lower = index - fullcapacity;
    }

    return result;
}
martinus commented 1 year ago

Thanks for the effort @Andersama, I actually implemented that approach using the CLZ intrinsic a while ago (I can't use std::countl_zero because I want this to work in C++17). I've got an older version here: https://gist.github.com/martinus/a5d1c5114cfd1930eec9b620a7803f7a#file-stable_vector-h-L289

For some reason that was slower though, even when I used a fixed size std::array for indexing on the stack. Also I kind of like the incremental size when allocating fixed size segments, that way there is never a lot of memory overhead.

Andersama commented 1 year ago

Interesting, what if you did:

    constexpr auto operator[](std::size_t i) const noexcept -> T const& {
        if (i < StartSize) { //not sure what StartSize is, probably could drop this condition
            return svo()[i];
        }

        auto msab = most_significant_active_bit(i);

        // the highest set bit of i selects the block
        auto* block = m_blocks[msab]; // the allocated blocks, in reverse order

        // clear the highest bit to get the index within the block
        return block[i ^ (std::size_t{1} << msab)];
    }
martinus commented 1 year ago

done! Just merged