tower120 / hi_sparse_bitset

Hierarchical sparse bitset

Bitsets with more than 16m values #32

Closed hniksic closed 6 months ago

hniksic commented 6 months ago

I'm trying to test hi_sparse_bitset on some custom in-memory indexes. BitSet is generic over a configuration type, and the crate offers 64-bit, 128-bit, and 256-bit configurations. If my understanding is correct, the largest one, _256bit, supports up to 16777216 distinct values. Is that correct?

For my use case, I need to support up to 50m entities. Can I add a larger configuration? I looked at the definitions of existing ones, but it was not exactly obvious how to come up with a new one.

tower120 commented 6 months ago

Yes, you're correct about the max range (three levels of 256-bit blocks give 256^3 = 16_777_216): https://docs.rs/hi_sparse_bitset/0.6.0/hi_sparse_bitset/config/index.html.

Implementing BitBlock

You can add a larger configuration. You probably want a _512bit configuration for that. You first need to implement a 512-bit block for BitBlock. See an example implementation here: https://docs.rs/hi_sparse_bitset/0.6.0/src/hi_sparse_bitset/bit_block.rs.html#170. I recommend using a SIMD intrinsic for your platform if it supports one (like https://doc.rust-lang.org/core/arch/x86/struct.__m512i.html), or something like [u128; 4] otherwise. I hope that will be quite straightforward.
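
For a feel of what the non-SIMD fallback involves, here is a rough sketch of the core bit operations for a [u128; 4]-backed 512-bit block. The names (Block512, get_bit, set_bit) are illustrative; the real BitBlock trait has its own, different method set:

// Hypothetical 512-bit block built from four u128 lanes.
// This only sketches the underlying bit manipulation, not the real trait impl.
#[derive(Clone, Copy, Default, PartialEq, Eq)]
struct Block512([u128; 4]);

impl Block512 {
    const ZERO: Self = Block512([0; 4]);

    #[inline]
    fn get_bit(&self, index: usize) -> bool {
        debug_assert!(index < 512);
        self.0[index / 128] & (1u128 << (index % 128)) != 0
    }

    #[inline]
    fn set_bit(&mut self, index: usize, value: bool) {
        debug_assert!(index < 512);
        let mask = 1u128 << (index % 128);
        if value { self.0[index / 128] |= mask } else { self.0[index / 128] &= !mask }
    }

    #[inline]
    fn is_zero(&self) -> bool {
        self.0 == [0; 4]
    }

    #[inline]
    fn count_ones(&self) -> u32 {
        self.0.iter().map(|lane| lane.count_ones()).sum()
    }
}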

Implementing Config

Now you need to make a Config struct with that 512-bit block. A (512, 256, 512) config is enough to reach a 512 x 256 x 512 = 67_108_864 range. You can take https://docs.rs/hi_sparse_bitset/0.6.0/src/hi_sparse_bitset/config.rs.html#150 as an example. Keep in mind that the index-pointer type at each level should be able to store the max number of blocks in the next level.

pub struct _512bit;
impl Config for _512bit {
    type Level0BitBlock = block_512;

    // Here we use u16 instead of u8, since u8 cannot index 512 elements.
    // You can still use u8 here if you are somehow sure that you'll have <256 blocks in level 1.
    type Level0BlockIndices = [u16; 512];

    type Level1BitBlock = block_512;

    // The data level can have 512*512 blocks max. We need u32 for that.
    // Again, if you can guarantee that you'll have fewer blocks than the theoretical max,
    // you can use a smaller type.
    type Level1BlockIndices = [u32; 512];

    type DataBitBlock = block_512;

    // This is the default one, used in all pre-defined configs.
    // It has an effect only on `Reduce`.
    type DefaultCache = FixedCache<32>;
}

// This is optional. For SmallBitSet.
impl SmallConfig for _512bit {
    // Same element type as in Level1BlockIndices.
    // 12 is the size of the inlined storage for index-pointers.
    //
    // SIMD types usually have a large alignment - 64 bytes here.
    // A block roughly looks like:
    // SmallBlock{ mask: SIMD, mask_pops: Level1MaskU64Populations, small_indices: Level1SmallBlockIndices }
    // Level1SmallBlockIndices + Level1MaskU64Populations should fit in 64 bytes,
    // otherwise the Block instantly grows 64 bytes larger due to alignment.
    type Level1SmallBlockIndices = [u32; 12];

    // See https://docs.rs/hi_sparse_bitset/0.6.0/hi_sparse_bitset/config/trait.SmallConfig.html#associatedtype.Level1MaskU64Populations
    type Level1MaskU64Populations = [u16; 8];
}
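
Once the block and config are in place, usage should look just like the built-in configs. A hedged sketch (Default construction and the insert/contains calls are assumed from their mentions elsewhere in this thread):

// Hypothetical usage; this config's max index is 512 * 512 * 512 - 1 = 134_217_727.
type BigBitSet = hi_sparse_bitset::BitSet<_512bit>;

let mut set = BigBitSet::default();
set.insert(50_000_000);
assert!(set.contains(50_000_000));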

I wanted to make a macro for this, but decided not to pollute the crate namespace with it...

P.S. Alternatively, it is theoretically possible to increase the level depth (currently 3 in total), which would greatly increase the max range, but at the price of performance and memory overhead. Configurable depth is not implemented, though, and I'd rather not implement it, at least until Rust gets better support for meta-programming.

tower120 commented 6 months ago

I'm trying to test hi_sparse_bitset on some custom in-memory indexes.

What do you mean? What is "in-memory indexes"?

hniksic commented 6 months ago

Thanks for the detailed response. I will probably not attempt to implement the 512-bit block myself, as I don't use nightly and don't have the time to get familiar with the implementation. I hoped to test hi_sparse_bitset as a drop-in replacement for roaring, the way I tried roaring as a drop-in replacement for fixedbitset. (roaring's contains() was too slow for us.)

What is "in-memory indexes"?

I was referring to a data structure that serves a similar purpose to a database index, but lives in memory and is not persisted - it is built from scratch at application startup. For example, assume you have some data:

struct Entity {
    status: Status, // some enum, convertible to u32
    start_date: NaiveDate,
    end_date: NaiveDate,
    name: CompactString,
    system_type: SystemType, // same deal as Status
    targets: Vec<Target>, // a multi-value property; Target is also an enum coercible to u32
    // ... a dozen more properties ...
}

We have up to 50 million such entities, and need to be able to query entities restricted by some of these properties - e.g. by start_date being in some range, system_type being SystemType1 or SystemType2, targets containing a particular Target, etc. Any combination of these restrictions can be issued by the frontend, but they're all logically AND-ed. The result set has to be ordered by a display field (a separate index exists for that purpose) and narrowed down to a page specified by the frontend. It's perfectly possible for a query to result in a couple million entities, and the backend needs to be able to order them and return the page consisting of the first 50 or the last 50 or anywhere in the middle. (The first couple of pages will typically be requested, but the API needs to support getting any page.)

Modeling this in an SQL database proved very challenging - we tried sqlite and postgres - because a significant percentage of these entities change from day to day based on external data. These daily updates were unacceptably slow because the 30+ indexes for searching and ordering made inserts and updates crawl. Neither rebuilding the database from scratch (with index building delayed until the end) nor updating only the entities that changed from the previous day resulted in satisfactory performance. Also, even with all the indexes, retrieval performance wasn't exactly stellar, and the more complex queries could take several seconds to complete.

We are now prototyping an approach that just keeps everything (data and indexes) in memory, as we are running on servers with a lot of memory for other reasons. If we ignore sorting for a moment, a lookup index is a BTreeMap<Value, Vec<Rowid>>. Both Value and Rowid are u32, with Rowid being an index into the Vec<Entity>. A query over three different properties needs to construct bitsets out of these Vec<Rowid>s and intersect them. A "between" query on something like start_date traverses the relevant part of the map and constructs a bitset that is the union of the rows - and this bitset takes part in the intersection.

Here I'd like to replace BTreeMap<Value, Vec<Rowid>> with BTreeMap<Value, Bitset>, so that I save the time taken to convert Vec<Rowid> to a bitset. That works for properties with a small number of distinct values, but takes too much memory for properties with a large number of possible values, such as start_date or a hypothetical start_datetime (where each entity could have a different value). In my current prototype I have an enum that is BTreeMap<Value, FixedBitSet> for properties with fewer than a threshold of distinct values, and BTreeMap<Value, Vec<Rowid>> for properties with many distinct values. This works nicely because properties with many values typically have few entities associated with each value, so creating a bitset on the fly is cheap. It still feels like a hack, though, and I wanted to see if hierarchical bitsets would allow me to always have a BTreeMap<Value, Bitset>, in order to simplify the code and possibly enable other speedups.
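
To make that shape concrete, here is a minimal, self-contained sketch of one lookup index and the AND-ing step using fixedbitset; all names (LookupIndex, bitset_for_range, intersect_all, num_rows) are hypothetical:

use std::collections::BTreeMap;
use std::ops::RangeInclusive;
use fixedbitset::FixedBitSet;

type Value = u32;
type Rowid = u32;

// Hypothetical lookup index: value -> row ids having that value.
struct LookupIndex {
    by_value: BTreeMap<Value, Vec<Rowid>>,
    num_rows: usize,
}

impl LookupIndex {
    // Union of the posting lists for all values in `range`, as one bitset.
    fn bitset_for_range(&self, range: RangeInclusive<Value>) -> FixedBitSet {
        let mut set = FixedBitSet::with_capacity(self.num_rows);
        for (_, rowids) in self.by_value.range(range) {
            set.extend(rowids.iter().map(|&r| r as usize));
        }
        set
    }
}

// All restrictions are logically AND-ed: intersect the per-property bitsets.
fn intersect_all(mut sets: Vec<FixedBitSet>) -> Option<FixedBitSet> {
    let mut acc = sets.pop()?;
    for s in &sets {
        acc.intersect_with(s);
    }
    Some(acc)
}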

Sorting and pagination also present interesting challenges, but this comment has already gotten too long.

tower120 commented 6 months ago

I will probably not attempt to implement the 512-bit block myself, as I don't use nightly and don't have the time to get familiar with the implementation.

Ok, I think I can do that for you. What is your target platform?

roaring's contains() was too slow for us. ....

From what you described, it looks like you want reduce_w_cache(ops::And, bitsets.iter(), FixedCache<N>), where N is the max number of bitsets - or, if that is too big (like thousands), use DynamicCache. Then you can iterate the result directly. If you need only part of the result, you should intersect with a generative RangeBitset (there is no such thing in the lib yet, but it is fast to do, and I can make one).

I don't understand why you specifically need contains() - but you can have contains on a lazy Reduce as well. Though keep in mind that it will intersect the queried block on each contains. So if you query the result with contains() often, you probably want to materialize it into a concrete BitSet first. There is still no FAST way to materialize a bitset from a lazy one, though... The only way for now is filling it with insert or with "from usize iterator"... It should be doable on a per-block basis instead...

Also, if your intersection will be HUGE - like hundreds of thousands of elements - you can split the intersection session with a Cursor, and chew through the intersection in a few attempts. That trick will work even if the source bitsets change between sessions.

Otherwise, the described task should be an ideal fit for hi_sparse_bitset.

hniksic commented 6 months ago

Ok, I think I can do that for you. What is your target platform?

Thanks, but please note that I cannot promise that I will actually use it. FWIW we are currently targeting x86-64-v3.

Fast contains() is needed for the sorting/pagination step. There are basically two ways to sort. One is to assemble the result and sort it by a pre-prepared index that efficiently maps a row to its correct order among all rows:

fn extract_sorted(&self, order_by: &OrderBy, results: FixedBitSet, range: &Range<usize>) -> Vec<Rowid> {
    let order_index: &Vec<u32> = &self.order_indexes[order_by]; // maps row ids to their sort order by field
    let mut results: Vec<u32> = results.ones().map(|rowid| rowid as u32).collect();
    results.par_sort_unstable_by_key(|&rowid| order_index[rowid as usize]);
    results[range.clone()].to_vec()
}

Here the sorting step takes a non-negligible amount of time when the result set is large, and even collecting into the Vec is not exactly free. So I switched to a different tactic, where I iterate by the order index and weed out results based on whether they're present in the bitset:

fn extract_sorted(&self, order_by: &OrderBy, results: FixedBitSet, range: &Range<usize>) -> Vec<Rowid> {
    // here the index is iterated as row ids in display order
    let order_index: &Vec<u32> = &self.order_indexes[order_by];
    order_index
        .iter()
        .filter(|&&rowid| results.contains(rowid as usize))
        .copied()
        .take(range.end)
        .skip(range.start)
        .collect()
}

This is especially fast if the result set covers most of the data and we are looking for a page near the beginning. I might even do one or the other depending on the size of the result set. But this should illustrate why fast contains() is a win, and FixedBitSet's contains() is, of course, blazing-fast.

Another motivation for fast contains() is text search. In the current prototype we don't have a dedicated index for full-text search like we had in SQL; we simply do a brute-force search, relying on the combination of a large number of CPUs and the raw speed of Rust's regex engine (and the simplicity of the regexes produced by our search). We still want to search only the results of the by-value lookups discussed above. This can again be done by first extracting the rows into a Vec and then iterating it in parallel, or by carving up the whole id space and using contains() to test whether each id is eligible for text search, storing the result of the search in a bitset which is then intersected with the bitset that holds the results of the other queries. This strategy, too, relies on efficient contains().
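
A sketch of the carve-up variant, under stated assumptions (rayon's fold/reduce for the parallelism, a single String field searched; text_search and Entity's shape here are simplifications):

use fixedbitset::FixedBitSet;
use rayon::prelude::*;
use regex::Regex;

// Simplified entity: the real struct has a dozen properties.
struct Entity {
    name: String,
}

// Brute-force text search over only the rows already selected by the
// by-value lookups. Each worker scans part of the id space into a
// partial hit bitset; the partials are then OR-ed together.
fn text_search(entities: &[Entity], eligible: &FixedBitSet, re: &Regex) -> FixedBitSet {
    (0..entities.len())
        .into_par_iter()
        .fold(
            || FixedBitSet::with_capacity(entities.len()),
            |mut hits, id| {
                if eligible.contains(id) && re.is_match(&entities[id].name) {
                    hits.insert(id);
                }
                hits
            },
        )
        .reduce(
            || FixedBitSet::with_capacity(entities.len()),
            |mut acc, part| {
                acc.union_with(&part);
                acc
            },
        )
}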

I'm sure this is not a theoretically optimal design, but the idea is to find out how much can be achieved with a comparatively small amount of maintainable code under the constraints of our setup. Compressed bitmaps definitely look like an attractive solution, but I think they'd require rethinking the part of the design that heavily relies on contains() being essentially free.

tower120 commented 6 months ago

BitSet's contains() is fast, but not as fast as FixedBitSet's. It is basically ***ptr.

It's just that all results are lazy, and I still have not implemented an efficient collect() for them... If you can wait 2-3 days, I think I can make it - that would be fairly easy.

You could try it as is, but you'll likely take a performance hit if you call contains often and intersect dozens of sets, because the resulting bit-block mask will be ANDed N times for each contains. That is actually fast, and could sometimes even be faster than collecting the whole BitSet...

I don't understand how your sorting algorithm works, but if your order_index is big, you're not utilizing the compressed bitsets' strength - fast iteration... Wait! Looks like you can intersect order_index with results, and then iterate that:

fn extract_sorted(&self, order_by: &OrderBy, results: BitSet, range: &Range<usize>) -> Vec<Rowid> {
    let order_index: &BitSetSmall = &self.order_indexes[order_by];
    let filtered_result = order_index & &results;
    filtered_result.iter().take(range.end).skip(range.start).collect()
}

Maybe there is an even faster solution... Remember, you can make your own binary operation.


FWIW we are currently targeting x86-64-v3.

Ok, I'll make it as [wide::u64x4; 2] then. I'll let you know when I've made the 512-bit config.

tower120 commented 6 months ago

Ahh, never mind my last suggestion with extract_sorted. Looks like you return sorted RowIds, and order_indexes stores them sorted (not in ascending RowId order).

tower120 commented 6 months ago

Looks like you need the thing I'm currently working on - HiSparseArray. It's like BitSet, but with T at the data level instead of a bit block.

fn extract_sorted(&self, order_by: &OrderBy, results: BitSet, range: &Range<usize>) -> Vec<Rowid> {
    let order_index: &HiSparseArray<RowId> = &self.order_indexes[order_by];
    let filtered_result = order_index & &results;
    let paged = filtered_result & Range(range.start, range.end);
    paged.iter().collect()
}

You can mimic that with an indirection array (not sure about the memory overhead, though) or a HashMap for now:

fn extract_sorted(&self, order_by: &OrderBy, results: BitSet, range: &Range<usize>) -> Vec<Rowid> {
    let order_index: &(BitSet, HashMap<u32, RowId>) = &self.order_indexes[order_by];
    let filtered_result = &order_index.0 & &results;
    let paged = filtered_result & BitSetRange(range.start, range.end);
    paged.iter().map(|index| order_index.1[&index]).collect()
}

tower120 commented 6 months ago

Here's the custom 512-bit implementation - hope it'll work fine: https://github.com/tower120/hi_sparse_bitset/blob/config_512bit/examples/config_512.rs

hniksic commented 6 months ago

Note that I don't plan to use bitmaps for the order index. The order index is pre-created with a more sophisticated variant of:

let name_order: Vec<usize> = (0..data.len()).sorted_by_key(|&ind| &data[ind].name).collect();

So e.g. if you have data = vec![{name: "foo", ...}, {name: "bar", ...}, {name: "baz", ...}], the name index would be name_order = vec![1, 2, 0] - the row ids in name order. Given a collection of rowids, which are just indexes pointing to corresponding elements of data, one would sort it by name using each rowid's rank - the inverse of this permutation - as the sort key. This is reasonably compact, efficient, and easy to create, while allowing the decoupling of lookup and order indexes that our use case needed.
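
A minimal sketch of that two-step construction, assuming the Entity struct from earlier in the thread (build_name_order is an illustrative name):

// Step 1: row ids sorted by name - for names ["foo", "bar", "baz"] this is [1, 2, 0].
// Step 2: invert that permutation so rank[rowid] is directly usable as a sort key.
fn build_name_order(data: &[Entity]) -> Vec<u32> {
    let mut sorted: Vec<u32> = (0..data.len() as u32).collect();
    sorted.sort_unstable_by(|&a, &b| data[a as usize].name.cmp(&data[b as usize].name));
    let mut rank = vec![0u32; data.len()];
    for (pos, &rowid) in sorted.iter().enumerate() {
        rank[rowid as usize] = pos as u32;
    }
    rank
}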

tower120 commented 6 months ago

Hmm... Yes, I see. You'll need an efficient BitSet::from(impl BitSetInterface) in that case. That should actually be quite fast.

Just a note: due to the specifics of how BitSet works, I would recommend, if possible, still keeping a BitSet of the ordered RowIds. Not for ordering reasons, but to reduce the collected BitSet and hence speed up its construction. Like this:

fn extract_sorted(
    &self,
    order_by: &OrderBy,
    // consider that we have not collected the intersection into a BitSet yet,
    // but have something like Reduce<...>
    results: impl BitSetInterface,
    range: &Range<usize>,
) -> Vec<Rowid> {
    let order_index: &(Vec<u32>, BitSet) = &self.order_indexes[order_by];

    // If your `results` already IS a BitSet - skip this and keep your initial implementation.
    // This bunch of intersections is there to minimize the resulting BitSet and speed up its construction.
    let results = BitSet::from(results & BitSetRange(range.start, range.end) & &order_index.1);

    order_index.0.iter()
        .filter(|&&rowid| results.contains(rowid as usize))
        .copied()
        .take(range.end)
        .skip(range.start)
        .collect()
}

By the way, how many sets do you expect to be intersected, and how many contains() calls per extract_sorted do you expect?

tower120 commented 6 months ago

In this branch https://github.com/tower120/hi_sparse_bitset/tree/collect I implemented BitSet::from(impl BitSetInterface), which you'll need to make a BitSet from a lazy intersection in an efficient way:

// make a lazy intersection like this
let intersection = reduce(ops::And, bitsets.iter());
// or like this (slightly faster due to the fixed bitset count)
let intersection = apply(ops::And, &bitset1, &bitset2);
// or like this (same as apply)
let intersection = &bitset1 & &bitset2;

let bitset = BitSet::from(intersection);

Now with 512bit config you should have everything you need for your tests. Let me know if that does not work for you (there is some room for improvement).

hniksic commented 6 months ago

Just wanted to drop you a note that I will likely not be using hi_sparse_bitset at this point, as a combination of FixedBitSet and Vec<u32> (for few ids) is working nicely for me now. Thanks once again for providing the help, I might still get back to your library in the future.

I hope that this issue will still be useful to others who wish to use your library for representing bitsets with more than 2^24 distinct values.

I'd like to take some time to answer your questions about my use case, in case they turn out to be relevant for the design of hi_sparse_bitset, or in case someone finds this in later searches.

By the way, how many sets do you expect to be intersected

The number of sets intersected is usually small - no more than 5 or so, with 1-3 being typical. But some of the bitsets being intersected are themselves the results of unions, and the number of those unions can be in the hundreds. Take, for example, a filter that restricts start_date to a one-year range while also restricting two other properties.

This requires intersecting three bitsets (i.e. performing two intersections), but computing the first bitset requires calculating a union of 365 bitsets, assuming every date has some entities with that date. Currently I do that with code that looks like this:

    fn lookup_ranged(&self, index_name: &str, range: impl RangeBounds<u32>, set: &mut FixedBitSet) {
        match &self.lookup_indexes[index_name] {
            ByValue::FewValues(index) => {
                for (_, rowids) in index.range(range) {
                    *set |= rowids;
                }
            }
            ByValue::ManyValues(index) => {
                set.extend(
                    index
                        .range(range)
                        .flat_map(|(_, rowids)| rowids.as_ref())
                        .map(|rowid| rowid.as_usize()),
                );
            }
        }
    }

This approach requires different code paths depending on whether the index has few or many distinct values. Here "few" means fewer than 100 (an arbitrarily chosen limit), as exemplified by properties like "status"; "many" is the case for "date" (and even more so for something like "datetime", which is not hard to imagine). With few distinct values I can just keep the rowids in a bitset and use |= to collect them, whereas with many values I keep the rowids in Vecs and assemble a bitset from those vecs.
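
For context, a hypothetical definition of the ByValue enum that lookup_ranged matches on (the real Rowid is evidently a newtype, given the rowid.as_usize() call above):

use std::collections::BTreeMap;
use fixedbitset::FixedBitSet;

// Hypothetical Rowid newtype matching the rowid.as_usize() call above.
#[derive(Clone, Copy)]
struct Rowid(u32);
impl Rowid {
    fn as_usize(self) -> usize { self.0 as usize }
}

// Hypothetical shape of the two-variant lookup index.
enum ByValue {
    // fewer than ~100 distinct values: one bitset per value
    FewValues(BTreeMap<u32, FixedBitSet>),
    // many distinct values: a short list of row ids per value
    ManyValues(BTreeMap<u32, Vec<Rowid>>),
}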

This is the area where I hoped roaring or hi_sparse_bitset would simplify the code by allowing the use of the same data structure for indexing regardless of how many values there are. But it turns out that, for now, this is the only place where I do this kind of thing, and in other places it's kind of ok to create "inefficient" FixedBitSets, as even with 30m entities each takes only 3.5MiB of memory, which is both acceptable and fast. Doing the small number of intersections is also pleasantly efficient, as the code likely triggers llvm's autovectorization and ends up being as fast as the underlying memory.

and how many contains() calls per extract_sorted do you expect?

Worst case would be in the millions; a more typical case is in the thousands. This is because the frontend typically requests the first page of 50 or so results. So if the query filter selects e.g. 1% of entities (say 500k out of 50 million), then getting the first 50 means iterating through ~5000 entities by sorted index. If the filters select few entities (fewer than 65k in the current code), the performance of contains() no longer matters, because then I switch to collecting the rowids into a vector and sorting them by their order as determined by the order index.

The worst case by far is having a filter that selects (say) 99% of entities, and then selecting something near the end. This is currently done by going through everything by ordered index, using contains() (on millions of entities) to filter out what doesn't belong there, and returning the requested range. One could of course micro-optimize the case of getting the last page by iterating the index in reverse, but a worst-case scenario would then still exist when requesting results from the middle. Blindingly efficient contains() is perfect for my use case because it keeps the code simple and makes even the worst cases reasonably efficient.

tower120 commented 6 months ago

I see - well good luck then.

If you consider using compressed bitsets in the future, I'd just like to mention that I found a solution for a configurable-depth hierarchical structure - so in future versions you'll be able to configure not only the block width but the level count as well (from 2 to 4 in the current prototype), which would allow configuring the range from 4096 (two 64-bit levels) to 68_719_476_736 (four 512-bit levels).

P.S. I'm surprised that you don't have performance issues with intersection and union on plain bitsets. That issue drove me to hierarchical bitsets in the first place...

hniksic commented 6 months ago

P.S. I'm surprised that you don't have performance issues with intersection and union on plain bitsets. That issue drove me to hierarchical bitsets in the first place...

I think the secret is that the small number of intersections I do just happens to perform fast enough at my data size. And when I need a larger number of unions, I do it by going through ids pre-stored in a vector. That's far from free, but it allows a response time of 4-30ms, which is great for my use case (and far, far better than what we had with a somewhat simple-minded SQL-based approach).

Good luck with the library!