martinus / unordered_dense

A fast & densely stored hashmap and hashset based on robin-hood backward shift deletion
MIT License

RFC: Allow more than 2^32 entries #16

Closed bigerl closed 2 years ago

bigerl commented 2 years ago

Thank you for sharing this hash-map implementation. It's a joy to read through the code.

Is your feature request related to a problem? Please describe. Currently, because Bucket.value_idx is a uint32_t, the map/set can store at most 2^32 entries. That is quite tight; many applications will hit that limit. In Java, where collections have a similar limitation, I run into it regularly.

Describe the solution you'd like The readme says that only 1 byte + 3 bits of Bucket.dist_and_fingerprint are payload. If the remaining 21 bits were used to extend Bucket.value_idx, many more entries could be stored (up to 2^(32+21) entries, i.e. ~64 PB of uint64_t). I would suggest something like:

struct __attribute__((__packed__)) Bucket {
    uint8_t dist : 3;
    uint8_t fingerprint : 8;
    uint64_t value_idx : 53;
};

Describe alternatives you've considered Instead of non-standard attribute packing, standard bit masks and shifts could be used.

martinus commented 2 years ago

Are 2^32 entries really not enough? For a uint64_t key & value plus bucket overhead, filling up all 2^32 entries will require at least 24 * 2^32 ≈ 100 gigabytes of RAM. Is that really too small for your application?

bigerl commented 2 years ago

I agree that it is enough for most applications.

An example of a problematic use case would be storing an index for the nodes in Wikidata. Wikidata contains more than 2^32 distinct nodes, so storing a mapping from a uint64_t ID to some RDF node object wouldn't work.

Philippe91 commented 2 years ago

We can't win on all counts. Extreme cases should not decrease the performance of the common case.

martinus commented 2 years ago

> We can't win on all counts. Extreme cases should not decrease the performance of the common case.

Right, but I still think it would be nice to at least optionally allow such extreme use cases.

@bigerl by the way, allowing only 3 bits for dist is certainly not enough: if dist overflows, the map's behavior is undefined. That's why it is as large as 24 bits.

martinus commented 2 years ago

I have a version that should work for you: https://github.com/martinus/unordered_dense/blob/2022-08-custom-bucket-types/include/ankerl/unordered_dense.h

The bucket type can now be customized. E.g., this is how you can use the big bucket type:

using MapBig = ankerl::unordered_dense::map<std::string,
                                            size_t,
                                            ankerl::unordered_dense::hash<std::string>,
                                            std::equal_to<std::string>,
                                            std::allocator<std::pair<std::string, size_t>>,
                                            ankerl::unordered_dense::bucket_type::big>;

Bucket size will be 12 bytes, and max_size is 2^64 - 1. Would this work for you?

bigerl commented 2 years ago

That works. Thanks for the fast solution. I like the idea of making it configurable.

Just a final note: I would expect the big variant to perform better with Clang than with GCC. In the past I've had several cases where Clang handled structs whose sizes are not multiples of a machine word better than GCC did.