Closed: bigerl closed this issue 2 years ago
Is 2^32 entries really not enough? For a `uint64_t` key and value plus bucket overhead, filling all 2^32 entries will require at least 24 * 2^32 ≈ 100 gigabytes of RAM. Is that really too small for your application?
I agree that it is enough for most applications.
An example of a problematic use case would be storing an index for the nodes in Wikidata. Wikidata contains more than 2^32 distinct nodes, so storing a mapping from a `uint64_t` ID to some RDF node object wouldn't work.
We can't win on all counts. Extreme cases should not decrease the performance of the common case.
Right, but I still think it would be nice to at least optionally allow such extreme use cases.
@bigerl by the way, only allowing 3 bits for `dist` is certainly not enough. If `dist` overflows, the behavior is undefined. That's why it is as large as 24 bits.
I have a version that should work for you: https://github.com/martinus/unordered_dense/blob/2022-08-custom-bucket-types/include/ankerl/unordered_dense.h
The bucket type can now be customized. E.g., this is how you can use the big bucket type:
```cpp
using MapBig = ankerl::unordered_dense::map<std::string,
                                            size_t,
                                            ankerl::unordered_dense::hash<std::string>,
                                            std::equal_to<std::string>,
                                            std::allocator<std::pair<std::string, size_t>>,
                                            ankerl::unordered_dense::bucket_type::big>;
```
Bucket size will be 12 bytes, and max_size is 2^64 - 1. Would this work for you?
That works. Thanks for the fast solution. I like the idea of making it configurable.
Just a final note: I would expect the big variant to perform better with clang than with gcc. In the past I have had several cases where clang handled structs whose sizes are not multiples of a machine word better.
Thank you for sharing this hash-map implementation. It's a joy to read through the code.
**Is your feature request related to a problem? Please describe.**
Currently, because `Bucket.value_idx` is a `uint32_t`, the map/set can store at most 2^32 entries. That is quite tight; many applications will hit that limit. In Java, where collections have a similar limitation, I run into it regularly.

**Describe the solution you'd like**
The readme says that only 1 byte + 3 bits of `Bucket.dist_and_fingerprint` are payload. If the remaining 21 bits were used to extend `Bucket.value_idx`, many more entries could be stored (up to 2^(32+21), ~64 PB of `uint64_t`). I would suggest something like:

**Describe alternatives you've considered**
Instead of non-standard attribute packing, standard bit masks and shifts can be used.