jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License
190 stars 25 forks source link

"std::runtime_error ... using too many buckets: change bucket_id_type to uint64_t or use a smaller " with c >= 5 and 34*10^9 keys #16

Closed zacchiro closed 10 months ago

zacchiro commented 10 months ago

Hello, I'm trying to use pthash on ~34*10^9 strings (~50 bytes each) and I'm hitting the runtime error in title when using the main executable like this:

$ ./build --minimal -c 5 -a 0.99 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 2048 --verbose --check --lookup
2023-11-25 16:39:19: construction starts
terminate called after throwing an instance of 'std::runtime_error'
  what():  using too many buckets: change bucket_id_type to uint64_t or use a smaller c

It happens for c >= 5, but not with c = 4.

Does one need to recompile with the suggested type change to use higher c on such an input set?

Thanks a lot for pthash, it's amazing :-)

zacchiro commented 10 months ago

With the following oneliner change I could indeed start a PTHash search with larger values of c for this dataset:

diff --git a/include/builders/util.hpp b/include/builders/util.hpp
index 7045492..c994253 100644
--- a/include/builders/util.hpp
+++ b/include/builders/util.hpp
@@ -8,7 +8,7 @@

 namespace pthash {

-typedef uint32_t bucket_id_type;
+typedef uint64_t bucket_id_type;
 typedef uint8_t bucket_size_type;
 constexpr bucket_size_type MAX_BUCKET_SIZE = 100;

Would it be possible to make this compile-time configurable, rather than relying on users to patch the code?

jermp commented 10 months ago

Hi @zacchiro, thank you for your interest in PTHash and your kind words. My pleasure to assist you.

Using a different bucket_id_type is indeed the way to go: that's why I used a typedef there :) Your suggestion of making it compile-configurable makes perfect sense. I expect this to be used by experienced users.

You might also want to try a partitioned construction, by specifying the parameter -p [num_partitions]. I would do it as follows:

$ ./build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 128 -p 3500 --verbose --check --lookup
jermp commented 10 months ago

Done, as of e3a6b3fde2a70cc86e8015d7b86026d5b619cb81. Thanks!

Closing this. Feel free to drop me a line (giulioermanno.pibiri@unive.it) or open a new issue if needed.

Best, -Giulio

zacchiro commented 10 months ago

great, thanks!