jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License

pthash gets stuck while creating MPHF #14

Closed khodor14 closed 9 months ago

khodor14 commented 9 months ago

Hello again @jermp

What could be the reason for pthash getting stuck while creating an MPHF (see the log below)?

```
c = 6
alpha = 0.94
num_keys = 19
table_size = 20
num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts
```

It could be the configuration for a small number of keys. If that is the case, what do you recommend? Thanks

jermp commented 9 months ago

Hello, before answering, why are you tagging @ByteHamster in this project? :D Surely, you do not want to bother him with PTHash.

Anyway, you should never build an MPHF for 19 keys. Why would you need it? MPHFs are meant for at least hundreds of thousands of keys.

khodor14 commented 9 months ago

Sorry for that but it was recommended while writing the issue.

Anyway, I'm separating k-mers into buckets and building one MPHF per bucket. I happened to get some small buckets, and in one of these cases pthash got stuck.

jermp commented 9 months ago

If you need an MPHF over k-mers, you may want to consider this: https://github.com/jermp/lphash -- LPHash. Or probably you're using too many buckets; try with fewer buckets. In order to obtain 2-3 bits/key, there is a required minimum size to amortize the space overhead of the MPHF (e.g., a std::vector holds at least 8 bytes of overhead, for the size, and this alone is enough to add 64/19 = 3.37 bits/key of overhead).

khodor14 commented 9 months ago

Thank you a lot, and sorry for bothering you.

jermp commented 9 months ago

No worries at all, you're very welcome! Feel free to tell me what you're trying to do, so that I can better advise.

Best, -Giulio

khodor14 commented 8 months ago

@jermp Hello again,

Back to the number of buckets as a parameter of pthash: do you recommend a specific value in terms of the number of keys, especially when the number of keys is small (e.g., 10,000 keys or fewer)?

Thank you

jermp commented 8 months ago

Hi @khodor14, if the number of keys is small, you do not need to split the set into buckets, no? Why would you do that? Building several MPHFs, one per bucket, is useful only when you have very large key sets (e.g., billions of keys) and you want to scale up with multiple threads. In that case, I would recommend choosing the number of buckets as n / 1000000 or something similar.

Let me know if that makes sense to you. Best, -Giulio

khodor14 commented 8 months ago

@jermp What I mean by the number of buckets is the pthash parameter.

jermp commented 8 months ago

> Anyway, I'm separating k-mers into buckets and building one MPHF per bucket. I happened to get some small buckets, and in one of these cases pthash got stuck.

I thought you were referring to your use case above.

PTHash does not accept a parameter controlling the exact number of buckets to use. However, the number of buckets can be controlled indirectly through the parameter c. For small key sets, you probably want to stick with a small c, say c = 4 or c = 5. But that should be determined experimentally for your application.