khodor14 closed this issue 9 months ago
Hello, before answering, why are you tagging @ByteHamster in this project? :D Surely, you do not want to bother him with PTHash.
Anyway, you should never build an MPHF for just 19 keys. Why would you need one? MPHFs are meant for at least hundreds of thousands of keys.
Sorry about that, but it was recommended while I was writing the issue.
Anyway, I'm separating k-mers into buckets and building one MPHF per bucket. I happened to get some small buckets, and in one of these cases PTHash got stuck.
If you need a MPHF over kmers, you may want to consider this: https://github.com/jermp/lphash -- LPHash.
Or perhaps you're using too many buckets; try fewer. To obtain 2-3 bits/key, there is a minimal input size required to amortize the fixed space overhead of the MPHF (e.g., a `std::vector` carries at least 8 bytes of overhead just to store its size, and that alone adds 64/19 = 3.37 bits/key of overhead).
Thank you a lot, and sorry for bothering you.
No worries at all, you're very welcome! Feel free to tell me what you're trying to do, so that I can better advise.
Best, -Giulio
@jermp Hello again,
Back to the number of buckets as a parameter of PTHash: do you recommend a specific value in terms of the number of keys, especially when the number of keys is small (e.g., 10000 keys or fewer)?
Thank you
Hi @khodor14, if the number of keys is small, you do not need to split the set into buckets, no? Why would you do that? Building several MPHFs, one per bucket, is useful only when you have very large key sets (e.g., billions of keys) and you want to scale up with multiple threads. In that case, I would recommend choosing the number of buckets as n / 1000000 or something similar.
Let me know if that makes sense to you. Best, -Giulio
@jermp What I mean by the number of buckets is the PTHash parameter.
> Anyway, I'm separating k-mers into buckets and building one MPHF per bucket. I happened to get some small buckets, and in one of these cases PTHash got stuck.
I thought you were referring to your use case above.
PTHash does not accept a parameter controlling the exact number of buckets to use. However, the number of buckets can be controlled through the parameter c. For small key sets, you probably want to stick with a small c, say c = 4 or c = 5. But that should be determined experimentally for your application.
Hello again @jermp
What could be the reason for PTHash to get stuck while creating an MPHF (see the log below)?
```
c = 6
alpha = 0.94
num_keys = 19
table_size = 20
num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts
```
It could be the configuration for a small number of keys. If that is the case, what do you recommend? Thanks