I'm trying to build maps using `build`, but the process of choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried structure names by trial and error (e.g., `elias_fano`), but, for example, `compact_compact` doesn't work. The possible values of `-e` should be documented.
Hi Sebastiano,
the current possible values are listed here in the README: https://github.com/jermp/pthash#enable-all-encoders, i.e., `partitioned_compact`, `dictionary_dictionary`, and `elias_fano`. These three are a subset of those we benchmarked in the paper. To use all possible encoders, just recompile with `cmake .. -D PTHASH_ENABLE_ALL_ENCODERS=On`.
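For example, assuming the usual out-of-source CMake workflow (a sketch; adjust paths to your checkout):

```bash
# From the build directory of the pthash checkout:
cmake .. -D PTHASH_ENABLE_ALL_ENCODERS=On
make -j
```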
Anyway, I'm also updating the help output of the `build` command as you suggested. Thanks.
I see, I missed the README part about that because I read the build instructions and just built the thing. I think it would be very useful if a binary compiled without `PTHASH_ENABLE_ALL_ENCODERS` wrote in the help text, after the basic encoders, something like "For more encoders, recompile with `PTHASH_ENABLE_ALL_ENCODERS=On`".
Yes, indeed, I'm adding this hint to the `build` command's help text. Thank you again. If you have other questions, let me know!
-Giulio
Well, we are shopping for MPHF structures for 30B keys. I love the amazing access speed, but could you suggest, given your knowledge, which would be the best choice of parameters to get a reasonable construction time? I see that nonlinearity kicks in once you pass 1B keys.
One possibility would be to reimplement this using chunks, as you suggest in the paper, which would make things much faster by 1) reducing the effect of nonlinearity and 2) enabling parallel construction. At that point, however, you'd have another memory access (for the chunk delimiter) on which the other two would be dependent, and it is difficult to predict how that would impact performance.
Another question is: does `build` load the keys in memory? It appears to do so, as I can't build an MPHF with a file of 1B 40-byte keys.
> Well, we are shopping for MPHF structures for 30B keys. I love the amazing access speed, but could you suggest, given your knowledge, which would be the best choice of parameters to get a reasonable construction time? I see that nonlinearity kicks in once you pass 1B keys.
Oh, 30B! Nice. We tried up to 7.3B strings (3-grams from Google Books). You can let c scale with log(n) to get expected linear-time construction. (I'm assuming n is the number of keys.) For example, let c = alpha/3 * log(n), for some alpha < 1.
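As a rough back-of-the-envelope sketch (assuming the natural logarithm and picking, say, alpha = 0.9, both of which are just illustrative assumptions), for your 30B keys:

$$
c = \frac{\alpha}{3}\,\ln n = \frac{0.9}{3}\,\ln\!\left(3\times10^{10}\right) \approx 0.3 \times 24.1 \approx 7.2
$$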
> One possibility would be to reimplement this using chunks, as you suggest in the paper, which would make things much faster by 1) reducing the effect of nonlinearity and 2) enabling parallel construction. At that point, however, you'd have another memory access (for the chunk delimiter) on which the other two would be dependent, and it is difficult to predict how that would impact performance.
You can also build a "partitioned" PTHash function. See the `-p` option, which governs the number of partitions to use. In my opinion, one should fix a reasonable partition size, say B = 3M keys, and let p = n/B. The partitioned version scales much better, also thanks to parallelization, as you pointed out.
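For instance, with the 30B keys you mention and B = 3M keys per partition, that would give:

$$
p = \frac{n}{B} = \frac{3\times10^{10}}{3\times10^{6}} = 10^{4} \text{ partitions.}
$$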
> Another question is: does `build` load the keys in memory? It appears to do so, as I can't build an MPHF with a file of 1B 40-byte keys.
Just build the function using the `--external` flag, which will use the external-memory construction.
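Putting the two together, a hypothetical invocation could look like the sketch below. Only `-p`, `-t`, and `--external` are taken from this thread; the remaining arguments, elided as `...`, depend on your input and desired configuration:

```bash
# Hypothetical sketch: 30B keys, p = n/B = 10000 partitions,
# external-memory construction, and 20 threads (assuming -t sets the
# number of threads, as in the run reported later in this thread).
# The elided arguments (input file, number of keys, c, encoder, ...)
# are placeholders, not actual defaults.
./build ... -p 10000 -t 20 --external
```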
(We have the external-memory / parallel construction tested in another paper, which I can share privately if you're interested.)
> You can also build a "partitioned" PTHash function. See the `-p` option, which governs the number of partitions to use. In my opinion, one should fix a reasonable partition size, say B = 3M keys, and let p = n/B. The partitioned version scales much better, also thanks to parallelization, as you pointed out.
Oh, so it's already implemented 😂. I should have read the paper more carefully. Do you have an idea of the impact on performance?
> Just build the function using the `--external` flag, which will use the external-memory construction.
Will do.
> (We have the external-memory / parallel construction tested in another paper, which I can share privately if you're interested.)
Yes, I'd love that.
> Oh, so it's already implemented 😂. I should have read the paper more carefully. Do you have an idea of the impact on performance?
Already implemented, yes, but it is not in the SIGIR 2021 paper. I'm sending you an email with the new paper.
Ok, I was really feeling stupid, as I was sure I had read that it would have been nice to implement. Now everything makes sense.
Ahah, no problem at all! I sent you the paper. Let me know if I can assist you further.
Man, 1B keys, `-p 1000 -t 20`, 35s construction, 82 ns/key lookup. This is amazing.
You definitely have to define profiles, either in the documentation or, better, in the build tool (like `--fast`, `--small`, etc.), as without interacting with you I would never have gotten here. There are just too many options. If you use it out of the box, you don't get the good results.
Now I just have to port this thing to Rust 😂.
Thank you, Sebastiano! Yes, you're right. We will point to some explicit configurations then.
Rust... I'm reading "The Book" at the moment, and I definitely want to port PTHash to Rust, as well as SSHash, LPHash, and Fulgor!
Also, see these examples in the README for more configurations: https://github.com/jermp/pthash#example-2.
(Done as of `f3efac63044ac0d8763bc5dad549cfda39b7c430`.)