jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License

Encoders must be documented in the help text #11

Closed. vigna closed this issue 1 year ago.

vigna commented 1 year ago

I'm trying to build maps using build but the process of choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried structure names by trial and error (e.g., elias_fano), but, for example, compact_compact doesn't work. The possible values of -e should be documented.

jermp commented 1 year ago

Hi Sebastiano, the current possible values are listed here in the readme: https://github.com/jermp/pthash#enable-all-encoders, i.e., partitioned_compact, dictionary_dictionary, and elias_fano. These three are a subset of those we benchmarked in the paper. To use all possible encoders, just recompile with cmake .. -D PTHASH_ENABLE_ALL_ENCODERS=On.

Anyway, I'm also updating the help output of the build command as you suggested. Thanks.

vigna commented 1 year ago

I see, I missed that part of the README because I read the build instructions and just built the thing. I think it would be very useful if a binary compiled without PTHASH_ENABLE_ALL_ENCODERS printed, in the help text after the basic encoders, something like "For more encoders, recompile with PTHASH_ENABLE_ALL_ENCODERS=On".

jermp commented 1 year ago

Yes, indeed, I'm adding this note to the help text of the build command. Thank you again. If you have other questions, let me know!

-Giulio

vigna commented 1 year ago

Well, we are shopping for MPHF structures for 30B keys. I love the amazing access speed, but could you suggest, given your experience, the best choice of parameters to get a reasonable construction time? I see that nonlinearity kicks in once you pass 1B keys.

One possibility would be to reimplement this using chunks, as you suggest in the paper, which would make things much faster by 1) reducing the effect of nonlinearity and 2) allowing parallel construction. At that point, however, you'd have another memory access (for the chunk delimiter) on which the other two would depend, and it is difficult to predict how that would impact performance.

vigna commented 1 year ago

Another question is: does build load the keys in memory? It appears to do so, as I can't build an MPHF from a file of 1B 40-byte keys.

jermp commented 1 year ago

> Well, we are shopping for MPHF structures for 30B keys. I love the amazing access speed, but could you suggest, given your experience, the best choice of parameters to get a reasonable construction time? I see that nonlinearity kicks in once you pass 1B keys.

Oh, 30B! Nice. We tried up to 7.3B strings (3-grams from Google Books). You can let c scale with log(n) to have expected linear-time construction. (I'm assuming n is the number of keys.) For example, let c = alpha/3 * log(n), for some alpha < 1.
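For concreteness, here is a small C++ sketch of that heuristic (just an illustration, not part of the PTHash API; it assumes log is the natural logarithm, alpha = 0.9 is only an example value, and the key counts are the ones mentioned in this thread):

```cpp
// Sketch of the c = alpha/3 * log(n) heuristic; not part of the PTHash API.
// Assumes log(n) is the natural logarithm and alpha < 1 is chosen by the user.
#include <cmath>
#include <iostream>

int main() {
    const double alpha = 0.9;  // example value; any constant < 1
    for (const auto n : {1'000'000'000ULL, 30'000'000'000ULL}) {
        const double c = alpha / 3.0 * std::log(static_cast<double>(n));
        std::cout << "n = " << n << " -> c ~ " << c << '\n';
        // n = 1e9  -> c ~ 6.2;  n = 30e9 -> c ~ 7.2
    }
    return 0;
}
```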

> One possibility would be to reimplement this using chunks, as you suggest in the paper, which would make things much faster by 1) reducing the effect of nonlinearity and 2) allowing parallel construction. At that point, however, you'd have another memory access (for the chunk delimiter) on which the other two would depend, and it is difficult to predict how that would impact performance.

You can also build a "partitioned" PTHash function. See the -p option which governs the number of partitions to use. In my opinion one should fix a reasonable partition size, say B=3M keys, and let p = n/B. The partitioned version scales much better, also due to parallelization, as you pointed out.
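Just to spell out the arithmetic behind that rule of thumb (B = 3M is only a suggested partition size, not a default; the sketch below is mine, not part of PTHash):

```cpp
// Sketch: derive the -p value from a target partition size; not part of PTHash.
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t n = 30'000'000'000ULL;  // total keys (the 30B from this thread)
    const std::uint64_t B = 3'000'000ULL;       // target keys per partition (~3M)
    const std::uint64_t p = (n + B - 1) / B;    // ceil(n / B)
    std::cout << "use -p " << p << '\n';        // prints: use -p 10000
    return 0;
}
```

With n = 1B and the same B, the same formula gives roughly 334 partitions.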

> Another question is: does build load the keys in memory? It appears to do so, as I can't build an MPHF from a file of 1B 40-byte keys.

Just build the function using the --external flag, which enables the external-memory construction.

(We have the external-memory / parallel construction tested in another paper, which I can share privately if you're interested.)

vigna commented 1 year ago

> You can also build a "partitioned" PTHash function. See the -p option which governs the number of partitions to use. In my opinion one should fix a reasonable partition size, say B=3M keys, and let p = n/B. The partitioned version scales much better, also due to parallelization, as you pointed out.

Oh, so it's already implemented 😂. I should have read the paper more carefully. Do you have an idea of the impact on performance?

> Just build the function using the --external flag, which enables the external-memory construction.

Will do.

> (We have the external-memory / parallel construction tested in another paper, which I can share privately if you're interested.)

Yes, I'd love that.

jermp commented 1 year ago

> Oh, so it's already implemented 😂. I should have read the paper more carefully. Do you have an idea of the impact on performance?

Already implemented, yes, but it is not in the SIGIR 2021 paper. I'm sending you an email with the new paper.

vigna commented 1 year ago

Ok, I was really feeling stupid, as I was sure I had read that it would have been nice to implement. Now everything makes sense.

jermp commented 1 year ago

Ahah, no problem at all! I sent you the paper. Let me know if I can assist you further.

vigna commented 1 year ago

Man, 1B keys, -p 1000 -t 20, 35s construction, 82 ns/key lookup. This is amazing.

You definitely have to define profiles, either in the documentation or, better, in the build tool itself (like --fast, --small, etc.), because without interacting with you I would never have gotten here. There are just too many options, and if you use it out of the box you don't get good results.

Now I just have to port this thing to Rust 😂.

jermp commented 1 year ago

Thank you Sebastiano! Yes, you're right. We will point to some explicit configurations then.

Rust...I'm reading "The Book" at the moment and I definitely want to port PTHash to Rust, as well as SSHash, LPHash and Fulgor!

jermp commented 1 year ago

Also, see these examples https://github.com/jermp/pthash#example-2 in the readme for more configurations.

jermp commented 1 year ago

(Done as of f3efac63044ac0d8763bc5dad549cfda39b7c430.)