jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License
179 stars 25 forks source link

Add method internal_memory_builder_partitioned_phf::build_from_hashes #55

Closed progval closed 2 months ago

progval commented 2 months ago

This mirrors internal_memory_builder_single_phf::build_from_hashes

jermp commented 2 months ago

Also this one is not actually needed...

PS. Are you using PTHash for some of your projects? I'm curious.

progval commented 2 months ago

I'm working with @vigna on porting https://docs.softwareheritage.org/devel/swh-graph/ to Rust and we're considering switching from GOV and GOV3 to PTHash. So right now I'm writing Rust bindings for PTHash here: https://gitlab.softwareheritage.org/swh/devel/pthash-rs

The reason I'm interested in build_from_hashes is that I effectively can't call templated C++ functions from Rust, except through a function with a concrete type. This prevents me from giving users of the Rust crate the ability to pass me arbitrary key iterators I could pass directly to internal_memory_builder_partitioned_phf::build_from_keys.

So instead I build an array of hashes (which is usually smaller than an array of keys, so it's okay-ish to keep it in RAM) and pass it to to a concrete-ized internal_memory_builder_partitioned_phf::build_from_hashes, like this one: https://gitlab.softwareheritage.org/swh/devel/pthash-rs/-/blob/b28d91baf80a15998e8b10ecda41e1b7d716d26e/src/backends.rs#L57-62

I might add a few more concrete-izations later, eg. to store the hashes on disk, but I'm limited to only providing a handful of them because I have to explicitly list them in the binding

jermp commented 2 months ago

Oh that makes a lot of sense then. I'll merge the PR.

You might also be interested in PTHash-PHOBIC: https://github.com/jermp/pthash/tree/phobic. PHOBIC adds some optimizations to PTHash regarding space usage and construction under space-efficient configurations. The pre-print is forthcoming.

Thanks!