jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License
190 stars 25 forks source link

C API #12

Closed ozgrakkurt closed 10 months ago

ozgrakkurt commented 1 year ago

Hello!

Very nice project and research (I don't understand fully but seems very interesting), I wanted to port this to Rust but seems like it will be a lot of work.

I though might be easier if it had a c api so I can call it with ffi from rust.

Is this possible to do?

I can work on it if you give me some direction.

jermp commented 1 year ago

Hello, and thank you for the kind words!

A Rust porting is something I have consider writing but I first have to learn Rust, so it would not appear soon unless someone else is going to do it.

We can work though on a C api, yes, assuming you only need the signature of the functions to be in pure C but the function itself will then call the "read" C++ function as written in PTHash.

Best, -Giulio

ozgrakkurt commented 1 year ago

makes sense, trying to add c header file. Also want to write rust port

jermp commented 1 year ago

Thank you! There are several people (including me, of course) who are interested in porting PTHash to Rust. I can connect you with them if interested and work together.

Best, -Giulio

snizovtsev commented 1 year ago

The construction algorithm is actually pretty simple. I have recently rewrote it from scratch for my (abandoned) project. Instead of C API it's better to just write a whole algorithm in Rust. I've made this for Apache Arrow C++ and found it is simple enough to fit into a few steps:

1) Hash key columns into 2 vectors of size N: hash1(k), hash2(k); 2) Compute a load factor for each bucket: count[skew_bucketer(hash1_i)]++;; 3) Group buckets by descending count, i.e. make a nested list of indices [[b1,b2,b3], [b4], ... [bm]] where count[b1]=count[b2] and so on (you can encode a nested list as 2 flat vectors); 4) Reorder hash2 vector according to bucket bins: [[hash2 elements for b1], [hash2 elements for b2], ..., [hash2 elements for bm]]. This is another List[List[UInt64]] that could be encoded as 2 flat vectors; 5) Now iterate that bins and pick a pilot value for each of them. That's it.

jermp commented 1 year ago

@snizovtsev I’m glad you found PTHash useful and easy to implement! I would be grateful if you could point out for what application you used PTHash. Thanks! -Giulio