vigna / sux

Succinct data structures in C/C++
Other
82 stars 17 forks source link

Are they any bindings for Golang? #10

Closed AlexeyAkhunov closed 3 years ago

AlexeyAkhunov commented 3 years ago

I would like to use this library (specifically RecSplit for creating MPHFs) for some experiments in our Erigon project (if you are interested I can describe the possible use case): https://github.com/ledgerwatch/erigon But I do need Golang binding to do so. I assume there are no bindings yet, and I was thinking about creating a bounty for someone to develop it.

AlexeyAkhunov commented 3 years ago

Actually, after code review and experiments, I realised we would need to modify the implementation quite a bit:

Firstly, we cannot assume that apply non-cryptographic hash function to the keys and effectively taking low 64-bit (this is what I discovered the implementation really does) as a fingerprint, would work. Because it would be easy to break such algorithm by creating collision in such fingeprints (as we are operating in adversarial environment). Secondly, we need to handle the case where all keys do not fit in RAM. But in principle the approach should work.

vigna commented 3 years ago

So, I'm reading this just now. It should be easy to replace the hash function with a crypto-level one. For keys not fitting in RAM do you mean the hashes or the keys themselves? There's no need for the keys to be in RAM. One can even sort the hashes offline, but that's more work (stxxl?).

AlexeyAkhunov commented 3 years ago

@vigna thanks for replying. I have started re-implementing it in golang. We have what we call ETL (Extract Transform Load) framework in our library to be able to sort keys by bucket numbers "offline" (potentially using temporary files). My main issue was to make sure we do not generate duplicates when applying initial hash (first_hash in your code) to the keys. So instead I am planning to bring full-sized keys to the procedure for each bucket, instead of starting with uint64 fingerprints: https://github.com/ledgerwatch/erigon-lib/pull/67

vigna commented 3 years ago

But wait, first_hash is made of 128 bits. The probability of collision is entirely negligible. I don't understand your point about 64-bit fingerprints.

AlexeyAkhunov commented 3 years ago

But wait, first_hash is made of 128 bits. The probability of collision is entirely negligible. I don't understand your point about 64-bit fingerprints.

I meant that when the bucket of keys arrive into the recSplit function, it is vector<uint64>, so it is more chance of having collisions. And we cannot rely on low probability of collisions in our use case, because someone could try to find collision and introduce it just to stop our system from working. That is why I am trying to bring full sized keys into the equivalent of recSplit function, because it then deals with the collisions by definition (by finding the functions that does not give collisions)

vigna commented 3 years ago

But the bucket is maximum ≈2000 keys. The probability of collision of 2000 64-bit keys is still entirely negligible. If the first function is crypto, the keys are randomly chosen.