prashnts / pybloomfiltermmap3

Fast Python Bloom Filter using Mmap
https://github.com/prashnts/pybloomfiltermmap3
MIT License
126 stars 22 forks source link

Add xxHash hashing algorithm #15

Open mizvyt opened 4 years ago

mizvyt commented 4 years ago

Requested here a while ago:

Currently the fastest hashing algorithm. Added as default, but can be changed if needed.

@prashnts

mizvyt commented 4 years ago

Test issue re: https://github.com/prashnts/pybloomfiltermmap3/pull/11#issuecomment-546195573

prashnts commented 4 years ago

I'd avoid changing defaults without bumping to a major version, since it can break persistent filters for existing users.

Curiously, xxhash is licensed under BSD 2-Clause. I don't know if it lets us release under current MIT license. Do you have any idea?

Regarding new xxhash, I noticed that it is also available as a on pypi -- and I'm wondering if it's possible to add it as a dependency instead. What do you think?

Also wondering if there's any real performance benefit with it -- did you see any benchmark?

mizvyt commented 4 years ago

Curiously, xxhash is licensed under BSD 2-Clause. I don't know if it lets us release under current MIT license. Do you have any idea?

Shouldn't be an issue. It's similar to MIT license, so it doesn't implicate anything to this library, only prerequisite is to contain a copyright notice for xxHash, which is already customized and prepended to each file.

Also wondering if there's any real performance benefit with it -- did you see any benchmark? Benchmarks for xxHash are here for some reference systems: https://github.com/Cyan4973/xxHash/blob/dev/README.md

I'm assuming since this library uses the same interface and procedure for each hashing algorithm, there shouldn't be an issue to test it and we should see an improvement, albeit I haven't created any benchmarks for it...

I'd avoid changing defaults without bumping to a major version, since it can break persistent filters for existing users.

You're right. I think a better way would be to use it as a Python package extension (along with the other two hashing algorithms), such that when we want to install pybloomfilter with xxHash, we do pip install pybloomfilter[xxhash].

Regarding new xxhash, I noticed that it is also available as a on pypi -- and I'm wondering if it's possible to add it as a dependency instead. What do you think?

I think it's just a Python wrapper for the library, rather than the actual library? In the repo, they use the xxHash repo as a submodule: https://github.com/ifduyue/python-xxhash/tree/master/deps

I pulled the xxHash code for the latest release (0.7.2), but I agree it would be better to have some dependency management. Of course, same goes for other hashing algorithms, but let's keep it out of scope for now.

I'll stash this PR for now? Will keep it in my branch and rebase with other changes, and give a thought on a more elegant solution.

prashnts commented 4 years ago

I haven't created any benchmarks for it...

Been there, done that! I think it's a nice hashing method from what I saw in the code (Once upon a time I ported city-hash in Python (link) so it'd be nice to have "pluggable" hash methods).

Regarding submodule: It usually complicates stuff (eg. if you forget --recursive flag while cloning). Given that it's not particularly huge code, I think copying it here is better, since we can also adapt them if required.

Let's keep it open and once you're sure it doesn't break things from your branch, we can work to get new release out. Sounds good?

mizvyt commented 4 years ago

Been there, done that! I think it's a nice hashing method from what I saw in the code (Once upon a time I ported city-hash in Python (link) so it'd be nice to have "pluggable" hash methods).

Let's keep it open and once you're sure it doesn't break things from your branch, we can work to get new release out. Sounds good?

Definitely!

prashnts commented 4 years ago

How confident are we that this is correct ? xD