feat: add Python bindings

ogxd / gxhash

The fastest hashing algorithm 📈

https://docs.rs/gxhash

MIT License

799 stars 27 forks source link

feat: add Python bindings #108

Open winstxnhdw opened 2 weeks ago

winstxnhdw commented 2 weeks ago

Summary

This PR adds Python bindings with type hints for gxhash32, gxhash64 and gxhash128. Only cargo is required. Currently, it's able to hash a 5 GB file in 0.7s.

Closes #97.

Demo

pip install "gxhash @ git+https://git@github.com/winstxnhdw/gxhash.git#subdirectory=py-gxhash"

from gxhash import gxhash32, gxhash32_nogil, gxhash64, gxhash64_nogil, gxhash128, gxhash128_nogil

gxhash32("hello".encode(), 0)
gxhash32_nogil("hello".encode(), 0)
gxhash64("hello".encode(), 0)
gxhash64_nogil("hello".encode(), 0)
gxhash128("hello".encode(), 0)
gxhash128_nogil("hello".encode(), 0)

Todo

[ ] Upload wheels to pypi
[ ] CI for compiling wheels
[x] Release GIL
[ ] Docs explaining the performance implications for each hash variant

winstxnhdw commented 2 weeks ago

Should we consider creating a function with their GIL-less counterpart like gxhash32_nogil? Although dropping the GIL is effectively just flipping a boolean variable, our single-threaded users may think that it's an unnecessary performance overhead on them.

But more importantly, is gxhash even thread-safe?

winstxnhdw commented 2 weeks ago

Perhaps we should also consider moving the py-gxhash directory into ffi/python and the existing C FFI to ffi/c.

ogxd commented 1 week ago

Hello @winstxnhdw,

I tried installing py-gxhash via pip using your command but I get this error:

ERROR: Could not find a version that satisfies the requirement version-subpkg (unavailable) (from versions: none)
ERROR: No matching distribution found for version-subpkg (unavailable)

Am I missing something? I am using python 3.13

But more importantly, is gxhash even thread-safe?

Gxhash has no shared or global state. Any state is internal to the method scope (thus thread local). The input data is passed by reference, so it must not be mutated while gxhash is reading from it, or the produced hash will be undefined, but it won't crash nor hang. You can indeed drop the GIL if you wish.

winstxnhdw commented 1 week ago

Am I missing something? I am using python 3.13

Looks like your shell is not escaping the URL correctly. I have updated the installation instructions. This one should work.

pip install "gxhash @ git+https://git@github.com/winstxnhdw/gxhash.git#subdirectory=py-gxhash"

winstxnhdw commented 1 week ago

I noticed that the docs uses the specific phrase arbitrary stream of bytes, but this is not technically correct, isit? None of the gxhash* functions can accept a stream.

winstxnhdw commented 1 week ago

The current async function is more expensive than running it synchronously even for 5 GB files. Using nogil + Python threads is more than 20x faster.

Perhaps instead of spawning a tokio thread, we should pass a Python thread to Rust instead.

Also, it's really interesting that repeated calls are many magnitudes faster than the first call. I guess this is the cache being populated and then hit. With so much going on in Python, I always assumed that the cache would be evicted quickly.

ogxd commented 1 week ago

Looks like your shell is not escaping the URL correctly. I have updated the installation instructions. This one should work.

Indeed, it almost did! Thank you :).
One remaining issue is that using the hybrid flag does not build on ARM (I am on a Macbook M1) because this feature is reserved for x86 + nightly rust. I changed the behavior to ignore the flag in such cases instead of not building. I also used the no_std crate since Hasher and other stuff is unnecessary here. Feel free to copy or cherry pick.

The current async function is more expensive than running it synchronously even for 5 GB files. Using nogil + Python threads is more than 20x faster.

Perhaps instead of spawning a tokio thread, we should pass a Python thread to Rust instead.

I don't think it makes a lot of sense to expose an async version of gxhash. Async / await is usually reserved for network-bound or sometimes even disk-bound operation so that a thread is not frozen while waiting for the response. Here gxhash is purely CPU-bound. Suppose in a given context the hashing time is substantial enough that you think one may want to execute it asynchronously. In that case, it also means there is a substantial amount of CPU work delegated to a threadpool thread (from tokio or python), which is not something you want (it may cause threadpool starvation). In such case what you want is delegate the work to a dedicated background thread, and you wouldn't use async / await (instead invoke thread and send / consume work via queue for instance).

I noticed that the docs uses the specific phrase arbitrary stream of bytes, but this is not technically correct, isit? None of the gxhash* functions can accept a stream.

It is correct, however this documentation is on the Hasher trait implementation which is a special Rust construct to build a hash pieces by pieces. The mention of arbitrary stream of bytes is actually a copy paste from the official documentation of that trait: https://doc.rust-lang.org/std/hash/trait.Hasher.html

This PR adds Python bindings with type hints for gxhash32, gxhash64 and gxhash128. Only cargo is required. Currently, it's able to hash a 5 GB file in 0.7s.

It's really awesome! I was surprised to see how little it would require locally in order to make this work, congrats! At this point I am wondering if the future of this belongs to a subfolder (like you did) or a dedicated repository, the reasons being:

We also have implements or bindings for other languages
- There is a full C# implementation
- I have pieces of a C implementation somewhere, and I received feedback from people willing to work on this.
- We are about to have python bindings thanks to your work
Having a dedicated repository for the python bindings allows us to make a README dedicated to python users (with less Rust stuff) and we also keep the rust implementation repository free from Python stuff which most Rust users may not be looking for.
It also simplifies CI
It's easier regarding ownership (you can become one of the maintainers of the gxhash-py if you wish)

If you like this second option please let me know. That would imply moving transferring this repo to a "gxhash" organization and name it gxhash-rs, transfer gxhash-csharp and then finally gxhash-py.

winstxnhdw commented 1 week ago

One remaining issue is that using the hybrid flag does not build on ARM (I am on a Macbook M1) because this feature is reserved for x86 + nightly rust. I changed the behavior to ignore the flag in such cases instead of not building. I also used the no_std crate since Hasher and other stuff is unnecessary here.

Ah, right. I'll be sure to handle that.

I don't think it makes a lot of sense to expose an async version of gxhash. Async / await is usually reserved for network-bound or sometimes even disk-bound operation so that a thread is not frozen while waiting for the response. Here gxhash is purely CPU-bound. Suppose in a given context the hashing time is substantial enough that you think one may want to execute it asynchronously. In that case, it also means there is a substantial amount of CPU work delegated to a threadpool thread (from tokio or python), which is not something you want (it may cause threadpool starvation). In such case what you want is delegate the work to a dedicated background thread, and you wouldn't use async / await (instead invoke thread and send / consume work via queue for instance).

The reason for exposing an async version is because I'd imagine gxhash being used a lot in an asynchronous context. More specifically, in web servers. Running a synchronous function, regardless of whether it has dropped the GIL, within an async route handler will block the event loop from handling concurrent requests. If it takes 0.5s for a request to hash a file, that's 0.5s that the server will not be able to respond to any other requests. I have heard other developers mention that even 100 microseconds of blocking is unacceptable. From how I see it, there's no difference between how I would handle a synchronous IO-bound task and a CPU-bound task. As long as neither saturates CPU or memory usage, I would still run it within a separate thread, of which would still be susceptible to thread pool starvation.

I am not exactly a software design expert on this topic, so if you still think that it doesn't make sense, let's discuss this further. There's very little resources on this topic, especially in the context of Python and I am always looking to broaden my understanding.

It is correct, however this documentation is on the Hasher trait implementation which is a special Rust construct to build a hash pieces by pieces. The mention of arbitrary stream of bytes is actually a copy paste from the official documentation of that trait: https://doc.rust-lang.org/std/hash/trait.Hasher.html

You also used the same description here. Also, do you have plans to implement larger-than-memory hashing? After confirming that gxhash-py is stable, I am thinking of adding two more variants. A streaming variant for larger-than-memory inputs and a parallel variant that'll use rayon to hash chunks in parallel. Downside to this, is that the parallel and streaming variants will not emit the same hash as the other variants, given the same input. Not sure if you think this would be poor DX.

If you like this second option please let me know. That would imply moving transferring this repo to a "gxhash" organization and name it gxhash-rs, transfer gxhash-csharp and then finally gxhash-py.

I don't mind this, but personally, I think there's some beauty and credibility to have a monolith of a library with all the bindings in one place, similar to polars. It's a shame that you already made gxhash-csharp since that was what I was going to propose next using csbindgen which would generate a Rust binding for .NET and Unity without runtime marshalling. I am impartial to how you want to do this as long as it is clearly within the gxhash ecosystem, so either in a organisation or within this repository.

ogxd commented 1 week ago

The reason for exposing an async version is because I'd imagine gxhash being used a lot in an asynchronous context. More specifically, in web servers. Running a synchronous function, regardless of whether it has dropped the GIL, within an async route handler will block the event loop from handling concurrent requests. If it takes 0.5s for a request to hash a file, that's 0.5s that the server will not be able to respond to any other requests. I have heard other developers mention that even 100 microseconds of blocking is unacceptable. From how I see it, there's no difference between how I would handle a synchronous IO-bound task and a CPU-bound task. As long as neither saturates CPU or memory usage, I would still run it within a separate thread, of which would still be susceptible to thread pool starvation.

I am not exactly a software design expert on this topic, so if you still think that it doesn't make sense, let's discuss this further. There's very little resources on this topic, especially in the context of Python and I am always looking to broaden my understanding.

Let's take your example: We have a backend service handling a request, and at some point, and for some reason while processing that request it must hash some huge data held in memory (let's assume it's not on the disk for now). This operation is purely CPU-bound.
Now imagine these 4 options:

The hash is computed synchronously. This means it will effectively block the event loop thread for that duration. We agree it's not ideal.
The hash is computed asynchronously via async / await. In python, by default, this means the hash computation coroutine will be queued and scheduled at some point by the event loop, but still on the event loop thread. So it brings no improvement.
The hash is computed asynchronously via async / await, but this time you use a ThreadPoolExecutor to delegate this work to a pool of threads (called workers in this jargon). Here you're no longer blocking the event loop! However you might have another issue: your operation will block a worker dedicated to the processing of async / await operations for the same duration. This is called thread pool starvation (not to be confused with thread starvation!).
The hash is computed asynchronously, but not using async / await. Instead, the work is simply delegated to a background thread dedicated to processing such heavy operations. This is the way to go.

As a staff backend engineer I do have some experience on this kind of topic, but don't just take my words for it, there are some resources online on this subject. Here are some:

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-threadpool-starvation#resolving-threadpool-starvation

To eliminate ThreadPool starvation, ThreadPool threads need to remain unblocked so that they're available to handle incoming work items.

https://stackoverflow.com/a/71738644/2664927

Running such operations inside an async def endpoint will block the event loop; and hence, any further client requests will get blocked until the blocking operation is completed.

Now if you read from the disk, or compute a hash block by block, then it's another story. I'd likely use async / await to read from the disk or wait for receiving request bytes, and thus hash on-the-go. In that situation I'd say every chunk hash operation isn't going to be blocking as much as our first example, and thus it might be acceptable to do it synchronously.

You also used the same description here

Oh indeed! We must change that 👍

Also, do you have plans to implement larger-than-memory hashing?

In rust, that's one of the purposes of Hasher trait. You build a Hasher, you hash pieces by pieces, and then you "finalize" your hash. I'm not sure how easily this structure could be mapped to some python class however using this bindgen method.

winstxnhdw commented 1 week ago

The hash is computed asynchronously via async / await, but this time you use a ThreadPoolExecutor to delegate this work to a pool of threads (called workers in this jargon). Here you're no longer blocking the event loop! However you might have another issue: your operation will block a worker dedicated to the processing of async / await operations for the same duration. This is called thread pool starvation (not to be confused with thread starvation!).

Thank you for taking the time to write something up. I am still confused whether only CPU-bound functions can cause thread pool starvation? In Python, not all IO-bound functions have an async variant. For example, open is used to load a file from disk synchronously.

with open("model.bin", "rb") as f:
    file_bytes = f.read()

Performing such an operation would block the event loop and one way to avoid blocking the event loop would be to run open in another thread. Would this not suffer from thread pool starvation as well? If this was the only way to avoid blocking the event loop, would you recommend pushing this to a task queue?

In any case, it would be best to remove the async variants because it requires a blocking and really expensive copy step to convert &[u8] to Vec<u8>.

Also, polars has a CPU-bound API that they made async similarly to how I did it: https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect_async.html https://github.com/pola-rs/polars/blob/54218e7e35e3defd4b0801e820c56eea6b91e525/crates/polars-python/src/lazyframe/general.rs#L595

Oh indeed! We must change that 👍

I will wait for you to change the description and then align my docstrings with yours. Also, I am wondering if you plan to push this line to main?

In rust, that's one of the purposes of Hasher trait. You build a Hasher, you hash pieces by pieces, and then you "finalize" your hash. I'm not sure how easily this structure could be mapped to some python class however using this bindgen method.

I think Python Generators would work well here.

mqudsi commented 1 week ago

Now if you read from the disk, or compute a hash block by block, then it's another story. I'd likely use async / await to read from the disk or wait for receiving request bytes, and thus hash on-the-go. In that situation I'd say every chunk hash operation isn't going to be blocking as much as our first example, and thus it might be acceptable to do it synchronously.

I would humbly disagree, or at least say that it would only be acceptable with some caveats; this approach would improve p9x latency by blocking the event loop for a shorter amount of time per call, but the total time the event loop remains "unavailable" to handle events would remain the same (well, actually it would increase because of the function call and ffi overhead combined with the decreased gxhash-side performance of hashing smaller chunks at a time) thereby reducing total throughput.

In my opinion though, this isn't something that the library itself should necessarily manage (though I am not a pythonista and I am talking from the general perspective such as that taken in async rust or async C#, both of which I am considerably more familiar with), only because of the number of factors and permutations that have to be taken into account, all of which the downstream caller integrating gxhash into their project would know more about.

The subtleties here lie greatly in the specific application. If you are hashing content that fits comfortably in memory without causing gc issues or if the final content is going to be stored in memory regardless (i.e. is not being streamed to disk or network) then it might make more sense (given how fast gxhash itself is) to buffer the entirety of the content then make one threadpool-backed call to gxhash that doesn't block the event loop, that runs at max gxhash performance, and minimizes the function call/ffi overhead. But if you are working on streaming content that won't be stored (contiguously!) in memory thereafter, then you're obviously going to have to call gxhash per some smaller-sized slice at a time. In that case, the question of whether this should be done on a threadpool thread or directly from the async routine is a much more complicated question that would involve knowledge of the size of the input (i.e. if you are hashing 16 bytes at a time, the threadpool overhead, context switch, cache misses, etc is going to exceed the benefits of not blocking the event pool) while if you're processing still decently-sized chunks at a time it might make sense to spin up a threadpool task to handle the cpu-intensive work.

But I don't know what the accepted level of "handholding" here is in the python modules community (or even if python devs are interested in optimizations at this level to begin with, though what I wrote above is assuming at least some do).

winstxnhdw commented 1 week ago

But I don't know what the accepted level of "handholding" here is in the python modules community (or even if python devs are interested in optimizations at this level to begin with, though what I wrote above is assuming at least some do).

I always believed that good API design should include sane defaults while also giving the user the option to build their own specialised solution from scratch, and FWIW, I care deeply about such optimisations. I also think providing the user with an async variant is better DX but I can also see that it may mislead users to believe that the async variant is most optimal for their use case.

In that case, the question of whether this should be done on a threadpool thread or directly from the async routine is a much more complicated question that would involve knowledge of the size of the input (i.e. if you are hashing 16 bytes at a time, the threadpool overhead, context switch, cache misses, etc is going to exceed the benefits of not blocking the event pool)

I am not sure what could be much worse than blocking the event loop. When the event loop is blocked, no one can load your web page, your Prometheus service will no longer be able to scrape for metrics, and your monitoring service sends an alert to everyone that your service is down. Of course the other option is for users to push it to a task queue, but on a sufficiently large cluster with a known amount of users, is there really a need to force the user to pay the development complexity and overhead cost of using a task queue?

In that situation I'd say every chunk hash operation isn't going to be blocking as much as our first example, and thus it might be acceptable to do it synchronously.

Also, in Python, all synchronous functions are blocking. This means that hashing every chunk synchronously would still block the event loop until all chunks have been completely hashed. Of course, this is not the case if you are streaming the bytes in from a web/socket.

mqudsi commented 1 week ago

I am not sure what could be much worse than blocking the event loop.

The implication is that you are still blocking the event loop while it does the tasks I mentioned. e.g. one heavily optimized ffi call to gxhash to hash a 4-byte input might actually block the event loop less than the bookkeeping that same event loop thread has to do (synchronously) to send and retrieve the result to an off-thread worker for such a low-latency operation. And if you expand the scope past microbenchmarks, you need to take into account whether or not the "other thread" will cohabit a core that's already servicing another event loop for your application, the trashing of multiple threads' L1 cache lines, etc.

But again, this is only generally speaking for typical async event loops, without possessing arcane knowledge of the minutia of Python scheduler overhead, cpu core selection, etc as that is quite outside my wheelhouse.

winstxnhdw commented 1 week ago

In the context of Python, once you cross the FFI boundary and drop the GIL, everything from then on is effectively non-blocking. Unless you mean the hashing operation saturates the CPU.