uber / h3-py

Python bindings for H3, a hierarchical hexagonal geospatial indexing system
https://uber.github.io/h3-py
Apache License 2.0
834 stars 132 forks source link

What is the status / plan for vectorized APIs? #341

Open wingkitlee0 opened 10 months ago

wingkitlee0 commented 10 months ago

Currently, what is the recommend way to do this (with this package)? For example, I have two numpy arrays of lat/lon and want to convert them into an array of h3 hex

wingkitlee0 commented 8 months ago

For reference: https://github.com/uber/h3-py/compare/master...wingkitlee0:h3-py:basic-vectorized

ajfriend commented 7 months ago

We tried previously having an "unstable" API to experiment with things like vector functions: https://github.com/uber/h3-py/pull/147

I'd be interested in bringing that back in the 4.0 branch, probably still as an unstable/experimental module to indicate the API may change in the future as we figure out what we want it to look like.

wingkitlee0 commented 7 months ago

@ajfriend Thanks for the info.

wingkitlee0 commented 6 months ago

@ajfriend I took a second look on this issue...

A few questions:

  1. It seems that h3-py was designed to be "self-contained" without numpy. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.
  2. There are a few backends: basic_str, basic_int, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g., api.experimental.vectorized) which does not use the symlink-ed __init__.py. Does it make sense?
  3. Do you have a sense of how much faster a vectorized version would be? For now, if I move the for-loop from python to C I get around 30% faster in integer representation.

Thanks

ajfriend commented 6 months ago
  1. It seems that h3-py was designed to be "self-contained" without numpy. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.

I haven't been deep in the Cython code in a while, but if my memory serves, numpy shouldn't be a hard dependency because it is compatible with the https://docs.python.org/dev/library/stdtypes.html#memoryview interface, which is how we interface with Cython. See some of the discussion around memoryview objects in https://uber.github.io/h3-py/api_comparison.html . To that point, h3.api.memview_int is actually the base API that the other three are built on top of.

A memoryview object is easily converted into a numpy array (and can be done without copying the data; see the docs link above). Any function expecting a memoryview object can accept a numpy array without having to "know about" (or import) numpy arrays.

Because numpy is compatible with the memoryview interface, numpy isn't a dependency in terms of the library being able to accept numpy arrays as inputs. The only reason we include numpy as an optional dependency is that you need it if you want to return numpy arrays as outputs (without expecting the user to convert a memoryview).

(I'll try to get to your other points soon.)

ajfriend commented 6 months ago
  • There are a few backends: basic_str, basic_int, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g., api.experimental.vectorized) which does not use the symlink-ed __init__.py. Does it make sense?

That makes sense. Although, recently, I've been thinking there might be another potential way to organize this rather than putting the code in api.experimental.vectorized. h3-py exposes its Cython interface (although I'm not sure we officially support it yet) and other Cython packages should be able to simply add it as a Cython dependency. We should probably come up with a short example repo showing people how to do this (#373).

This would let others write fast Cython code involving the h3-py C and Cython functions, without having to have those live inside the main repo. This vectorized API could be one such example. We don't necessarily have to do it this way, but I wanted to raise the idea and see what other folks thought.

ajfriend commented 6 months ago
  1. Do you have a sense of how much faster a vectorized version would be? For now, if I move the for-loop from python to C I get around 30% faster in integer representation.

I don't off the top of my head. I think it would depend on the application. That being said, I think developing some benchmarks would be helpful. You might be able to test things fairly quickly with something like https://github.com/uber/h3-py/blob/master/tests/test_cython/cython_example.pyx

wingkitlee0 commented 6 months ago

another potential way to organize this other Cython packages should be able to simply add it as a Cython dependency.

That's an interesting idea. I think exposing cython functions is good.

However, I think Cython is hard for most python users. Most people probably prefer pip install h3 and get all the good stuff. In other word, I would like to see more people to use H3 for their geospatial projects and it would be easier if we have only one h3 python library. Maybe a limited subset of vectorized functions is good enough to start. More advanced users can build their library with the Cython functions..

Should we continue the discussion back in Slack (for more visibility?)

BTW, thanks for answering my questions. After reading a little bit more memoryview, I agree that we do not need numpy in Cython for this.

ajfriend commented 6 months ago

For what I'm thinking, the average user would not need to know anything about Cython. If we produced a separate package (called whatever, maybe h3vectorized for example), users would be able to just pip install that. But development would be much easier in the h3vectorized repo, since all the hard CMake and scikit-build stuff is handled separately by the h3-py repo (in the same way that a package that cimports numpy doesn't need to build all of numpy from scratch each time).

Since we're not sure exactly what the vectorized API looks like just yet, I think that could provide a lot of benefits: You're less strongly coupled to the h3-py repo and have more freedom to experiment with API design. In a lot of ways, that's a more natural way to explore an API than having an experimental submodule. You would also not be subject to the fairly rigid versioning requirements we have for h3 bindings to match the core library version in major and minor semver numbers.

If we later finalize an API, we could migrate it into h3-py directly or keep it as a separate package.

Again, I'm not saying this is the only way to do it, but I do think it is an attractive option. But, all in all, once we get that first example package working, I'd expect this to be the much easier path.

ajfriend commented 6 months ago

Also, feel free to flag the discussion in Slack, but I'd like to keep the main discussion on Github since Slack history gets lost relatively quickly.

ajfriend commented 6 months ago

I threw together an example that still needs some work, but this would be the idea: https://github.com/ajfriend/h3_example_package

wingkitlee0 commented 6 months ago

Thanks for the clarifications! I think it makes all sense now. I didn't realize h3-py is more than a Cython wrapper but also build the H3 C library.

migrate it into h3-py directly or keep it as a separate package.

sounds good

Thanks also for the example.. I will try to build something on it..