Open wingkitlee0 opened 10 months ago
We tried previously having an "unstable" API to experiment with things like vector functions: https://github.com/uber/h3-py/pull/147
I'd be interested in bringing that back in the 4.0 branch, probably still as an unstable/experimental module to indicate the API may change in the future as we figure out what we want it to look like.
@ajfriend Thanks for the info.
@ajfriend I took a second look on this issue...
A few questions:
h3-py
was designed to be "self-contained" without numpy
. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.basic_str
, basic_int
, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g., api.experimental.vectorized
) which does not use the symlink-ed __init__.py
. Does it make sense?Thanks
- It seems that
h3-py
was designed to be "self-contained" withoutnumpy
. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.
I haven't been deep in the Cython code in a while, but if my memory serves, numpy
shouldn't be a hard dependency because it is compatible with the https://docs.python.org/dev/library/stdtypes.html#memoryview interface, which is how we interface with Cython. See some of the discussion around memoryview
objects in https://uber.github.io/h3-py/api_comparison.html . To that point, h3.api.memview_int
is actually the base API that the other three are built on top of.
A memoryview
object is easily converted into a numpy
array (and can be done without copying the data; see the docs link above). Any function expecting a memoryview
object can accept a numpy
array without having to "know about" (or import) numpy
arrays.
Because numpy
is compatible with the memoryview
interface, numpy
isn't a dependency in terms of the library being able to accept numpy
arrays as inputs. The only reason we include numpy
as an optional dependency is that you need it if you want to return numpy
arrays as outputs (without expecting the user to convert a memoryview
).
(I'll try to get to your other points soon.)
- There are a few backends:
basic_str
,basic_int
, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g.,api.experimental.vectorized
) which does not use the symlink-ed__init__.py
. Does it make sense?
That makes sense. Although, recently, I've been thinking there might be another potential way to organize this rather than putting the code in api.experimental.vectorized
. h3-py
exposes its Cython interface (although I'm not sure we officially support it yet) and other Cython packages should be able to simply add it as a Cython dependency. We should probably come up with a short example repo showing people how to do this (#373).
This would let others write fast Cython code involving the h3-py
C and Cython functions, without having to have those live inside the main repo. This vectorized API could be one such example. We don't necessarily have to do it this way, but I wanted to raise the idea and see what other folks thought.
- Do you have a sense of how much faster a vectorized version would be? For now, if I move the for-loop from python to C I get around 30% faster in integer representation.
I don't off the top of my head. I think it would depend on the application. That being said, I think developing some benchmarks would be helpful. You might be able to test things fairly quickly with something like https://github.com/uber/h3-py/blob/master/tests/test_cython/cython_example.pyx
another potential way to organize this other Cython packages should be able to simply add it as a Cython dependency.
That's an interesting idea. I think exposing cython functions is good.
However, I think Cython is hard for most python users. Most people probably prefer pip install h3
and get all the good stuff. In other word, I would like to see more people to use H3 for their geospatial projects and it would be easier if we have only one h3 python library. Maybe a limited subset of vectorized functions is good enough to start. More advanced users can build their library with the Cython functions..
Should we continue the discussion back in Slack (for more visibility?)
BTW, thanks for answering my questions. After reading a little bit more memoryview
, I agree that we do not need numpy in Cython for this.
For what I'm thinking, the average user would not need to know anything about Cython. If we produced a separate package (called whatever, maybe h3vectorized
for example), users would be able to just pip install
that. But development would be much easier in the h3vectorized
repo, since all the hard CMake
and scikit-build
stuff is handled separately by the h3-py
repo (in the same way that a package that cimport
s numpy
doesn't need to build all of numpy
from scratch each time).
Since we're not sure exactly what the vectorized API looks like just yet, I think that could provide a lot of benefits: You're less strongly coupled to the h3-py
repo and have more freedom to experiment with API design. In a lot of ways, that's a more natural way to explore an API than having an experimental
submodule. You would also not be subject to the fairly rigid versioning requirements we have for h3
bindings to match the core library version in major and minor semver
numbers.
If we later finalize an API, we could migrate it into h3-py
directly or keep it as a separate package.
Again, I'm not saying this is the only way to do it, but I do think it is an attractive option. But, all in all, once we get that first example package working, I'd expect this to be the much easier path.
Also, feel free to flag the discussion in Slack, but I'd like to keep the main discussion on Github since Slack history gets lost relatively quickly.
I threw together an example that still needs some work, but this would be the idea: https://github.com/ajfriend/h3_example_package
Thanks for the clarifications! I think it makes all sense now. I didn't realize h3-py
is more than a Cython wrapper but also build the H3 C library.
migrate it into h3-py directly or keep it as a separate package.
sounds good
Thanks also for the example.. I will try to build something on it..
Currently, what is the recommend way to do this (with this package)? For example, I have two numpy arrays of lat/lon and want to convert them into an array of h3 hex