uber / h3-py

Python bindings for H3, a hierarchical hexagonal geospatial indexing system
https://uber.github.io/h3-py
Apache License 2.0
810 stars 130 forks

Functions on collections of input #237

Open NickLucche opened 2 years ago

NickLucche commented 2 years ago

Hey, I was wondering whether there's a more efficient way to run functions on a list of inputs (e.g. hex ids). Using a simple loop like [h3.h3_to_geo_boundary(h, geo_json=True) for h in hexes] doesn't seem to be fast enough, considering the loop could run at a lower level (in C).

Are there any efforts at the moment to provide an API that supports this?

dfellis commented 2 years ago

To answer your exact question if there are efforts on that, yes, there is the unstable vectorization API: https://github.com/uber/h3-py/blob/master/src/h3/unstable/vect.py backed by Cython code: https://github.com/uber/h3-py/blob/master/src/h3/_cy/unstable_vect.pyx

It does not vectorize the particular API you called, yet, and the API presented may change based on feedback (such as your own!), but there it is.

If the Cython code is understandable to you I don't think @ajfriend or @kylebarron would mind a PR to add the method(s) you're interested in? I know work on this front slowed down some with the v3-to-v4 migration work in progress.

NickLucche commented 2 years ago

Thanks a lot for your help, the Cython approach looks promising for speeding up operations on datasets. I'll take a look at the code and see if I can lend a hand on this issue.

kylebarron commented 2 years ago

The standard way to provide vectorization is to have numpy arrays as input and output. This can be difficult in cases where dimensions are not constant. I mention this because we'd need to decide how to handle functions where the output varies over hexagons and the few pentagons.

Specifically, I'm guessing that h3_to_geo_boundary creates a different number of coordinates depending on whether the input is a pentagon (5 output coords) or a hexagon (6 output coords). In a vectorized API we'd need to explain to the user how to handle a missing last coordinate of a hexagon.

An input array of N h3 cells passed to h3_to_geo_boundary_vect would return an array of shape (N, 2, 6), where the second axis is lon/lat coords and the third axis is the coords within each linearring.
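One way to reconcile a fixed-shape output array with variable ring lengths (purely an illustrative sketch in plain Python, not an h3-py API) is to pad shorter rings with NaN so a pentagon's missing vertex is explicit:

```python
import math

# Illustrative only: pad variable-length boundary rings (5 points for a
# pentagon, 6 for a hexagon) to a fixed width so they fit one rectangular
# array; missing trailing coordinates become NaN. The rings here are fake
# data, not real h3 output.
def pad_rings(rings, width=6):
    padded = []
    for ring in rings:
        row = list(ring) + [(math.nan, math.nan)] * (width - len(ring))
        padded.append(row)
    return padded

hex_ring = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0),
            (1.0, 2.0), (0.0, 2.0), (-0.5, 1.0)]
pent_ring = hex_ring[:5]  # pretend pentagon: one fewer vertex

table = pad_rings([hex_ring, pent_ring])
```

The caller then has to know that NaN marks "no vertex", which is exactly the kind of convention that would need documenting in a vectorized API.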

Once https://github.com/uber/h3-py/pull/234 is merged you could even test a loop in your own code, and just link to the installed h3-py package.

dfellis commented 2 years ago

> Specifically, I'm guessing that h3_to_geo_boundary creates a different number of coordinates depending on whether the input is a pentagon (5 output coords) or a hexagon (6 output coords). In a vectorized API we'd need to explain to the user how to handle a missing last coordinate of a hexagon.

It's even worse than that. :/ For "Class III" cells (the odd-numbered resolutions) you also get distortion points where a hexagon or pentagon crosses an edge of the underlying icosahedron, so I think you can have up to 10 points pop out of h3_to_geo_boundary?

kylebarron commented 2 years ago

There are some ways around that, to encode polygons of varying lengths as a single contiguous buffer. One initiative I've been hoping to push forward is the "geoarrow" spec and specifically its WIP arrow-native geometry encoding. With such a solution the Cython loop would iterate over coordinates and add to the Arrow array, then pass a single buffer back to Python.

I'm just putting that as an example, and not necessarily advocating for that in h3 though, as I doubt we'd want such a dependency. But if we have a performant public Cython API, an external package could implement that.
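For reference, the general layout idea behind such a contiguous encoding (sketched here in plain Python; real geoarrow/Arrow uses typed buffers, but the structure is the same) is a flat coordinate buffer plus an offsets array:

```python
# Illustrative sketch of an Arrow-style encoding for variable-length rings:
# all coordinates live in one flat buffer, and ring i occupies the slice
# coords[offsets[i]:offsets[i+1]]. No padding is needed, and the buffer
# stays contiguous regardless of how many vertices each ring has.
def encode_rings(rings):
    coords = []
    offsets = [0]
    for ring in rings:
        coords.extend(ring)
        offsets.append(len(coords))
    return coords, offsets

def decode_ring(coords, offsets, i):
    return coords[offsets[i]:offsets[i + 1]]

rings = [
    [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)],              # 3 points
    [(2.0, 2.0), (3.0, 2.0), (3.0, 3.0), (2.0, 3.0)],  # 4 points
]
coords, offsets = encode_rings(rings)
```

A Cython loop could append to such buffers directly and hand a single allocation back to Python, which is the appeal over per-cell tuples.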

NickLucche commented 2 years ago

I didn't think about the possibility of getting back up to 10 points per polygon; that's kind of a problem, since it means working with jagged arrays. Do you think it would be possible/make sense to have a simpler version returning a list of polygons/memviews from the Cython API? Even though it should only improve the speed of the for loop over the Python one, I believe other functions with a similar signature could benefit from it.

dfellis commented 2 years ago

> I didn't think about the possibility of getting back up to 10 points per polygon, that's kind of a problem to work with such jagged arrays.. do you think it would be possible/make sense to have a simpler version returning a list of polygons/memviews from the cython api? Even tho it should only improve the speed of the for loop over the python one, I believe other functions with a similar signature could benefit from it.

Well, you can avoid it by sticking with Class II cells (the even-numbered resolutions); those are all as @kylebarron assumed: 5 or 6 points only.

It would also be possible to extract the "main" hexagon/pentagon points by taking the cell, first getting all unidirectional edges leaving it: https://h3geo.org/docs/api/uniedge#geth3unidirectionaledgesfromhexagon and then calling https://h3geo.org/docs/api/uniedge#geth3unidirectionaledgeboundary to get the line definition of each segment and keeping just its first point. But that's definitely not going to be fast.

NickLucche commented 2 years ago

Okay, I can see some utility in having a fast function even if limited to even resolutions; furthermore, if I understand correctly, pentagons should only appear at the vertices of the icosahedron, which are placed in generally "not-so-interesting" areas, and there are always exactly 12 of them at every resolution.

I believe I can work with these assumptions for my use case, even if I were to simply check the bounding box of an area against the icosahedron vertices beforehand to spot pentagons. Thanks a lot!

ajfriend commented 2 years ago

@NickLucche, now that we've landed #234, you should be able to cimport the C and Cython functions to write fast Cython code however you'd like.

However, I don't think we'll officially release these changes in a 3x version of the library, since we're currently working on v4 and we need to clean up the Cython interface. But in the meantime, you can still play around with the changes by installing the unreleased code directly from the GitHub repo with something like:

pip install git+https://github.com/uber/h3-py

Here's a quick example notebook: https://gist.github.com/ajfriend/167afffde74bc08b0e809e10b92ecde3

NickLucche commented 2 years ago

Thanks for the heads-up, I'll be sure to try that out!

NickLucche commented 2 years ago

Following up on that, I've tested something very simple like this:

from cython cimport boundscheck, wraparound
from h3._cy.h3lib cimport H3int
import h3._cy as h3cy
import numpy as np

@boundscheck(False)
@wraparound(False)
cpdef void h3_to_geo_boundary_vect(
    H3int[:] hexes,
    double[:, :, :] out
):

    cdef Py_ssize_t i, j

    for i in range(len(hexes)):
        pol = h3cy.cell_boundary(hexes[i], True)  # geo_json=True: closed ring
        for j in range(7):  # assumes every cell is a hexagon (7-point closed ring)
            out[i, j, 0] = pol[j][0]
            out[i, j, 1] = pol[j][1]

# test code; hexes is a list of h3 ids covering the whole area of Italy
it = [h3cy.hex2int(h) for h in hexes]
out = np.zeros((len(hexes), 7, 2), dtype=np.float64)
h3_to_geo_boundary_vect(h3cy.from_iter(it), out)

and I've only observed very mild speedups when applied to hexes on the order of ~3e5 up to ~3e6, compared to the original [h3.h3_to_geo_boundary(h, geo_json=True) for h in hexes]. Here (np.zeros((len(hexes), 7, 2))) I've wrongly assumed that cell_boundary only returns hexagons, but as you've rightly pointed out this is not always true (I really just wanted to test a lower bound in terms of speed).

I guess that unless I've made some very silly mistake in my sample code, for my use case one could write a more efficient version of cell_boundary that always works in geo_json=True mode and writes results into a pre-allocated area of memory instead of returning a tuple to be copied over later. Still, I don't expect to be able to go much faster than this, due to the time it takes simply to execute the cell_boundary algorithm itself.
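As a rough pure-Python illustration of that difference (not the actual Cython internals; the cell values and 7-point rings are fabricated here), writing into a caller-supplied buffer avoids building an intermediate tuple per cell:

```python
# Illustrative only: contrast returning a fresh tuple per call with writing
# into a caller-provided, pre-allocated buffer. In Cython the second style
# additionally avoids Python-object creation entirely.
def boundary_as_tuple(cell):
    # stand-in for cell_boundary: fabricate 7 (lng, lat) pairs
    return tuple((float(cell + j), float(cell - j)) for j in range(7))

def boundary_into_buffer(cell, out, i):
    # write the same 7 fabricated points into row i of the buffer
    for j in range(7):
        out[i][j][0] = float(cell + j)
        out[i][j][1] = float(cell - j)

cells = [10, 20, 30]
out = [[[0.0, 0.0] for _ in range(7)] for _ in range(len(cells))]
for i, c in enumerate(cells):
    boundary_into_buffer(c, out, i)
```

In pure Python the two are close in cost; the win only materializes when the inner loop runs without the interpreter, which is why a buffer-writing cell_boundary variant would have to live on the Cython side.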

kylebarron commented 2 years ago

Some notes:

Given the above, if you used h3lib.h3ToGeoBoundary(h, &gb) directly, and copied that output into a numpy array, you could probably get a significant speedup.