uber / h3-py

Python bindings for H3, a hierarchical hexagonal geospatial indexing system
https://uber.github.io/h3-py
Apache License 2.0
842 stars 132 forks source link

h3.h3_distance() raising SystemError #169

Closed deeplook closed 3 years ago

deeplook commented 4 years ago

When using h3.h3_distance() to obtain the path distance between H3 cells (same resolution) I'm often getting a SystemError like below. Am I expecting the wrong thing or just doing something wrong? Are there any constraints I'm not aware of? The docstring doesn't mention any.

'''
---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-571-d834950e20a7> in <module>
     15 """
     16 import h3
---> 17 d = h3.h3_distance("8353b0fffffffff", "835804fffffffff")

[...]/lib/python3.7/site-packages/h3/api/_api_template.py in h3_distance(h1, h2)
    260         h2 = _in_scalar(h2)
    261 
--> 262         d = _cy.distance(h1, h2)
    263 
    264         return d

SystemError: <built-in function distance> returned NULL without setting an error
'''
import h3
d = h3.h3_distance("8353b0fffffffff", "835804fffffffff")

Both sample codes above are valid according to h3.h3_is_valid(). I have installed h3-py on macOS from conda-forge and h3.versions gives {'c': '3.6.4', 'python': '3.6.4'}.

Same on a Linux box:

Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import h3
   ...: c1, c2 = "8353b0fffffffff", "835804fffffffff"
   ...: assert h3.h3_is_valid(c1)
   ...: assert h3.h3_is_valid(c2)
   ...: h3.h3_distance(c1, c2)
---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-1-25e9eaa0702f> in <module>
      3 assert h3.h3_is_valid(c1)
      4 assert h3.h3_is_valid(c2)
----> 5 h3.h3_distance(c1, c2)

/opt/conda/lib/python3.8/site-packages/h3/api/_api_template.py in h3_distance(h1, h2)
    260         h2 = _in_scalar(h2)
    261 
--> 262         d = _cy.distance(h1, h2)
    263 
    264         return d

SystemError: <built-in function distance> returned NULL without setting an error
isaacbrodsky commented 4 years ago

It looks like these cells are on base cells 41 and 44 (you can find this on https://observablehq.com/@nrabinowitz/h3-index-inspector Base cells are the same as the resolution 0 parent of the cell.), indicating they're relatively far away on the globe. The h3_distance function has a limitation that it can only find grid distances between two cells that are on the same or neighboring base cells, which these are not. At this distance I'd suggest finding the distance in a measure like kilometers, for example by using the Haversine formula.

deeplook commented 4 years ago

Thanks, also for the inspector link! While this surely makes sense it would be nice to raise a more helpful exception and also mention this fact in the docstring of h3_distance().

ajfriend commented 4 years ago

Definitely agreed that we should have a better exception here, and document what's going on. I'll keep this issue open until we do :)

ajfriend commented 4 years ago

...however, as a workaround, how about something like this:

def my_line(a, b):    

    def dist(x):
        return h3.point_dist(h3.h3_to_geo(x), h3.h3_to_geo(b))

    path = [a]
    while a != b:
        a = min(h3.hex_ring(a), key=dist)
        path += [a]

    return path

def my_dist(a, b):
    return len(my_line(a, b)) - 1

I'm fairly certain that this always gives you a valid path between any two valid of the same resolution (even if their base cells are too far apart). However, I'm not sure this gives you the minimal path between the two cells, so the number coming out of my_dist would only be an upper bound on the true distance.

And, honestly, there's a chance that this does give you the right distance in all cases. (Anyone see a counterexample?) And even if it doesn't, I'd guess that it wouldn't be far off.

@nrabinowitz @dfellis @isaacbrodsky Would this strategy work in the C library to extend the functionality of h3Distance to handle farther apart cells? Maybe this method is far slower than the current method? Perhaps we could make safe/slow and unsafe/fast versions of the functions, and use similar fallback logic that we use elsewhere?

dfellis commented 4 years ago

@ajfriend that will definitely work, but it is both very slow and very memory intensive. I think it is O(n^2) in performance and O(n) in memory with large constant factors to boot?

But I do believe that it is guaranteed to provide the minimum distance, as it is literally checking all potential hexes at each distance until it finally finds the right value. If it is used in a background job on a machine with a lot of RAM, it is fine. If you want it anywhere close to a hot loop, it's not a great idea.

I have thought of another workaround for this problem, but I haven't had the time to come close to working on it since I started working on alan-lang. Basically, take a great circle arc from the origin hexagon to the destination hexagon and compute all of the intersection points of that arc with the arcs defining the boundaries of the icosahedron edges. Then calculate the hexagons for each point and then calculate the distances between each piece wise and sum them.

It should also be the minimum distance (we can check with your code above) but should be roughly the same execution time as the existing algorithm.

ajfriend commented 4 years ago

It can be O(1) in memory (you don't need to actually keep the list in memory if all you want is the integer distance). And I think it would be O(n) in compute (where n is the final distance), because it isn't searching all hexagons at each step, it's only searching the neighbors of the current "working" hexagon. But this is assuming the operation of finding a hex's 5 or 6 neighbors is O(1), but I'm not sure how that algorithm is implemented.

Asymptotically, that's about as good as you can do, right? My only concern with the above would be high constant factors, like you said.

ajfriend commented 4 years ago

Oh, and now I'm realizing the code above might be super confusing because it's overloading "distance" to mean both the integer "graph" distance between two cells, and the float great circle distance between two points (in this case, the centroids of the cells). :/

dfellis commented 4 years ago

It can be O(1) in memory (you don't need to actually keep the list in memory if all you want is the integer distance). And I think it would be O(n) in compute (where n is the final distance), because it isn't searching all hexagons at each step, it's only searching the neighbors of the current "working" hexagon. But this is assuming the operation of finding a hex's 5 or 6 neighbors is O(1), but I'm not sure how that algorithm is implemented.

Asymptotically, that's about as good as you can do, right? My only concern with the above would be high constant factors, like you said.

So it's O(n) in memory because each ring has 6n nodes (except ring zero having 1, but that's irrelevant here). That set size is dependent on the distance.

It is O(n^2) in runtime because each set calculation has to build up through the rings to get the resulting set - you can't give it the prior ring as a starting point (right now, anyway.

ajfriend commented 4 years ago

You never build the n-ring, you only ever look at the 1-ring around the current working node, so the memory should still be O(1). That's what the code above is doing.

If a is the current working node, then h3.hex_ring(a) is only ever a set of 5 or 6 cells (k=1 is a default parameter).

And it's O(n) in time, because there are n steps (n different working nodes), where each step updates the current working node to the closest (in terms of great circle distance) neighbor out of the the 1-ring around the current working node (only 6 cells at each step).

I think this is basically what @nrabinowitz did in his pathfinding notebook, except that for this application we don't need the full generality of the A* algorithm (no need for backtracking; we can get away with the greedy heuristic).

dfellis commented 4 years ago

Ahh, sorry, I read your code incorrectly.

You're right on the memory consumption and compute time. Since it isn't successive rings from the origin you may be right that distortion could cause haversine distance to not monotonically increase with grid distance. I don't have a good idea on how to confirm that isn't the case.

nrabinowitz commented 4 years ago

A few comments here:

ajfriend commented 4 years ago
  • You'd want k_ring not hex_ring unless you're sure there are no pentagons involved.

Hmmm. I thought the difference was that kRing was the "filled-in disk" around a cell, and hexRing was the "hollow ring".

Is it the case that hexRing will fail even for a "1-ring" if it hits a pentagon?

@nrabinowitz

nrabinowitz commented 4 years ago

Is it the case that hexRing will fail even for a "1-ring" if it hits a pentagon?

I think so:

 * A nonzero failure code may be returned in some cases, for example,
 * if a pentagon is encountered.

We really ought to offer a ring function without this limitation if possible, e.g. falling back to a slow version like we do with kRing.

ajfriend commented 4 years ago

Actually, scrap all that; I've got a new idea! @nrabinowitz @dfellis

Check out this notebook: https://gist.github.com/ajfriend/d2e6d71499c49e26a627e844dd55b573

If we can find a cell that we know is in at least one shortest path between the two cells, then we can bisect the problem down until we get subproblems that can be solved by the base distance function.

I think that the method I have there of taking the midpoint for the cells works in all cases, but we'd have to verify.

The one case it currently does not work on is if the two cell centroids are exactly (or numerically) diametrically opposite each other on the sphere.

And this would be easy (maybe even easier) to implement in C, and thus have an extended version of the distance algorithm. And the same approach would work for h3_line.

nrabinowitz commented 4 years ago

Ooh, that's cool. One issue here is that at present, h3Line is actually a little "bent" across icosahedron edges w/r/t the real great arc - see https://observablehq.com/@nrabinowitz/h3-line-algo-comparison (set res to 4 or so and zoom out):

image

So @dfellis's suggestion, getting cells for the great arc/icosahedron edge intersections, which is what @isaacbrodsky and I had discussed, would likely give us slightly nicer great-arc-conforming lines. I'd worry a little that we'd get kind of jagged lines with the proposed option (though the grid distances would likely be correct).

ajfriend commented 4 years ago

Ah, bummer. Yeah, it's fairly easy to get a counter-example from the type of path you have there. For example,

h1 = '8428309ffffffff'
h2 = '84361c9ffffffff'
c = midcell(h1, h2)

Then h3.h3_distance(h1,h2) == 57, but h3.h3_distance(h1, c) + h3.h3_distance(c, h2) == 58 :(

It isn't off by much tho:

Screen Shot 2020-10-07 at 2 49 55 PM

The red cells are h1 and h2, the yellow cell is c, and the orange cells are all the cells which are on some shortest path between h1 and h2.