uber / h3-java

Java bindings for H3, a hierarchical hexagonal geospatial indexing system
https://uber.github.io/h3/
Apache License 2.0

JVM crash on large polyfills #21

Open willcohen opened 6 years ago

willcohen commented 6 years ago

When the following is added to TestH3Core.java, the JVM crashes in h3.polyfill before the tests complete. The same polyfill works at resolution 9.


    @Test
    public void testPolyfillLarge() {
        // Crashes the JVM at resolution 10; the same call passes at resolution 9.
        List<Long> hexagons = h3.polyfill(
                ImmutableList.<GeoCoord>of(
                        new GeoCoord(41.5, -70.4),
                        new GeoCoord(41.5, -69.8),
                        new GeoCoord(42.1, -69.8),
                        new GeoCoord(42.1, -70.4)
                ), null,
                10
        );

        assertTrue(hexagons.size() > 1000);
    }

Is this an issue with the Java wrapper, or is this impossible to do with the core library?

I don't know C, but adding a similar test to h3's testPolyfill.c (naively copying the existing style) still seems to work at resolution 11. If I increase the resolution past 11, it eventually segfaults there too.

GeoCoord capeCodVerts[] = {
        {0.7243116395776468, -1.2287117934040082}, {0.7243116395776468, -1.218239817892042},
        {0.7347836150896128, -1.218239817892042}, {0.7347836150896128, -1.2287117934040082}};
Geofence capeCodGeofence;
GeoPolygon capeCodGeoPolygon;

// Setup, alongside the existing fixtures in testPolyfill.c:
capeCodGeofence.numVerts = 4;
capeCodGeofence.verts = capeCodVerts;
capeCodGeoPolygon.geofence = capeCodGeofence;
capeCodGeoPolygon.numHoles = 0;

TEST(polyfillLarge) {
    // Size the output buffer and fill at the same resolution
    // (11 passes; past 11 this segfaults).
    int numHexagons = H3_EXPORT(maxPolyfillSize)(&capeCodGeoPolygon, 11);
    H3Index* hexagons = calloc(numHexagons, sizeof(H3Index));

    H3_EXPORT(polyfill)(&capeCodGeoPolygon, 11, hexagons);
    int actualNumHexagons = 0;
    for (int i = 0; i < numHexagons; i++) {
        if (hexagons[i] != 0) {
            actualNumHexagons++;
        }
    }

    t_assert(actualNumHexagons > 10000, "got lots of hexagons");
    free(hexagons);
}
dfellis commented 6 years ago

Part of this is simply the relative memory consumption of bare C versus Java with its object wrapping (which is why the C and Java tests fail at different resolutions).

Part of it is the inefficient (but guaranteed correct) way polyfill operates: it determines the minimum circle necessary to surround the bounding box of your polygon, determines the k-ring-sized hexagonal fill necessary to cover that circle, and then performs point-in-polygon searches across that candidate set to find the hexagons actually contained in the polygon.
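
Roughly, that strategy looks like the sketch below (illustrative only: boundingBox, bboxCenter, circumscribingRadiusKm, kRingsForRadiusKm, and pointInside are stand-in helper names, not the core library's actual internals):

    // Sketch of the current polyfill strategy; helper names are hypothetical.
    BBox bbox;
    boundingBox(&polygon->geofence, &bbox);           // 1. bounding box
    GeoCoord ctr;
    bboxCenter(&bbox, &ctr);
    double radiusKm = circumscribingRadiusKm(&bbox);  // 2. minimum enclosing circle
    H3Index origin = geoToH3(&ctr, res);
    int k = kRingsForRadiusKm(radiusKm, res);         // 3. k-ring cover of the circle
    int maxIdx = maxKringSize(k);
    H3Index* candidates = calloc(maxIdx, sizeof(H3Index));
    kRing(origin, k, candidates);
    for (int i = 0; i < maxIdx; i++) {                // 4. point-in-polygon filter
        if (candidates[i] == 0) continue;             // unused slot
        GeoCoord c;
        h3ToGeo(candidates[i], &c);
        if (pointInside(&polygon->geofence, &c)) {
            /* keep candidates[i] in the output */
        }
    }
    free(candidates);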

And then there's the practical consideration that there are over 33 billion hexagons on the planet at resolution 10, so even the most compact single-resolution representation (every cell at resolution 10) would require over 250GB of RAM. (The highest resolution, 15, has over 569 trillion hexagons on the planet and would require a little over 4 petabytes of RAM to store.)
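
For scale, assuming 8 bytes per stored H3Index:

    res 10:  ~33.9e9 cells   × 8 B/cell ≈ 252 GiB
    res 15:  ~569.7e12 cells × 8 B/cell ≈ 4.05 PiB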

It would be possible to represent this with far fewer hexagons if the output set were mixed-resolution (a compacted set), but I'm hesitant to build a polyfill that also inlines compact, because of how easy it would be to produce sets that couldn't practically be uncompacted, as described above. It also seemed unlikely to us that anyone working with polygons of that size would need meter-level accuracy.

willcohen commented 6 years ago

It's definitely true that meter-level accuracy of that particular shape is unnecessary. I was running polyfill at meter-level accuracy on the Census Bureau's census blocks for Massachusetts, to replace the geometries with hexagons -- resolution 10 followed by a compact seemed reasonable to make sure urban-center blocks were well represented. Unfortunately, it turns out the Census Bureau also represents the entire edge of Outer Cape Cod as a single block, hence the crash. I suppose a long thin strip is a particularly extreme case for the circle/k-ring approximation.

Does it seem right that a workaround (without asking h3 to return compacted results and open that can of worms) is for me to first estimate the output size using the same process maxPolyfillSize uses internally; if the estimate is too large, subdivide the polygon until each piece's estimate is acceptable; polyfill and compact each piece; and potentially uncompact the merged set if maxUncompactSize's estimate seems reasonable?
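
Something like this C sketch against the core API (assuming the polygon has been reduced to an axis-aligned rectangle for simplicity; the SIZE_BUDGET threshold and the bisection are my assumptions, coordinates are in radians, and the compact/merge/dedupe step is elided):

    // Recursively split a lat/lng rectangle until maxPolyfillSize's
    // estimate fits a budget, then polyfill each piece.
    #define SIZE_BUDGET 4000000  // assumed per-piece allocation budget

    void polyfillRect(double latMin, double lngMin, double latMax,
                      double lngMax, int res) {
        GeoCoord verts[4] = {{latMin, lngMin}, {latMin, lngMax},
                             {latMax, lngMax}, {latMax, lngMin}};
        Geofence fence = {.numVerts = 4, .verts = verts};
        GeoPolygon poly = {.geofence = fence, .numHoles = 0, .holes = NULL};

        int est = maxPolyfillSize(&poly, res);
        if (est > SIZE_BUDGET) {
            // Too big: bisect along the longer edge and recurse.
            if (latMax - latMin > lngMax - lngMin) {
                double mid = (latMin + latMax) / 2;
                polyfillRect(latMin, lngMin, mid, lngMax, res);
                polyfillRect(mid, lngMin, latMax, lngMax, res);
            } else {
                double mid = (lngMin + lngMax) / 2;
                polyfillRect(latMin, lngMin, latMax, mid, res);
                polyfillRect(latMin, mid, latMax, lngMax, res);
            }
            return;
        }

        H3Index* out = calloc(est, sizeof(H3Index));
        polyfill(&poly, res, out);
        // ... compact this piece, merge it into the running result,
        //     and dedupe anything on the shared cut lines ...
        free(out);
    }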

dfellis commented 6 years ago

@willcohen that would definitely work around this issue for the time being, but the current polyfill is pretty bad in both space and time usage (it was just the simplest to implement correctly across icosahedron boundaries).

This pull request is actually a stepping stone to a possible solution. If we can return to a simple 2D representation of the hexagons, then we can first trace "lines" of hexagons around the bounding box and then simply test those hexagons and all of the IJ coordinates within the defined space to determine which belong in the polyfill. (That implementation does not support crossing icosahedron edges, however, so it's only part of the solution.)
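
On a single icosahedron face, that could look roughly like this (experimentalH3ToLocalIj / experimentalLocalIjToH3 are the local-IJ entry points; the IJ bounds and the pointInside test are stand-ins):

    // Sketch: walk a rectangle in local IJ space around the polygon's
    // bounding box and keep cells whose centers fall inside the polygon.
    CoordIJ ij;
    for (ij.i = minI; ij.i <= maxI; ij.i++) {
        for (ij.j = minJ; ij.j <= maxJ; ij.j++) {
            H3Index cell;
            // Nonzero means the cell is too far from origin or crosses an
            // icosahedron edge -- the unsolved part mentioned above.
            if (experimentalLocalIjToH3(origin, &ij, &cell) != 0) continue;
            GeoCoord c;
            h3ToGeo(cell, &c);
            if (pointInside(&geofence, &c)) {  // hypothetical test
                /* keep cell */
            }
        }
    }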

The advantage of that approach is twofold:

  1. maxPolyfillSize would always produce a smaller estimate than before (it would no longer need a hexagon that contains a circle that contains the bounding box), and the improvement grows the less square the bounding box is.
  2. An exactPolyfillSize function could be added for edge cases like the Cape Cod geofence you've described (see the sketch after this list). It would run the entire polyfill operation described above, incrementing a counter on matches instead of saving each H3Index into an array, so it would tell you exactly how many hexagons to allocate memory for, at the cost of roughly doubling the compute time.
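
A sketch of that second idea (exactPolyfillSize is hypothetical; the candidate scan is the one described above):

    // Hypothetical: identical traversal to polyfill, but count matches
    // instead of writing them into a caller-allocated array.
    int exactPolyfillSize(const GeoPolygon* geoPolygon, int res) {
        int count = 0;
        // for each candidate cell in the covering k-ring:
        //     if its center is inside geoPolygon: count++;
        return count;
    }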

The disadvantage is the icosahedron constraint. There was talk of automatically "unfolding" icosahedrons and rotating coordinates to handle crossing a single edge, but we couldn't figure out how to handle multiple icosahedrons simultaneously without weird glitches. Another solution we proposed internally was to intersect the polygon with each icosahedron face, perform the point-in-poly on each generated polygon, and then merge the results, but that would require implementing polygon intersection, precise icosahedron polygon generation (maybe we already have that?), and the 2D grid (@isaacbrodsky has already done most of that).

Unfortunately I don't work at Uber anymore, so I only get to spend my free time on this. That's a rough outline of one discussed approach to improving polyfill if someone wants to tackle it (or maybe even provide an even better algorithm? ;) ).

willcohen commented 6 years ago

Thanks, @dfellis. I'll probably need to follow along for a while with the discussion before jumping in. Separately, thanks to everyone for your work on the library. We're finding it quite useful!

isaacbrodsky commented 6 years ago

Glad to hear you're finding it useful! I think the approach of splitting up the polygon if too large for a single pass makes sense as a workaround.

I found the C test case was failing on STACK_ARRAY_CALLOC(int, distances, maxIdx); in kRing, as called by polyfill. Changing that to a heap alloc allowed the test case to pass. Unfortunately, as mentioned in #10, polyfill and kRing aren't really able to communicate allocation failures back to Java right now. We'll need to address that in the core library.
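
For anyone following along, the change is essentially this (the stack line is illustrative of the macro's effect, not its exact definition):

    // Before: distances lives on the stack; a large maxIdx overflows it.
    STACK_ARRAY_CALLOC(int, distances, maxIdx);

    // After: distances lives on the heap; a large maxIdx is fine, but the
    // allocation can fail, and (per #10) that failure can't be reported yet.
    int* distances = calloc(maxIdx, sizeof(int));
    // ... use distances ...
    free(distances);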

dfellis commented 6 years ago

When was that switched from heap to stack and why?

isaachier commented 6 years ago

You can overflow the stack. Didn't realize how large this was.

isaachier commented 6 years ago

@dfellis I prefer stack allocation to heap whenever possible. Yes, some programmers disagree with the practice, but I find stack allocation is potentially faster and much less leak-prone than using malloc and free for local arrays. The important caveat is that it must not overflow the stack, which is what happens here.

dfellis commented 6 years ago

@isaachier That's not a good reason to put array allocations on the stack, in my mind. At some point you'll get strange stack overflows where a simple refactor generates an extra stack frame, and the failure will look nonsensical to most developers, because even "safe" stack allocations shrink the stack left over for function calls.

I can understand stack allocations when dealing with a fixed number of structs or predictably-bounded arrays, but doing that on a user-defined input size is just asking for trouble. Can we revert the stack allocations in kRing and any other functions that allocate based on the max* function outputs?
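
Concretely (MAX_CELL_BNDRY_VERTS and maxKringSize are the real names; the rest is illustrative):

    // Fine on the stack: small, compile-time-constant bound.
    GeoCoord verts[MAX_CELL_BNDRY_VERTS];  // always 10 entries

    // Asking for trouble on the stack: the size derives from caller input.
    // int distances[maxKringSize(k)];     // VLA; overflows for large k
    int* distances = calloc(maxKringSize(k), sizeof(int));  // heap instead
    // ... use distances ...
    free(distances);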

isaachier commented 6 years ago

It is a fair point. I will check where the functions are using user-defined sizes and see what I can do.

isaachier commented 6 years ago

See https://github.com/uber/h3/pull/100.