[BUG] `GeoColumnAccessors` should respect slicing

rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms

https://docs.rapids.ai/api/cuspatial/stable/

Apache License 2.0

624 stars 154 forks source link

[BUG] `GeoColumnAccessors` should respect slicing #771

Closed thomcom closed 2 years ago

thomcom commented 2 years ago

I think I solved this, PR coming.

Describe the bug The most visible interface for a GeoSeries respects slicing:

geoseries = cuspatial.from_geopandas(geopandas.read_file(geopandas.get_path("naturalearth_lowres")))
shorter = geoseries[0:3]
print(shorter)
0    MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1    POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2    POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
Name: geometry, dtype: geometry

However, the underlying GeoColumnAccessor that allows us to access the GeoArrow buffers, does not:

shorter.polygons.geometry_offset
0        0
1        3
2        4
3        5
4       35
      ... 
173    284
174    285
175    286
176    287
177    288
Length: 178, dtype: int32

Expected behavior Users should not have to slice their GeoSeries and also slice the underlying buffers. When a sliced GeoSeries has GeoColumnAccessors called, they should return only the coordinates that are part of the GeoSeries.

thomcom commented 2 years ago

I'm puzzling over a substantial problem and am going to start the conversation about it. Most new algorithms depend on the GeoArrow format, that is that the offsets buffer for a range of $n$ features is always $n+1$ offsets. The offsets include the first and "last" position of every feature, including the final one.

The arrow DenseUnion doesn't seem to respect this - that is, the offsets buffer of the DenseUnion is not length $n+1$.

This is important because of slicing. So far we've always tested our algorithms using the "complete" data source, so that all of the points returned by the various buffer calls always respect all the points in the original buffer.

point_in_polygon revealed a flaw in this approach because of the 31 polygon limit - given an original data source like the naturalearth_lowres database, one needs to slice only 31 polygons from this source. Therefore a sliced dataframe should only contain the points that were sliced, so that subsequent accesses to its buffers remain accurate.

I tried to solve this issue by modifying the offset buffer accessors to use the offsets buffer of the union before they return their results. This works well, with one major problem - if I slice two elements out of an offsets buffer, I only get two elements in the offsets buffer, not the $n+1$ tail.

I'd like to keep our design where the underlying points array is not modified, both for memory usage reasons and also to avoid having to sub-slice all the points in order to make a copy. How can we change an offsets buffer slice operation to return element $n+1$ at the tail?

thomcom commented 2 years ago

It seems like I might have to slice the geometry data, and not just the union_offsets and input_types, so that each GeoSeries fully represents itself.

I was able to implement the above discussion easily, where a sliced dataframe only returns the offsets that were sliced, also, but this immediately broke all of the point_linestring_distance tests since they expect 1 extra value in all of the offsets lists.

If I could just do a slice + 1, that'd solve the issue, but I'm not sure how. Slicing can take many forms, particularly list or slice, that are not trivial (or possible?) to identify what $n+1$ is. Maybe I'm just overlooking something simple.

This is a simple example of a failing test that demonstrates what I'm talking about

def test_point_geometry_offset():
    gs = gpd.GeoSeries(
        [
            MultiPoint([Point(0, 1), Point(1, 0)]),
            MultiPoint([Point(0, 2), Point(2, 0)]),
            MultiPoint([Point(0, 3), Point(3, 0)]),
            MultiPoint([Point(0, 4), Point(4, 0)]),
        ]
    )
    cugs = cuspatial.from_geopandas(gs)
    sliced = cugs[0:2]
    assert len(sliced.multipoints.geometry_offset) == 3

The dataframe has been sliced from four multipoints down to two, but sliced.multipoints.geometry_offset is not length 3, it still has all of the offsets in it. The solution is to slice the geometry offsets, but I don't have a trivial way of slicing to n+1. If I slice the geometry offsets, all the algorithms that expect n+1 break.

thomcom commented 2 years ago

I think I figured it out.

thomcom commented 2 years ago

There are potentially a lot of tests for this, I'm working on that now.