I'm puzzling over a substantial problem and am going to start the conversation about it. Most new algorithms depend on a key property of the GeoArrow format: the offsets buffer for a range of $n$ features always contains $n+1$ offsets. The offsets mark the first and one-past-the-last position of every feature, including the final one.
Arrow's DenseUnion doesn't seem to respect this: the offsets buffer of the DenseUnion is not length $n+1$, as the pyarrow sketch below illustrates.
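For illustration, a minimal pyarrow sketch of the two layouts; this is an analogy for the buffers involved, not cuspatial's own code:

```python
import pyarrow as pa

# A GeoArrow-style list array: 3 features are described by 4 offsets.
lists = pa.array([[0.0, 1.0], [2.0], [3.0, 4.0]])
print(lists.offsets)  # [0, 2, 3, 5] -> length n + 1

# A dense union is built with exactly one value offset per element,
# so its offsets buffer is length n, with no trailing n+1 entry.
types = pa.array([0, 0, 1], type=pa.int8())
value_offsets = pa.array([0, 1, 0], type=pa.int32())
children = [pa.array([10.0, 11.0]), pa.array([20.0])]
union = pa.UnionArray.from_dense(types, value_offsets, children)
print(len(union))  # 3 elements described by exactly 3 offsets
```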
This is important because of slicing. So far we've always tested our algorithms against the "complete" data source, so that the points returned by the various buffer calls always correspond to all of the points in the original buffer.
point_in_polygon revealed a flaw in this approach because of its 31-polygon limit: given an original data source like the naturalearth_lowres dataset, one can only slice 31 polygons at a time out of the source (see the sketch below). A sliced dataframe should therefore contain only the points that were sliced, so that subsequent accesses to its buffers remain accurate.
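For context, a minimal sketch of the workflow that runs into the limit, assuming the naturalearth_lowres sample dataset that ships with GeoPandas; the batch size and variable names are illustrative:

```python
import geopandas as gpd
import cuspatial

# Load a "complete" data source containing many country polygons.
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
cugs = cuspatial.from_geopandas(world.geometry)

# point_in_polygon accepts at most 31 polygons per call, so callers
# slice the GeoSeries; the sliced buffers must stay self-consistent.
first_batch = cugs[0:31]
```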
I tried to solve this issue by modifying the offset buffer accessors to use the offsets buffer of the union before they return their results. This works well, with one major problem: if I slice two elements out of an offsets buffer, I get only two elements back, not the $n+1$ tail.
I'd like to keep our design where the underlying points array is not modified, both for memory usage reasons and to avoid having to sub-slice all the points in order to make a copy. How can we change an offsets buffer slice operation to return element $n+1$ at the tail?
It seems like I might have to slice the geometry data, and not just the union_offsets and input_types, so that each GeoSeries fully represents itself.
I was able to implement the above easily, so that a sliced dataframe also returns only the offsets that were sliced, but this immediately broke all of the point_linestring_distance tests, since they expect one extra value in every offsets list.
If I could just do a slice + 1, that would solve the issue, but I'm not sure how. Slicing can take many forms, particularly a list of indices or a slice object, for which it is not trivial (or even possible?) to identify what $n+1$ is. Maybe I'm just overlooking something simple. A sketch of what "slice + 1" could look like for the contiguous case follows.
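To make the "slice + 1" idea concrete, here is a minimal sketch for the contiguous case only; slice_offsets is a hypothetical helper, and a list of arbitrary indices really would have no well-defined $n+1$ tail:

```python
import cupy as cp

def slice_offsets(offsets: cp.ndarray, key: slice) -> cp.ndarray:
    """Slice an (n+1)-length GeoArrow offsets buffer so that the result
    keeps its own n+1 tail. Only step-1 slices are well defined."""
    start, stop, step = key.indices(len(offsets) - 1)  # n features
    if step != 1:
        raise ValueError("offsets can only be sliced contiguously")
    # Take one extra element so the last feature keeps its end offset.
    return offsets[start : stop + 1]

offsets = cp.asarray([0, 2, 4, 6, 8])        # 4 features -> 5 offsets
print(slice_offsets(offsets, slice(0, 2)))   # [0 2 4]: 2 features -> 3 offsets
```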
This is a simple example of a failing test that demonstrates what I'm talking about:
```python
import geopandas as gpd
import cuspatial
from shapely.geometry import MultiPoint, Point


def test_point_geometry_offset():
    gs = gpd.GeoSeries(
        [
            MultiPoint([Point(0, 1), Point(1, 0)]),
            MultiPoint([Point(0, 2), Point(2, 0)]),
            MultiPoint([Point(0, 3), Point(3, 0)]),
            MultiPoint([Point(0, 4), Point(4, 0)]),
        ]
    )
    cugs = cuspatial.from_geopandas(gs)
    sliced = cugs[0:2]
    # 2 multipoints sliced out should yield 2 + 1 = 3 geometry offsets.
    assert len(sliced.multipoints.geometry_offset) == 3
```
The dataframe has been sliced from four multipoints down to two, but `sliced.multipoints.geometry_offset` is not length 3: it still contains all of the original offsets. The solution is to slice the geometry offsets, but I don't have a trivial way of slicing to $n+1$, and if I slice the geometry offsets naively, all the algorithms that expect $n+1$ values break. One way out is sketched below.
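One way to reconcile the two requirements, keeping the underlying points array unmodified while still handing back a self-consistent $n+1$ offsets buffer, is to window and rebase the offsets instead of copying coordinates. This is a hedged sketch of the idea, not the actual cuspatial implementation; rebased_offsets is a hypothetical helper:

```python
import cupy as cp

def rebased_offsets(offsets: cp.ndarray, start: int, stop: int) -> cp.ndarray:
    """Return the n+1 offsets for features [start, stop), rebased so they
    index a zero-based view of the coordinate buffer rather than a copy."""
    window = offsets[start : stop + 1]
    # Subtracting the first offset makes the slice self-describing.
    return window - window[0]

offsets = cp.asarray([0, 2, 4, 6, 8])   # 4 multipoints with 2 points each
print(rebased_offsets(offsets, 1, 3))   # [0 2 4]
```

The matching coordinate view would span positions offsets[start] through offsets[stop] of the original points buffer, so no point data is duplicated.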
I think I figured it out.
There are potentially a lot of tests for this, I'm working on that now.
I think I solved this, PR coming.
**Describe the bug**
The most visible interface for a `GeoSeries` respects slicing. However, the underlying `GeoColumnAccessor` that allows us to access the GeoArrow buffers does not (see the sketch below).

**Expected behavior**
Users should not have to slice their `GeoSeries` and also slice the underlying buffers. When a sliced `GeoSeries` has its `GeoColumnAccessor` methods called, they should return only the coordinates that are part of the `GeoSeries`.
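A hedged sketch of the reported behavior, assuming a points accessor that exposes an interleaved coordinate buffer in the same spirit as the `multipoints.geometry_offset` accessor used earlier in this thread; the exact attribute names are illustrative:

```python
import geopandas as gpd
import cuspatial
from shapely.geometry import Point

gs = gpd.GeoSeries([Point(0, 0), Point(1, 1), Point(2, 2)])
cugs = cuspatial.from_geopandas(gs)

sliced = cugs[0:2]
print(len(sliced))            # 2: slicing works at the GeoSeries level
print(len(sliced.points.xy))  # expected 4 (two x/y pairs), but the
                              # GeoColumnAccessor still returns all 6 coords
```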