holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License
308 stars 24 forks source link

Allow using cx indexer without spatial index #54

Closed philippjfr closed 3 years ago

philippjfr commented 3 years ago

The problem with using cx on a DaskGeoDataFrame was that it would use cx on each partition which in turn would spatially index that partition because it was using the sindex attribute. It seems as though these spatial indexing wouldn't persist either so every time you did some indexing it would create a completely new sindex slowing down the indexing massively. I now only use the sindex if it already exists. I would like there to be a more explicit API on GeometryArrays, GeoSeries, GeoDataFrames, DaskGeoSeries and DaskGeoDataFrame that says "create an spatial index" rather than using the implicit behavior of creating a spatial index when sindex is accessed.

jlstevens commented 3 years ago

I would like there to be a more explicit API on GeometryArrays, GeoSeries, GeoDataFrames, DaskGeoSeries and DaskGeoDataFrame that says "create an spatial index" rather than using the implicit behavior of creating a spatial index when sindex is accessed.

I agree with this sentiment.

philippjfr commented 3 years ago

So I'm guessing that the fact that the implicitly created _sindex doesn't persist on the partitions is expected behavior for dask. I'm guessing dask makes (shallow) copies of the partition dataframes internally precisely to avoid you from performing any inplace modifications of the internal state.

philippjfr commented 3 years ago

Added explicit build_sindex methods to all relevant types.

jonmmease commented 3 years ago

Taking a look... I'm pretty sure that the partition-level spatial indexes used to persist, but it also makes sense that this isn't something Dask would want a library to rely on :slightly_smiling_face:

jonmmease commented 3 years ago

These look like reasonable updates to me. And overall, it might just be easier to remove all implicit spatial index initialization, and just have methods that need one raise an error if it hasn't been initialized yet.

Added explicit build_sindex methods to all relevant types.

Did you push these changes? I didn't see them in the diff just now.

For DaskGeoDataFrame, I think we'd also want the option to initialize the partition-level spatial indexex with an option to the read_parquet_dask method.

philippjfr commented 3 years ago

Sorry pushed now.

jonmmease commented 3 years ago

The build_sindex methods look good. Have you been able to confirm that spatial index in the resulting Dask data structures persists? It's not obvious to me why the sindex would persist in the return value, but not the input value since these both reference the same partition-level GeoDataFrames.

philippjfr commented 3 years ago

The build_sindex methods look good. Have you been able to confirm that spatial index in the resulting Dask data structures persists?

I have yes. It's also very unclear to me but here's the test cases I'm using (tested before and after this PR respectively):

df.cx[-19986136.0: -18986136.0, 445.5396728515625:10000].compute()

p = df.cx_partitions[-19986136.0:-18986136.0, 445.5396728515625:10000]

def report_sindex(df):
    print(df.geometry.array._sindex)
    return df

p.map_partitions(report_sindex, meta=p._meta).persist()
None
df = df.build_sindex().persist()

def report_sindex(df):
    print(df.geometry.array._sindex)
    return df

df.map_partitions(report_sindex, meta=df._meta).persist()
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12ca72f40>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d154a90>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d29d1c0>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d29d760>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x130bdd3a0>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d35d880>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d1f1070>
jonmmease commented 3 years ago

Ok, looks good!

philippjfr commented 3 years ago

Any opinion whether I should release this as spatialpandas 0.3.7 or 0.4.0?

jonmmease commented 3 years ago

Any opinion whether I should release this as spatialpandas 0.3.7 or 0.4.0?

Not a strong one. I'd lean toward 0.4.0 since it fixes a bug by adding to the API.