holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License
308 stars 24 forks source link

DaskDataFrame to DaskGeoDataFrame #65

Closed sigint8 closed 3 years ago

sigint8 commented 3 years ago

Is your feature request related to a problem? Please describe.

Cross posting from discourse. Not sure if spatialpandas members watch that https://discourse.holoviz.org/t/spatialpandas-daskdataframe-to-daskgeodataframe/706

In spatialpandas, dd.from_pandas() automatically creates a DaskGeoDataFrame from a GeoDataFrame. I have a DaskDataFrame with point data as coordinates and would like to create a DaskGeoDataFrame without having to load the data in memory? If this is not yet possible, can you think of a good way to implement this functionality? Thanks!

Describe the solution you'd like

A utility method like

from_dask_dataframe(existing_dask_dataframe, geometry=geom_col_name)

Describe alternatives you've considered

Nothing that has worked.

So far I'm able to map partitions and create GeoSeries

import spatialpandas as spd
def to_gs(df):
    points = spd.geometry.PointArray((df['x'], df['y']))
    spgs = spd.GeoSeries(points, index=df.index)
    return spgs

ddf = dd.read_parquet(bigfiles)
spdgeom = ddf.map_partitions(to_gs)
gsddf = ddf.assign(geometry=spdgeom)
gsddf.dtypes

-------------------
id            string
x             float64
y             float64
geometry       point[float64]

Seems like a step away from DaskGeoDataFrame, but can't find a way. Any pointers are appreciated.

Additional context

jbednar commented 3 years ago

See load_data in https://examples.pyviz.org/ship_traffic/ship_traffic.html ; does that help?

brl0 commented 3 years ago

This might help:

def to_gs(df):
    points = spd.geometry.PointArray((df['x'], df['y']))
    spgs = spd.GeoSeries(points, index=df.index)
    return GeoDataFrame(dict(position=spgs, **{col: df[col] for col in df.columns}))

ddf = dd.read_parquet(bigfiles)
spdgeom = ddf.map_partitions(to_gs)

Spatialpandas registers the DaskGeoDataFrame as the parallel type for GeoDataFrame, so operations like map_partitions or even dd.from_pandas can be used to create a DaskGeoDataFrame.

sigint8 commented 3 years ago

That did help @jbednar and @brl0 . Thank you!

I had unsuccessfully tried creating the GeoDataFrame directly. Specifying the meta was the tricky bit, which your example helped with. I always find myself tripping up on the meta.

Again, thanks for the pointer. Closing this issue.