holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License

DaskGeoDataFrame parquet write error - Series object has no attribute total_bounds #138

Open 4andy opened 7 months ago

4andy commented 7 months ago

Hi - I'm running into an error when trying to write a DaskGeoDataFrame. I'm following the basic pattern here (see also) but using a smaller sample of a point dataset. Everything runs as expected until I try to write out the packed file, at which point I encounter the error below.

ALL software version info

pyarrow=15.0.0 spatialpandas=0.4.10 pandas=2.1.1 dask=2024.2.0 python=3.9.16

df = df.pack_partitions(npartitions=df.npartitions, shuffle='disk')
df.to_parquet(save_path)

[images: screenshots of the error]

4andy commented 7 months ago

I was able to get a small file written without error but I still encounter the error with a large dataset.

I re-ran on a different system with pandas 2.2.1 and again with pandas 1.5.3 and encountered the error each time. Any ideas are appreciated. Here is a more complete stack trace: [image]

4andy commented 7 months ago

If there is only one DataFrame partition, saving works fine. If there is more than one partition, this error is returned.

hoxbro commented 7 months ago

I would guess that this was implemented with fastparquet, which has now been dropped by Dask. Can you try downgrading the Dask version to something like 2020 and see if that will work with/without fastparquet?

4andy commented 7 months ago

Thanks for that idea @Hoxbro. I downgraded dask to 2020 but it returns the same error.

While looking into the issue, I found that any call to df.geometry.total_bounds after df.pack_partitions() raises the error. However, you can read the total_bounds property any number of times before packing partitions and it returns correctly.
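For anyone who wants to check the same before/after behaviour on their own data, here is a minimal sketch of a probe. The helper name `probe_total_bounds` is hypothetical (it is not part of spatialpandas); it only assumes the object it is given exposes `.geometry.total_bounds` and `.pack_partitions(...)`, as the DaskGeoDataFrame in this thread does.

```python
def probe_total_bounds(ddf, **pack_kwargs):
    """Read ddf.geometry.total_bounds before and after pack_partitions.

    Hypothetical diagnostic helper, not a spatialpandas API.
    Returns (before, after); each is ("ok", value) or ("error", message).
    """
    def read(frame):
        try:
            return ("ok", frame.geometry.total_bounds)
        except Exception as exc:  # report the failure instead of raising
            return ("error", f"{type(exc).__name__}: {exc}")

    before = read(ddf)
    # pack_kwargs are passed through unchanged, e.g. npartitions=..., shuffle='disk'
    packed = ddf.pack_partitions(**pack_kwargs)
    after = read(packed)
    return before, after
```

Called as `probe_total_bounds(df, npartitions=df.npartitions, shuffle='disk')`, this would be expected to report "ok" before packing and an error afterwards if the behaviour described above is present.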

hoxbro commented 7 months ago

Did you try to set the parquet backend to fastparquet?

4andy commented 7 months ago

I did try fastparquet (same error). However, I don't think it's related to that or to saving directly. Something happens in pack_partitions that causes any future calls to the geometry.total_bounds property to fail. It fails at save time because to_parquet makes calls to that property.

4andy commented 7 months ago

I found a trigger condition for the error: it occurs when one or more longitudes are negative. I attached a simple notebook that reproduces the error. If you change the negative longitude to positive, the error is resolved. I'm not sure where to look in the code to patch this. Thanks! sp_error_example.ipynb.txt
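For readers without the attachment, a rough sketch of what such a reproduction might look like follows. This is not the contents of the attached notebook: the coordinate values and the function names `sample_points` and `try_pack_and_write` are made up for illustration, and it assumes `PointArray` accepts a tuple of x/y arrays and that `dd.from_pandas` yields a DaskGeoDataFrame for a spatialpandas GeoDataFrame. It has not been verified to trigger the error on any particular version.

```python
def sample_points(negative_longitude=True):
    """Return (longitudes, latitudes) for a tiny made-up point dataset."""
    lons = [10.0, 20.0, 30.0, 40.0]
    lats = [5.0, 15.0, 25.0, 35.0]
    if negative_longitude:
        lons[0] = -10.0  # the reported trigger: at least one negative longitude
    return lons, lats


def try_pack_and_write(lons, lats, save_path, npartitions=2):
    """Build a multi-partition frame, pack it, and write it, as in the report."""
    # Imported lazily so the helper above works without these packages installed.
    import numpy as np
    import dask.dataframe as dd
    from spatialpandas import GeoDataFrame
    from spatialpandas.geometry import PointArray

    x = np.asarray(lons, dtype="float64")
    y = np.asarray(lats, dtype="float64")
    gdf = GeoDataFrame({"geometry": PointArray((x, y))})

    # More than one partition, since a single partition reportedly saves fine.
    ddf = dd.from_pandas(gdf, npartitions=npartitions)
    ddf = ddf.pack_partitions(npartitions=ddf.npartitions, shuffle="disk")
    ddf.to_parquet(save_path)
```

Running `try_pack_and_write(*sample_points(True), save_path)` and then again with `sample_points(False)` and a fresh output path would show whether the sign of the longitude changes the outcome.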