holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License

ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API #109

Closed rderollepot closed 1 year ago

rderollepot commented 1 year ago

Hey guys,

I'm getting this ValueError, which I suppose https://github.com/holoviz/spatialpandas/pull/92 was intended to fix, but there is still a validate_schema=False in the following snippet of code that should probably have been removed: https://github.com/holoviz/spatialpandas/blob/a858a289c0d8817480c695d345146c6076c9a1bd/spatialpandas/io/parquet.py#L174-L181

Replacing:

validate_schema=False,

with:

#validate_schema=False,
use_legacy_dataset=False,

as was done in https://github.com/holoviz/spatialpandas/pull/92, did the trick for me.
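
For illustration, here is a minimal sketch of the swap described above. It assumes the call site collects its keyword arguments in a dict before passing them to pq.ParquetDataset; `dataset_kwargs` and its `filesystem` and `use_new_api` parameters are hypothetical names, not part of spatialpandas or pyarrow.

```python
def dataset_kwargs(filesystem=None, use_new_api=True):
    """Build keyword arguments for pq.ParquetDataset (hypothetical helper)."""
    kwargs = {"filesystem": filesystem}
    if use_new_api:
        # The new Dataset API raises ValueError when validate_schema is
        # passed, so it is left out and use_legacy_dataset=False is sent
        # instead, as in the workaround above.
        kwargs["use_legacy_dataset"] = False
    else:
        # Legacy code path: keep the keyword the current snippet passes.
        kwargs["validate_schema"] = False
    return kwargs
```

The result would then be unpacked at the call site, e.g. `pq.ParquetDataset(path, **dataset_kwargs())`. Whether use_legacy_dataset is still accepted depends on the installed pyarrow release, so treat this as a sketch for the versions listed below rather than a general fix.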

ALL software version info

macOS
Python 3.10
geopandas 0.12.0
spatialpandas 0.4.6
pyarrow 11.0.0
dask 2023.2.0

Complete, minimal, self-contained example code that reproduces the issue

Here is my case:

1) Convert a geopandas GeoDataFrame into a spatialpandas one
2) Convert it into a DaskGeoDataFrame
3) Write it to disk with to_parquet_dask()
4) Read it with read_parquet_dask()
5) Any call on the DaskGeoDataFrame that ends up calling compute() triggers the bug
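
The steps above could look roughly like the sketch below. The data, the file name `edges.parq` and the function name `reproduce` are made up for illustration; the real project code is not shown in this report. It needs geopandas, shapely, dask and spatialpandas installed, and whether the error actually appears depends on the pyarrow and spatialpandas versions in use.

```python
import os


def reproduce(workdir):
    """Walk through steps 1-5 with a tiny made-up dataset."""
    # Imported here so the sketch can be loaded without the optional packages.
    import dask.dataframe as dd
    import geopandas
    import spatialpandas
    from shapely.geometry import LineString
    from spatialpandas.io import read_parquet_dask, to_parquet_dask

    # 1) Build a small geopandas GeoDataFrame and convert it to spatialpandas.
    gdf = geopandas.GeoDataFrame(
        {"highway": ["residential", "primary"]},
        geometry=[LineString([(0, 0), (1, 1)]), LineString([(1, 1), (2, 0)])],
    )
    sdf = spatialpandas.GeoDataFrame(gdf)

    # 2) Turn it into a Dask-backed frame.
    ddf = dd.from_pandas(sdf, npartitions=1)

    # 3) Write it to disk, then 4) read it back lazily.
    path = os.path.join(workdir, "edges.parq")
    to_parquet_dask(ddf, path)
    ddf_read = read_parquet_dask(path)

    # 5) Force computation; with the versions listed above this is where
    #    the ValueError is reported.
    return ddf_read.compute()
```

Calling `reproduce(tempfile.mkdtemp())` in an environment matching the versions listed above should be enough to exercise the same read path as in the traceback below.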

Stack traceback and/or browser JavaScript console output

Traceback (most recent call last):
  File "MyProj/romain/dashboard/app.py", line 31, in <module>
    edges_ddf["highway"] = edges_ddf["highway"].cat.as_known()
  File "MyProj/testenv/lib/python3.10/site-packages/dask/dataframe/categorical.py", line 218, in as_known
    categories = self._property_map("categories").unique().compute(**kwargs)
  File "MyProj/testenv/lib/python3.10/site-packages/dask/base.py", line 314, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "MyProj/testenv/lib/python3.10/site-packages/dask/base.py", line 599, in compute
    results = schedule(dsk, keys, **kwargs)
  File "MyProj/testenv/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "MyProj/testenv/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "MyProj/testenv/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "MyProj/testenv/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "MyProj/testenv/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "MyProj/testenv/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "MyProj/testenv/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "MyProj/testenv/lib/python3.10/site-packages/dask/utils.py", line 72, in apply
    return func(*args, **kwargs)
  File "MyProj/testenv/lib/python3.10/site-packages/spatialpandas/io/parquet.py", line 175, in read_parquet
    df = pq.ParquetDataset(
  File "MyProj/testenv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1763, in __new__
    return _ParquetDatasetV2(
  File "MyProj/testenv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2394, in __init__
    raise ValueError(
ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API

ianthomas23 commented 1 year ago

@rderollepot Thanks for reporting this. Tests pass fine with the previous release of pyarrow (10.0.1) but fail with the latest release (11.0.0). I will investigate further.