holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License

dask `2021.8.0` breaks parquet io #84

Closed. iameskild closed this issue 2 years ago.

iameskild commented 3 years ago

Opening issue here for visibility and to track any potential updates.

ALL software version info

(this library, plus any other relevant software, e.g. bokeh, python, notebook, OS, browser, etc)

python 3.8.10
spatialpandas 0.4.3
dask 2021.8.0

Description of expected behavior and the observed behavior

Reading from and writing to parquet produces errors with the latest version of dask, 2021.8.0. I have tested reverting to dask 2021.7.2 and do not experience any of the issues outlined below.

The dask team may already be aware of this issue.

Complete, minimal, self-contained example code that reproduces the issue

import dask
import spatialpandas as spd
from spatialpandas.io import read_parquet_dask

# Placeholder bucket path and credentials; substitute real values.
path = "s3://bucketname/data.parquet"
storage_options = {"key": "<key>", "secret": "<secret>"}

ddf = read_parquet_dask(path, storage_options=storage_options)

Both the code block above and ddf.pack_partitions_to_parquet produce the following error:

Traceback (most recent call last):
  File "/app/datum/readers/quadrant.py", line 106, in sort_quadrant_files
    ddf_packed = ddf.pack_partitions_to_parquet(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/dask.py", line 530, in pack_partitions_to_parquet
    return read_parquet_dask(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/io/parquet.py", line 321, in read_parquet_dask
    result = _perform_read_parquet_dask(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/io/parquet.py", line 421, in _perform_read_parquet_dask
    meta = dd_read_parquet(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 537, in read_metadata
    ) = cls._gather_metadata(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 1035, in _gather_metadata
    ds = pa_ds.parquet_dataset(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/pyarrow/dataset.py", line 457, in parquet_dataset
    factory = ParquetDatasetFactory(
  File "pyarrow/_dataset.pyx", line 2043, in pyarrow._dataset.ParquetDatasetFactory.__init__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Extracting file path from RowGroup failed. The column chunks' file paths should be set, but got an empty file path.
jorisvandenbossche commented 3 years ago

I think at some point dask/pyarrow wrote _metadata files without the proper file paths. The latest release of dask switched to engine="pyarrow-dataset" as the default, which might not be able to read such datasets. Can you check whether passing engine="pyarrow-legacy" fixes it? (If so, we should also see whether we want to add a workaround for this to the "pyarrow-dataset" engine.)
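
A minimal sketch of that suggestion, assuming the dataset is read with dask.dataframe.read_parquet directly (it is not clear that spatialpandas's read_parquet_dask forwards an engine argument); the path and credentials are the placeholders from the repro:

import dask.dataframe as dd

# "pyarrow-legacy" selects the older pyarrow engine, which tolerates
# _metadata footers whose row groups lack file paths; it was the default
# engine before dask 2021.8.0.
ddf = dd.read_parquet(
    "s3://bucketname/data.parquet",
    engine="pyarrow-legacy",
    storage_options={"key": "<key>", "secret": "<secret>"},
)

Note that this returns a plain dask DataFrame and bypasses the spatialpandas-specific metadata handling in read_parquet_dask.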

julioasotodv commented 2 years ago

I'm having the exact same issue with ddf.pack_partitions_to_parquet(). I believe there is no way to pass engine="pyarrow-legacy" through that function. @jorisvandenbossche, in fact, I cannot find that argument value documented anywhere in Dask.
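
If the engine cannot be selected, another possible workaround is to rebuild the broken _metadata footer with pyarrow, setting the file path that the traceback reports as empty. This is a sketch under assumptions: the local dataset directory and part file names are hypothetical, and the first part's schema is taken as the dataset schema.

import os
import pyarrow.parquet as pq

root = "dataset_root"  # hypothetical local copy of the dataset
part_files = ["part.0.parquet", "part.1.parquet"]  # hypothetical part names

# Collect each part file's footer metadata, attaching the relative file
# path that the original _metadata footer left empty.
metadata_collector = []
for name in part_files:
    md = pq.read_metadata(os.path.join(root, name))
    md.set_file_path(name)
    metadata_collector.append(md)

# Write a fresh _metadata file whose row groups carry proper file paths.
schema = pq.read_schema(os.path.join(root, part_files[0]))
pq.write_metadata(
    schema,
    os.path.join(root, "_metadata"),
    metadata_collector=metadata_collector,
)

After this, the "pyarrow-dataset" engine should again be able to extract file paths from the row groups.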

ianthomas23 commented 2 years ago

This is fixed by #92. All tests pass with dask 2021.8.0 (the first version that caused failures) and with the latest version, 2022.7.1.