Fix read parquet from remote filesystems

holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations

BSD 2-Clause "Simplified" License

305 stars 24 forks source link

Fix read parquet from remote filesystems #115

Closed brl0 closed 1 year ago

brl0 commented 1 year ago

I noticed some of the path handling for parquet files would munge paths for remote filesystems and confirmed after testing with s3. This PR fixes that issue.

brl0 commented 1 year ago

Added tests which fail on main with these errors:

FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_glob_parquet - pyarrow.lib.ArrowInvalid: Missing bucket name in S3 URI
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_glob_all - pyarrow.lib.ArrowInvalid: Missing bucket name in S3 URI
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_dir - pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'test_bucket/test_dask/part.1.parquet', which is outside base dir 's3://test_bucket/test_dask'
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_dir_slash - pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'test_bucket/test_dask/part.1.parquet', which is outside base dir 's3://test_bucket/test_dask/'

brl0 commented 1 year ago

I believe this PR should be ready for review. Let me know if there are any questions or concerns, any feedback is always appreciated.

brl0 commented 1 year ago

@ianthomas23, thanks for reviewing. I have made the requested changes.

Currently, one of the tests has a failure due to a 503 from GitHub, but that can be rerun once the other tests are complete.

If compatibility with those versions of pyarrow is expected, it seems like it would be worth parameterizing just the parquet tests over the supported versions, but given the run time of the full test suite, it does seem unnecessary to do so with the full test suite.

ianthomas23 commented 1 year ago

If compatibility with those versions of pyarrow is expected, it seems like it would be worth parameterizing just the parquet tests over the supported versions, but given the run time of the full test suite, it does seem unnecessary to do so with the full test suite.

I agree, but I don't want to make any changes at the moment. There has been a long tail of support for old python and pyarrow versions, and we are starting to bring that back to something sensible, e.g. dropping python 3.7 now and 3.8 will not be too far behind, and there will be corresponding dropping of pyarrow version support. We are also working on an overhaul of the project build and CI scripts on all holoviz repositories, and I expect we will end up with a "minimal versions" CI run to test that CI passes using all of the important minimum versions of dependencies.