pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Regression from v0.19 -> v0.20 Can't read parquet file via ssh #16353

Open alexanderpils opened 5 months ago

alexanderpils commented 5 months ago

Checks

Reproducible example


import polars as pl
import fsspec

host = '*******'
username = '******'

df = pl.scan_parquet(f"ssh://{username}@{host}/home/alexander.pils/test.parquet")
df.collect()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
FileNotFoundError: No such file or directory (os error 2): ssh://*******@*******/home/alexander.pils/test.parquet

>>> pl.__version__
'0.20.26'

Log output

No response

Issue description

With Polars 0.19.19 the code snippet above runs without problems, so it was possible to access a parquet file via ssh. Since 0.20.* this is no longer possible. I know that from v0.19 to v0.20 the file handling changed from fsspec to object store, but as far as I understand the documentation, at least for read_parquet, it states that Polars will still fall back to fsspec if the cloud provider is not supported.

Expected behavior

Should work like in v0.19.*

Installed versions

```
>>> pl.show_versions()
--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.5.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 5 months ago

This was never intended to work, but was a side effect of fsspec. We moved away from fsspec as that was a temporary solution.