pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_parquet` doesn't recognize OSS URL scheme #16737

Open ZhenjieRuan opened 4 months ago

ZhenjieRuan commented 4 months ago

Reproducible example

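import json
from pathlib import Path

import polars as pl

# OSS_ENDPOINT is the OSS endpoint URL; its definition is not shown in this report.
url = "oss://any-oss-bucket"
aliyun_config = json.loads(Path("~/.aliyun/config.json").expanduser().read_text())
profile = aliyun_config["profiles"][0]
storage_options = {
    "key": profile["access_key_id"],
    "secret": profile["access_key_secret"],
    "token": profile["sts_token"],
    "endpoint": OSS_ENDPOINT,
}
# Pass the `url` variable, not the string literal "url"
pl.read_parquet(url, storage_options=storage_options)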

Log output

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/_utils/deprecation.py in wrapper(*args, **kwargs)
    132                 old_name, new_name, kwargs, function.__name__, version
    133             )
--> 134             return function(*args, **kwargs)
    135
    136         wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined]

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/_utils/deprecation.py in wrapper(*args, **kwargs)
    132                 old_name, new_name, kwargs, function.__name__, version
    133             )
--> 134             return function(*args, **kwargs)
    135
    136         wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined]

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/io/parquet/functions.py in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    178
    179     # For other inputs, defer to `scan_parquet`
--> 180     lf = scan_parquet(
    181         source,  # type: ignore[arg-type]
    182         n_rows=n_rows,

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/_utils/deprecation.py in wrapper(*args, **kwargs)
    132                 old_name, new_name, kwargs, function.__name__, version
    133             )
--> 134             return function(*args, **kwargs)
    135
    136         wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined]

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/_utils/deprecation.py in wrapper(*args, **kwargs)
    132                 old_name, new_name, kwargs, function.__name__, version
    133             )
--> 134             return function(*args, **kwargs)
    135
    136         wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined]

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/io/parquet/functions.py in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, rechunk, low_memory, cache, storage_options, retries)
    392         source = [normalize_filepath(source) for source in source]
    393
--> 394     return _scan_parquet_impl(
    395         source,
    396         n_rows=n_rows,

~/conda_dev/devenv/Linux/envs/devenv-3.8-c/lib/python3.8/site-packages/polars/io/parquet/functions.py in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, glob, hive_schema, retries)
    439         storage_options = None
    440
--> 441     pylf = PyLazyFrame.new_from_parquet(
    442         source,
    443         sources,

ComputeError: unknown url scheme

Issue description

`read_parquet` does not recognize the `oss://` URL scheme. The same URL works with `read_csv`, and the URL plus storage options works with `fsspec.open()`. According to the documentation, if the cloud provider is not supported by Polars, the storage options are passed to `fsspec.open()`. That is clearly not what happens here. The stack trace shows that `read_parquet` delegates to `scan_parquet`, and the documentation suggests that `scan_parquet` does not forward unsupported URLs to `fsspec.open()`, which may be the cause.
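Until the fallback described above works for Parquet, one possible workaround is to open the file with fsspec yourself and hand the file object to `read_parquet`. The sketch below is illustrative only: `NATIVE_SCHEMES`, `needs_fsspec`, and `read_parquet_with_fallback` are hypothetical names, and the list of schemes Polars handles natively is an assumption that should be checked against the installed version.

```python
from urllib.parse import urlparse

# Assumption: URL schemes that Polars' native cloud reader understands.
# This set is illustrative and not taken from the Polars source.
NATIVE_SCHEMES = {
    "", "file", "s3", "s3a", "gs", "gcs",
    "az", "azure", "abfs", "abfss", "adl", "http", "https",
}


def needs_fsspec(url: str) -> bool:
    """Return True when the URL scheme is not in the assumed native set."""
    return urlparse(url).scheme.lower() not in NATIVE_SCHEMES


def read_parquet_with_fallback(url: str, storage_options=None):
    """Hypothetical helper: read natively when possible, otherwise go through fsspec."""
    # Imported lazily so the scheme check above works without these packages.
    import fsspec
    import polars as pl

    if not needs_fsspec(url):
        return pl.read_parquet(url, storage_options=storage_options)

    # fsspec resolves the scheme (e.g. oss:// needs a matching fsspec backend
    # installed) and yields a binary file object that Polars can read.
    with fsspec.open(url, mode="rb", **(storage_options or {})) as f:
        return pl.read_parquet(f)
```

With the reproducible example above, this would be called as `read_parquet_with_fallback(url, storage_options)`. Note that this path reads eagerly through a Python file object, so it does not get the lazy scanning that `scan_parquet` provides.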

Expected behavior

`read_parquet` should be able to read OSS files directly.

Installed versions

```
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-5.4.0-170-generic-x86_64-with-glibc2.10
Python:               3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.2.1
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.6.0
nest_asyncio:         1.5.6
numpy:                1.23.5
openpyxl:             2.6.3
pandas:               1.2.3
pyarrow:              8.0.0
pydantic:             1.8.2
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.5.post1
torch:
xlsx2csv:
xlsxwriter:
```

calpaterson commented 4 months ago

I've just run into what I think is the same bug with my fsspec adapter for csvbase.

Using fsspec for Parquet doesn't work in Polars:


import polars as pl

df = pl.read_parquet("<some fsspec url>") # raises FileNotFoundError

However, it does work in pandas:


import pandas as pd

df = pd.read_parquet("<some fsspec url>") # works ok

It works in DuckDB as well, so it would be great to get support for this in Polars.