pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

When working with s3fs (For AWS S3), it still raises "Polars found a filename" warning #18040

Open MacHu-GWU opened 2 months ago

MacHu-GWU commented 2 months ago

Reproducible example

import polars as pl
import s3fs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode='wb') as f:
    df.write_parquet(f)

Log output

/Users/sanhehu/Documents/GitHub/polars_aws-project/polars_aws/s3/_write_parquet.py:56: UserWarning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
    df.write_parquet(f, **kwargs)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

Issue description

I am following the code in https://docs.pola.rs/user-guide/io/cloud-storage/#writing-to-cloud-storage. However, it still raises the "UserWarning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance." warning. I guess it's because the example in the docs doesn't match the best practice that Polars recommends.
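
Until the warning itself is addressed, one possible stopgap is to filter it at the call site with the standard `warnings` module. The sketch below is an illustration only: the helper name `ignore_polars_filename_warning` and the stand-in function `emit_stand_in_warning` are hypothetical, and the filter merely hides the message. It does not change how the file is written.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def ignore_polars_filename_warning():
    """Temporarily silence the 'Polars found a filename' UserWarning (hypothetical helper)."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="Polars found a filename",
            category=UserWarning,
        )
        yield


def emit_stand_in_warning():
    # Stand-in for df.write_parquet(f): emits the warning text from the log above,
    # so the filter can be exercised without polars or S3 access.
    warnings.warn(
        "Polars found a filename. Ensure you pass a path to the file instead of "
        "a python file object when possible for best performance.",
        UserWarning,
    )


with ignore_polars_filename_warning():
    emit_stand_in_warning()  # nothing is reported inside the context
```

With the reproducible example above, the write would then be wrapped as `with fs.open(destination, mode="wb") as f, ignore_polars_filename_warning(): df.write_parquet(f)`.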

Expected behavior

No warning should be raised.

Installed versions

```
--------Version info---------
Polars:               1.4.0
Index type:           UInt32
Platform:             macOS-14.3-arm64-arm-64bit
Python:               3.10.10 (main, Feb 20 2024, 22:22:03) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

deanm0000 commented 2 months ago

I'm not sure whether gcs and s3 open file objects have an fs attribute, but files opened with adlfs do. If the other two do as well, then I think we can just add a check here:

https://github.com/pola-rs/polars/blob/fd00ee6baa562ad24280135c7d9cd23e249dbf45/py-polars/src/file.rs#L267-L271

So, in addition to skipping the warning for BytesIO, also skip the warning if py_f.hasattr('fs').
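
The proposed rule can be sketched in Python to make the decision logic concrete. This is not the Rust code in `file.rs`; the function name `should_warn` and the `FakeRemoteFile` class are hypothetical, and treating only an exact `io.BytesIO` as exempt is an assumption about the existing check.

```python
import io


def should_warn(py_f) -> bool:
    """Hypothetical Python rendering of the proposed warning check."""
    # Existing exemption described above: a plain BytesIO buffer is not warned about.
    # (Assumption: the exemption applies to exact BytesIO instances only.)
    if type(py_f) is io.BytesIO:
        return False
    # Proposed addition: file objects that carry an `fs` attribute
    # (as adlfs files are said to) are also exempt.
    if hasattr(py_f, "fs"):
        return False
    # Any other Python file object would still trigger the warning.
    return True


class FakeRemoteFile(io.BytesIO):
    """Stand-in for a cloud file object that exposes its filesystem as `fs`."""

    def __init__(self, fs):
        super().__init__()
        self.fs = fs


print(should_warn(io.BytesIO()))                    # plain in-memory buffer
print(should_warn(FakeRemoteFile(fs=object())))     # file object with `fs`
print(should_warn(io.BufferedWriter(io.BytesIO()))) # ordinary file-like object
```

Whether s3fs and gcsfs file objects expose `fs` in the same way is the open question raised above, so the stand-in class only models the adlfs case.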