pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: fsspec URL chaining support is not working #59684

Open Nitro963 opened 2 months ago

Nitro963 commented 2 months ago

Reproducible Example

import fsspec
import pandas as pd
path = ('filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz'
        '/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv')

storage_options = {'s3': {'anon': True}, 'filecache': {'cache_storage': 'TMP', 'expiry_time': 259200}}

pd.read_csv(path, storage_options=storage_options) # <-- this produces the error

with fsspec.open(path,
                 **storage_options) as f:
    df = pd.read_csv(f) # <-- this will work properly

Issue Description

The code snippet above fails with a ValueError, even though the provided URL is a valid fsspec file path.

Traceback (most recent call last):
  File "/server_code/playground.py", line 31, in <module>
    pd.read_csv(path, storage_options=storage_options)
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/common.py", line 718, in get_handle
    ioargs = _get_filepath_or_buffer(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/common.py", line 441, in _get_filepath_or_buffer
    raise ValueError(
ValueError: storage_options passed with file object or non-fsspec file path

Expected Behavior

pd.read_* should work out of the box with the provided storage_options, since they contain valid keyword arguments for fsspec.open.

Installed Versions

INSTALLED VERSIONS
------------------
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : 3.0.11
pytest : 8.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.7.5
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2024.6.1
scipy : 1.11.4
sqlalchemy : 2.0.32
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 2 months ago

Thanks for the report! pandas checks the supplied path against the URI syntax specified in RFC 3986 and concludes it is not a URI, because of the :: in the scheme section. See https://datatracker.ietf.org/doc/html/rfc3986#section-3.1 for details.
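
To illustrate the point, here is a minimal sketch of a scheme check in the spirit of RFC 3986 section 3.1. The regular expression and the name looks_like_uri are assumptions made for this example; they approximate the kind of test described above and are not copied from the pandas source.

```python
import re

# Assumption: an RFC 3986 style check (a letter, then letters, digits,
# "+", "-" or ".", followed directly by "://"). Not the pandas source.
SCHEME_URL = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def looks_like_uri(path: str) -> bool:
    """Return True if path starts with '<scheme>://'."""
    return bool(SCHEME_URL.match(path))


plain = "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz"
chained = "filecache::" + plain

print(looks_like_uri(plain))    # True: "s3" is followed directly by "://"
print(looks_like_uri(chained))  # False: "filecache" is followed by "::"
```

Under this kind of check, the chained form is rejected before fsspec ever sees it, which would be consistent with the ValueError in the traceback.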

Possible options are to leave the logic as-is or to just assume a URI when a user passes storage_options. It's not clear to me what the best path forward is; further thoughts and investigations are welcome!
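
One further direction, offered only as a sketch for discussion, would be to make the check aware of chaining by splitting on :: and validating each segment. The helper looks_like_chained_url and both patterns are hypothetical names for this example, and the sketch assumes the last segment of a chain is always a full <scheme>:// URL, which may not hold for every fsspec chain.

```python
import re

# Hypothetical patterns for this sketch; neither is taken from pandas or fsspec.
SCHEME_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*$")   # e.g. "filecache"
SCHEME_URL = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")  # e.g. "s3://bucket/key"


def looks_like_chained_url(path: str) -> bool:
    """Accept plain '<scheme>://' URLs and '::'-chained variants of them."""
    *wrappers, target = path.split("::")
    # Assumption: the final segment is the URL that is actually fetched.
    if not SCHEME_URL.match(target):
        return False
    # Earlier segments may be bare protocol names or URLs of their own.
    return all(SCHEME_NAME.match(w) or SCHEME_URL.match(w) for w in wrappers)


print(looks_like_chained_url("filecache::s3://bucket/key.csv"))
print(looks_like_chained_url("s3://bucket/key.csv"))
print(looks_like_chained_url("local/file.csv"))
```

Compared with assuming a URI whenever storage_options is passed, this keeps the existing error for local paths while letting chained URLs through; whether that trade-off is right is exactly the open question.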