pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_csv with chained fsspec TAR file and compression="infer" fails with tarfile.ReadError #60028

Open mxmlnkn opened 1 day ago

mxmlnkn commented 1 day ago

Reproducible Example

import tarfile

with open("test.csv", "wt") as file:
    file.write("1;2\n3;4")
with tarfile.open("test-csv.tar", "w") as archive:
    archive.add("test.csv")

import pandas as pd

# Works: compression inference is disabled explicitly.
data = pd.read_csv("tar://test.csv::file://test-csv.tar", compression=None)
print(data)
# Fails with tarfile.ReadError: compression defaults to "infer".
data = pd.read_csv("tar://test.csv::file://test-csv.tar")

Issue Description

For chained URLs, pandas misidentifies the file being read as a TAR archive when compression="infer" (the default), which leads to this backtrace:

Traceback (most recent call last):
  File "/projects/ratarmount/worktrees/1/trigger-pandas-bug.py", line 12, in <module>
    data = pd.read_csv("tar://test.csv::file://test-csv.tar", compression="infer")  # does not work
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/common.py", line 828, in get_handle
    handle = _BytesTarFile(
             ^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/pandas/io/common.py", line 991, in __init__
    self.buffer: tarfile.TarFile = tarfile.TarFile.open(  # type: ignore[assignment]
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/tarfile.py", line 1842, in open
    raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('truncated header')

I have checked the source code, and the problem seems to be that the full chained URL is checked for a TAR extension. Instead, only the outermost layer of the chain (the first URL component, which names the file actually being read) should be checked, i.e., it should check the extension of tar://test.csv, not of tar://test.csv::file://test-csv.tar.

https://github.com/pandas-dev/pandas/blob/2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d/pandas/io/common.py#L593-L594
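
To illustrate the suggested behavior, here is a minimal sketch of extension-based inference that only looks at the first component of a chained URL. It is not the pandas implementation: the helper name `infer_compression_from_url` and the `EXTENSION_TO_COMPRESSION` table are hypothetical, and the table is only assumed to resemble the mapping pandas uses internally.

```python
from typing import Optional

# Hypothetical extension table for illustration; assumed to resemble
# the mapping pandas uses, but not copied from it.
EXTENSION_TO_COMPRESSION = {
    ".tar": "tar",
    ".tar.gz": "tar",
    ".tar.bz2": "tar",
    ".tar.xz": "tar",
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def infer_compression_from_url(url: str) -> Optional[str]:
    """Infer compression from the first component of an fsspec chained URL.

    Hypothetical helper: in "tar://test.csv::file://test-csv.tar", the
    component before the first "::" names the file that is actually read,
    so only that component is checked for a known extension.
    """
    target = url.split("::", 1)[0].lower()
    for extension, compression in EXTENSION_TO_COMPRESSION.items():
        if target.endswith(extension):
            return compression
    return None


print(infer_compression_from_url("tar://test.csv::file://test-csv.tar"))
print(infer_compression_from_url("test-csv.tar"))
```

With this approach the chained URL from the reproducer is treated as an uncompressed CSV, while a plain path ending in .tar is still detected as a TAR archive.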

Expected Behavior

It should work without an error.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d
python : 3.12.3
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1579.g2a10e04a0
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 24.2
Cython : 3.0.11
sphinx : None
IPython : 8.20.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
fastparquet : None
fsspec : 2024.9.0
html5lib : 1.1
hypothesis : None
gcsfs : 2024.9.0post1
jinja2 : 3.1.4
lxml.etree : 5.2.1
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 15.0.2
pyreadstat : None
pytest : 8.3.3
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : 2024.9.0
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 1 day ago

Thanks for the report. I'm not familiar with these chained URLs. Is there a formal specification for them? Some searching did not turn up anything for me.

mxmlnkn commented 1 day ago

Fsspec documents it here: https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining

The Pandas documentation mentions it at the end of this section: https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files