pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.4k stars 17.83k forks source link

BUG: Inconsistent stream processing between read_excel and read_csv #56959

Open jf2 opened 8 months ago

jf2 commented 8 months ago

Pandas version checks

Reproducible Example

import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": ["a", "b", "c"]
})
df.to_excel("test.xlsx", index=False)
df.to_csv("test.csv", index=False)

with open("test.xlsx", "rb") as f:
    df_excel_1 = pd.read_excel(f)
    # the below line should fail or return an empty dataframe but doesn't
    df_excel_2 = pd.read_excel(f)

# instead, we have that pd.testing.assert_frame_equal(df_excel_1, df_excel_2)

with open("test.csv", "r") as f:
    df_csv_1 = pd.read_csv(f)
    # the below line will fail with EmptyDataError, as expected
    df_csv_2 = pd.read_csv(f)

Issue Description

read_excel does not honor the starting position of the stream.

The problem is likely in pandas.io.excel._base.py:580 which calls .seek(0)

576:    if isinstance(self.handles.handle, self._workbook_class):
577:        self.book = self.handles.handle
578:    elif hasattr(self.handles.handle, "read"):
579:        # N.B. xlrd.Book has a read attribute too
580:        self.handles.handle.seek(0)
581:        try:
582:            self.book = self.load_workbook(self.handles.handle, engine_kwargs)
583:        except Exception:
584:            self.close()
585:            raise
586:    else:
587:        raise ValueError(
588:            "Must explicitly set engine if not passing in buffer or path for io."
589:        )

Why this is incorrect behaviour

I think calling seek(0) is anti-pattern in how streams are treated. For one, this ignores the current position of the stream for no apparent good reason. As second, a stream is not even required to implement a seek function to be considered a valid stream.

Hence, for example, you see other reports where people are passing a stream to read_excel that does not implement seek, such as, e.g. Requests response (https://github.com/pandas-dev/pandas/issues/28825). The latter issue appears to be resolved, but looks like it's been resolved by considering the request Response as a special case that is handled differently.

Path to solution

First would be to remove seek(0) call altogether.

If load_workbook requires the stream to be seekable, however, the handle passed to load_workbook should be wrapped into a BytesIO wrapper. In either case, the stream should be read till the end and/or until a terminating condition is met (I don't know enough about Excel internals) and then left there, and left to the subsequent consumer to either continue or to the caller to close the stream.

As an aside, the call to close the stream should only happen if the ExcelReader is the one that opened it. Otherwise, it should be the caller's responsibility to close it.

Expected Behavior

read_excel should start reading bytes from the current position of the stream and should not reset it.

Installed Versions

``` INSTALLED VERSIONS ------------------ commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673 python : 3.11.3.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_United Kingdom.1252 pandas : 2.1.4 numpy : 1.26.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 65.5.0 pip : 23.3.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader : None bs4 : None bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.4 qtpy : None pyqt5 : None ```
WillAyd commented 8 months ago

Sure - if this can be removed and still pass the test suite I would be OK with removing seek. There might be some historical cruft to that being in the there in the first place.

Did you want to try that in a PR @jf2?

jmarintur commented 6 months ago

Hi @WillAyd, I've been playing around (I've even removed the seek line), and this doesn't seem to be a pandas bug. If you use openpyx1 to read the file, you get to read it as many times as you want without reopening the workbook:

import openpyxl
import pandas as pd

df = pd.DataFrame({
    "A": [.0, .1],
    "B": ["a", "b"]
})
df.to_excel("test.xlsx", index=False)

wb = openpyxl.load_workbook("test.xlsx")
sheet = wb.active

rows_1 = list(sheet.values)
rows_2 = list(sheet.values)
rows_3 = list(sheet.values)

Could this be a feature on how excel files are handled? Wdyt?