pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

DEPR: ExcelFile.parse #58247

Open rhshadrach opened 6 months ago

rhshadrach commented 6 months ago

It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but now read_excel has gained and modified parameters that have not been added/changed in ExcelFile.parse. For example:

It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.

Edit: I no longer think deprecating ExcelFile entirely as mentioned below is a good option. See https://github.com/pandas-dev/pandas/issues/58247#issuecomment-2067632583.

Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is to get the underlying book or sheet_names without reading the entire file.

df = pd.DataFrame(np.zeros((100, 100)))
with pd.ExcelWriter("test.xlsx") as writer:
    for e in range(10):
        df.to_excel(writer, sheet_name=str(e))

%timeit pd.ExcelFile("test.xlsx").sheet_names
# 14.1 ms ± 76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_excel("test.xlsx", sheet_name=None)
# 411 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can somewhat work around this by using nrows, but it's clunky.

%timeit pd.read_excel("test.xlsx", sheet_name=None, nrows=0).keys()
# 57.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Aloqeely commented 6 months ago

Yes I agree we should deprecate ExcelFile.parse, no reason to fix it.

And I'm not against deprecating ExcelFile either.

rmhowe425 commented 6 months ago

+1 for deprecating ExcelFile and ExcelFile.parse

samukweku commented 6 months ago

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel There will be a performance benefit for reading multiple sheets as the file is read into memory only once.

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

asishm commented 6 months ago

@samukweku That is not lost if only pd.ExcelFile.parse is deprecated while pd.ExcelFile remains available (which is what I would favor as well).

rhshadrach commented 6 months ago

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

pd.read_excel

samukweku commented 6 months ago

@asishm kindly explain how it is not lost? @rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

rmhowe425 commented 6 months ago

@samukweku the read_excel() function does allow you to read in multiple sheets at once by explicitly specifying the worksheets to be read in as values for the _sheetname parameter. Or you can pass None as a value to read in all sheets.

Albeit, when reading in multiple worksheets you are returned a dict of DataFrames.

image

rhshadrach commented 6 months ago

@rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

You can pass an ExcelFile instance to read_excel. It directly calls ExcelFile.parse internally.

asishm commented 6 months ago

As noted in the user guide, the usecase below can't be achieved without a re-read with the current api of pd.read_excel(fp-like)

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)
rhshadrach commented 6 months ago

Thanks @asishm - that has me convinced that ExcelFile itself should stay.

GAuravY19 commented 5 months ago

what is supposed to be done in this issue ??

Can you please clarify it properly.