Open rhshadrach opened 6 months ago
Yes, I agree we should deprecate ExcelFile.parse; there's no reason to fix it.
And I'm not against deprecating ExcelFile either.
+1 for deprecating ExcelFile and ExcelFile.parse
To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as the file is read into memory only once.
Deprecating ExcelFile.parse means this feature is lost. What would be the alternative, @rhshadrach?
@samukweku That is not lost if only pd.ExcelFile.parse is deprecated while pd.ExcelFile remains available (which is what I would favor as well).
Deprecating ExcelFile.parse means this feature is lost. What would be the alternative, @rhshadrach?
pd.read_excel
@asishm kindly explain how it is not lost? @rhshadrach won't the pd.read_excel option be less performant for multiple sheets (since with read_excel you read the file more than once)?
@samukweku the read_excel() function does allow you to read in multiple sheets at once by explicitly specifying the worksheets to be read in as a list passed to the sheet_name parameter. Or you can pass None as the value to read in all sheets.
Albeit, when reading in multiple worksheets you are returned a dict of DataFrames.
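To make the point above concrete, here is a minimal sketch of reading several sheets in one read_excel call. The workbook, its sheet names, and its data are invented for illustration, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway workbook so the example is self-contained.
# The file name, sheet names, and data are hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)
    pd.DataFrame({"c": [5, 6]}).to_excel(writer, sheet_name="Sheet3", index=False)

# A list of sheet names returns a dict of DataFrames keyed by those names.
some_sheets = pd.read_excel(path, sheet_name=["Sheet1", "Sheet2"])

# sheet_name=None returns every sheet in the workbook.
all_sheets = pd.read_excel(path, sheet_name=None)

print(list(some_sheets))
print(list(all_sheets))
```

Note that a single call like this applies the same keyword arguments to every sheet it reads, so it does not cover the case where each sheet needs different parsing options.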
@rhshadrach won't the pd.read_excel option be less performant for multiple sheets (since with read_excel you read the file more than once)?
You can pass an ExcelFile instance to read_excel. It directly calls ExcelFile.parse internally.
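As a rough check of that claim, the sketch below opens a workbook once with ExcelFile and reads the same sheet both ways. The workbook contents are made up for the example, and it assumes an Excel engine such as openpyxl is available; it only shows that the two calls give the same result here, not how pandas is implemented.

```python
import os
import tempfile

import pandas as pd

# Hypothetical throwaway workbook, created only for this example.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"id": [1, 2], "x": [0.5, 1.5]}).to_excel(
        writer, sheet_name="Sheet1", index=False
    )

# Open the file once, then read the same sheet through both entry points.
with pd.ExcelFile(path) as xls:
    via_read_excel = pd.read_excel(xls, sheet_name="Sheet1")
    via_parse = xls.parse(sheet_name="Sheet1")

# True when both calls produce identical DataFrames.
same = via_read_excel.equals(via_parse)
print(same)
```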
As noted in the user guide, the use case below can't be achieved without re-reading the file with the current API of pd.read_excel(fp-like):
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:
import pandas as pd

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)
Thanks @asishm - that has me convinced that ExcelFile itself should stay.
What is supposed to be done in this issue? Can you please clarify?
It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but now read_excel has gained and modified parameters that have not been added/changed in ExcelFile.parse. For example:
- ExcelFile.parse lacks a dtype parameter
- ExcelFile.parse has a **kwds argument that is passed on to pandas internals with no documentation on what can be included. Invalid arguments are just ignored (e.g. #50953)
It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.
Edit: I no longer think deprecating ExcelFile entirely as mentioned below is a good option. See https://github.com/pandas-dev/pandas/issues/58247#issuecomment-2067632583.
Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is to get the underlying book or sheet_names without reading the entire file. One can somewhat work around this by using nrows, but it's clunky.
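For illustration, here is a minimal sketch of the two ways of listing sheet names mentioned above: ExcelFile.sheet_names, and a read_excel workaround that limits parsing with nrows. The workbook is invented for the example, nrows=1 is just one possible choice, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical throwaway workbook, created only for this example.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": range(100)}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"b": range(100)}).to_excel(writer, sheet_name="Sheet2", index=False)

# With ExcelFile, the sheet names are available without parsing any sheet data.
with pd.ExcelFile(path) as xls:
    names_from_excelfile = xls.sheet_names

# Workaround without ExcelFile: parse every sheet but keep only one data row,
# then take the keys of the returned dict.
one_row_per_sheet = pd.read_excel(path, sheet_name=None, nrows=1)
names_from_read_excel = list(one_row_per_sheet)

print(names_from_excelfile)
print(names_from_read_excel)
```

The workaround still builds a DataFrame per sheet just to discard it, which is roughly the clunkiness described above.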