pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.94k forks source link

ENH: read_excel should assume engine when passing openpyxl.Workbook object #46352

Open scotscotmcc opened 2 years ago

scotscotmcc commented 2 years ago

Is your feature request related to a problem?

When your io parameter for pd.read_excel() is an openpyxl.Workbook object, the function should assume that engine='openpyxl'. Right now, if you don't specify the engine, you will get a ValueError: Invalid file path or buffer object type: <class 'openpyxl.workbook.workbook.Workbook'>. However, if you explicitly pass the engine, it will load fine.

It seems like passing in the specific workbook object would only ever be done if you want it to read it with that engine, so we should just assume that engine.

The current documentation for read_excel() specifies that you can use an xlrd.Book, and that works fine without an explicit engine. It doesn't say that you can use an openpyxl.Workbook at all, but it does work when you declare the engine.

This has come up as a small inconvenience for me a few times. I have a script where I am pulling some data out of Excel that is not a table at all, and so I'm using opening it as an openpyxl.Workbook and using some of the tools from that library, and then also want to pull another table using pandas. I want to do this without opening the workbook multiple times, so I use the same workbook to pass into pd.read_excel(). It then throws and error and I say "Oh yea, I need the engine."

Describe the solution you'd like

One should be able to do

import pandas as pd
import openpyxl

wb = openpyxl.load_Workbook(filepath,...)
df = pd.read_excel(wb,...) # and not include engine
wb.close()

This should work the same as just df = pd.read_excel(filepath,...)

Describe alternatives you've considered

Alternatives seem to be (1) explicitly pass engine or (2) don't use openpyxl.Workbook objects.

Potential Solution

# starting line 1466 or so of pandas.io.excel_base.py
if isinstance(path_or_buffer, openpyxl.Workbook):
    if engine == "openpyxl":
        pass
    elif engine is None:
        engine = "openpyxl"
    else:
        raise ValueError(f"Excel object is openpyxl.Workbook, but engine is {engine}")
scotscotmcc commented 2 years ago

That potential solution of mine above doesn't work well. For one, it only checks for openpyxl, not the other handful of engines/libraries. Second, openpyxl isn't actually imported in that file, so this fails.

I was working on another potential solution that introduces a new function to handle it and runs through a series of the import_optional_dependency() calls and checks the object against the type from the different libraries.