pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.97k stars 1.71k forks source link

"has_header" in "read_options" for `read_excel` has inconsistent behavior across engines #16944

Open toan-quach opened 3 weeks ago

toan-quach commented 3 weeks ago

Checks

Reproducible example

If I use the default engine option ("xlsx2csv") it returns the expected result Code:

pl_data = pl.read_excel(
    "example.xlsx", read_options={"has_header": False}, sheet_name="Sheet1"
)

Result: image

If I use "openpyxl" as the engine, it returns an unexpected result Code:

pl_data = pl.read_excel(
    "example.xlsx", read_options={"has_header": False}, sheet_name="Sheet1", engine="openpyxl"
)

Result: image

File used: example.xlsx

Log output

No response

Issue description

When I use the default engine xlsx2csv to read the Excel file without header, it returns the correct result, with the "supposed" header being one of the rows in the DataFrame. But when I switch to openpyxl (the engine I'm using) and read the file again, also with no header, the "supposed" header is now read as the header and not one of the rows of the DataFrame. I have provided images and code to reproduce.

Expected behavior

I want to use openpyxl as my engine and read the Excel file without a header. I expect the 1st row in the example.xlsx file to be 1 of the row in the DataFrame and not as the DataFrame header.

Installed versions

``` --------Version info--------- Polars: 0.20.31 Index type: UInt32 Platform: macOS-13.6.3-arm64-arm-64bit Python: 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 3.0.0 connectorx: deltalake: fastexcel: fsspec: 2023.12.2 gevent: 23.7.0 hvplot: matplotlib: 3.9.0 nest_asyncio: 1.5.9 numpy: 1.26.4 openpyxl: 3.1.2 pandas: 1.3.5 pyarrow: 14.0.2 pydantic: pyiceberg: pyxlsb: sqlalchemy: 2.0.16 torch: xlsx2csv: xlsxwriter: 3.2.0 ```
toan-quach commented 3 weeks ago

After reading more into the source code, I noticed that the read_options is passed to the "load" excel function of the engine. In openpyxl case, the load_workbook function doesn't have any ignore header related parameter. Hence, I concluded that this is a limitation of the openpyxl engine itself.

@ritchie46 Do let me know if my understanding is correct 😄