Add method (or parameter) to get sheet names from Excel file

pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

https://docs.pola.rs

Other

30.06k stars 1.94k forks source link

Add method (or parameter) to get sheet names from Excel file #18064

Open deanm0000 opened 2 months ago

deanm0000 commented 2 months ago

Description

I'm not sure if it's better to add a method or just a parameter to read_excel but add a means of getting a DataFrame/List/Dict/whatever of sheet names.

Something like

from polars.io.spreadsheet.functions import _initialise_spreadsheet_parser
_,_,sheets=_initialise_spreadsheet_parser('calamine', '/home/dean/pyjs/generic/windows/blah.xlsx',{})
pl.DataFrame(sheets)

alexander-beedie commented 2 months ago

Hmm. If you just want the sheet names, I don't think that really belongs in Polars 🤔

import polars as pl
import fastexcel

sheet_names = fastexcel.read_excel("test.xlsx").sheet_names
df = pl.DataFrame({"sheets": sheet_names})
# shape: (3, 1)
# ┌────────┐
# │ sheets │
# │ ---    │
# │ str    │
# ╞════════╡
# │ Sheet1 │
# │ Sheet2 │
# │ Sheet3 │
# └────────┘

And if you want the sheet names associated with table/frame data, that's already possible:

pl.read_excel("test.xlsx", sheet_id=0)
# {"Sheet1": ..., "Sheet2": ..., "Sheet3": ...}

cmdlineluser commented 2 months ago

@alexander-beedie Just for reference, it was asked by a user on SO:

https://stackoverflow.com/questions/78838584/how-do-i-get-all-sheet-names-in-an-excel-file-using-polars

But it seems like the sheet_id=0 approach is what they really want.

deanm0000 commented 2 months ago

The case for loading the sheet names only is if you have a giant excel file that you only need one sheet from. You know roughly what the name is but it's got names like "1 apple", "2 banana", "3 carrot", etc. Sometimes the source of the file skips a tab so you get "1 apple", "2 carrot", etc. If you only want "x rutabaga", you can't reliably just do "18 rutabaga", you have to get the sheet names and find the one that contains rutabaga.

Of course, the easiest way to do that is importing fastexcel to get the sheet names but, to me, it seems like if we're going to have pl.read_excel rather than telling people to load fastexcel and use its to_polars method then having a way to list the sheet names without the extra import seems in scope too.

@cmdlineluser I did see that one. I really dislike how pandas makes an ExcelFile class so I certainly wouldn't want to go that route.