pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.59k stars 1.89k forks source link

Combine `engine_options` and `read_options` into a single parameter in `read_excel` #17265

Open mcrumiller opened 3 months ago

mcrumiller commented 3 months ago

Description

Somewhat related to #17263.

read_excel has two parameters, engine_options and read_options:

This is really confusing from a user perspective. What's the difference between engine options that parse the data and engine options that reads the data? IMO, it would be easier to provide a universal set of options that are converted to the engine-specific options behind the scenes. All engines provide the same functionality for reading at this point, so it would be nice to instead have the following parameters (feel free to add):

stinodego commented 3 months ago

I'm pretty sure @alexander-beedie already has a design for this (see https://github.com/pola-rs/polars/pull/15808#issuecomment-2067944349). Not sure it will make it into 1.0.0 though.

alexander-beedie commented 3 months ago

I do, and it's "mostly" option 3, where the most common options are all exposed at the top-level and we convert them to engine-specific calls ourselves so the user doesn't have to (as we're now doing with schema_overrides, for example).

We will still maintain read_options and engine_options for what should become a vanishingly small number of cases not handled by the top-level parameters.

mcrumiller commented 3 months ago

Thanks; should I go ahead and close this one?

alexander-beedie commented 3 months ago

Thanks; should I go ahead and close this one?

It's ok, can leave it here so I have something to focus on and close when it's done (hopefully very shortly) 🤣

alexander-beedie commented 3 months ago

FYI: #17263 now contains a common columns param - more to come...

alexander-beedie commented 1 month ago

...and just added a common has_header param for all engines too👌

acxz commented 4 weeks ago

I'd like to comment on the suggestion for range with a +1. Often times spreadsheets have extra information in additional cells, so it's important to build the dataframe from a specified range or something like an excel table (in calamine and usage)