pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Expose the individual parameters from fastexcel.load_sheet in pl.read_excel #17727

Open deanm0000 opened 2 months ago

deanm0000 commented 2 months ago

Description

I think everybody likes type hints, but with read_excel we don't get any for the engine-specific options, even for the default engine, calamine. Since calamine is the default, I contend we should expose its options directly as keyword arguments, like this:

def read_excel(
    source: str | Path | IO[bytes] | bytes,
    *,
    sheet_id: int | Sequence[int] | None = None,
    sheet_name: str | list[str] | tuple[str] | None = None,
    engine: ExcelSpreadsheetEngine = "calamine",
    engine_options: dict[str, Any] | None = None,
    read_options: dict[str, Any] | None = None,
    columns: Sequence[int] | Sequence[str] | None = None,
    schema_overrides: SchemaDict | None = None,
    infer_schema_length: int | None = N_INFER_DEFAULT,
    raise_if_empty: bool = True,
    header_row: int | None = -999,  # None is meaningful here, so -999 is the "not set" default; it is changed to 0 below
    column_names: list[str] | None = None,
    skip_rows: int | None = None,
    n_rows: int | None = None,
    schema_sample_rows: int | None = None,
    dtype_coercion: Literal['coerce', 'strict'] | None = None,
    use_columns: list[str] | list[int] | str | Callable[[ColumnInfo], bool] | None = None,
    dtypes: 'DTypeMap | None' = None,
) -> pl.DataFrame | dict[str, pl.DataFrame]:

and then do something like

if engine != 'calamine':
    # Values that mean "this calamine-only parameter was not set by the caller".
    expected_defaults = {
        'header_row': -999,
        'column_names': None,
        'skip_rows': None,
        'n_rows': None,
        'schema_sample_rows': None,
        'dtype_coercion': None,
        'use_columns': None,
        'dtypes': None,
    }

    for param, value in expected_defaults.items():
        if locals()[param] != value:
            raise InvalidOperationError(f"can't use {param} with {engine}, only with calamine")
# Replace the "not set" placeholders with the real calamine defaults.
if header_row == -999:
    header_row = 0
if schema_sample_rows is None:
    schema_sample_rows = 1000
if dtype_coercion is None:
    dtype_coercion = 'coerce'

That way, when people type pl.read_excel( they will see

[image: screenshot not preserved]

Alternatively, we could do classes such as

class Calamineopts:
    def __init__(
        self,
        idx_or_name=None,  # accepted, but not stored or returned by to_dict()
        header_row=0,
        column_names=None,
        skip_rows=0,
        n_rows=None,
        schema_sample_rows=1000,
        dtype_coercion='coerce',
        use_columns=None,
        dtypes=None,
    ):
        self.header_row = header_row
        self.column_names = column_names
        self.skip_rows = skip_rows
        self.n_rows = n_rows
        self.schema_sample_rows = schema_sample_rows
        self.dtype_coercion = dtype_coercion
        self.use_columns = use_columns
        self.dtypes = dtypes

    def to_dict(self):
        # Return the options as keyword arguments for the reader.
        return dict(
            header_row=self.header_row,
            column_names=self.column_names,
            skip_rows=self.skip_rows,
            n_rows=self.n_rows,
            schema_sample_rows=self.schema_sample_rows,
            dtype_coercion=self.dtype_coercion,
            use_columns=self.use_columns,
            dtypes=self.dtypes,
        )

and then users could do

import polars as pl
from polars.xlreaderopts import Calamineopts as co

pl.read_excel(myfile, read_options=co(

and at that point they'd get type hints from the class.

Of course, pl.read_excel would then need to check whether read_options is an instance of that class and, if so, call to_dict() on it.
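
A rough sketch of that check, assuming the options class is written as a dataclass so that the dict conversion comes for free. `CalamineOptions` and `normalize_read_options` are hypothetical names used only for illustration, and the field list is shortened:

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class CalamineOptions:
    """Hypothetical typed container for a subset of the calamine options."""

    header_row: Optional[int] = 0
    skip_rows: int = 0
    n_rows: Optional[int] = None
    schema_sample_rows: Optional[int] = 1000
    dtype_coercion: str = "coerce"


def normalize_read_options(
    read_options: Union[Dict[str, Any], CalamineOptions, None],
) -> Dict[str, Any]:
    """Accept a plain dict, an options object, or None; always return a dict."""
    if read_options is None:
        return {}
    if isinstance(read_options, CalamineOptions):
        return asdict(read_options)
    return dict(read_options)  # copy so the caller's dict is never mutated
```

Existing callers that pass a plain dict would keep working, while callers that construct the options object get completion and type checking on each field.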

alexander-beedie commented 2 months ago

I'll take a look in a bit; I've started consolidating more of the common options into the top-level API for better clarity/consistency - this should lead to less need for the two options dicts, and a better overall user experience.