koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

perf: make sure X.schema is only calculated once per dataframe in TypeSelector #676

Closed MarcoGorelli closed 5 months ago

MarcoGorelli commented 5 months ago

Description

Refactors TypeSelector to use Narwhals selectors, as opposed to hard-coding types

This is preferable because, for Polars LazyFrames, getting the schema isn't a free operation, especially if the original dataframe is on the cloud. LazyFrame.schema may get renamed to LazyFrame.collect_schema for this reason.

I've also expanded the tests

MarcoGorelli commented 5 months ago

I have one concern though: I realized that Polars selectors select in the order they are given and do not maintain the original order.

That's a good point, thanks. It should be possible to preserve the order. df.schema needs calling here anyway, so I'll update, making sure that df.schema is called no more than once per dataframe.
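
For illustration only, here is a minimal sketch of the idea: read the schema once, then filter it so the selected columns keep their original order. This is not the scikit-lego or Narwhals implementation. `CountingFrame`, `select_columns`, and the string dtype labels are hypothetical stand-ins, used so the number of schema reads can be counted without a real dataframe.

```python
from typing import Dict, Iterable, List


class CountingFrame:
    """Hypothetical stand-in for a dataframe whose schema is costly to compute."""

    def __init__(self, schema: Dict[str, str]) -> None:
        # Dict insertion order plays the role of the original column order.
        self._schema = dict(schema)
        self.schema_calls = 0

    @property
    def schema(self) -> Dict[str, str]:
        # Count every access so repeated schema reads are visible.
        self.schema_calls += 1
        return dict(self._schema)


def select_columns(frame: CountingFrame, include: Iterable[str]) -> List[str]:
    """Return names of columns whose dtype is in `include`, in frame order."""
    wanted = set(include)
    schema = frame.schema  # the only schema read for this frame
    # Iterating over the schema (not over `include`) preserves column order.
    return [name for name, dtype in schema.items() if dtype in wanted]


frame = CountingFrame({"a": "int", "b": "str", "c": "float", "d": "int"})
# "float" is listed before "int", yet the result follows the frame's order.
selected = select_columns(frame, include=["float", "int"])
print(selected, frame.schema_calls)
```

Iterating over the schema rather than over the requested dtypes is what keeps the result in column order, and binding the schema to a local variable is what keeps the read count at one.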