When the input is a polars dataframe and the transformer given to OnEachColumn or OnSubFrame is a scikit-learn transformer that exposes the `set_output` API, we would like to call `set_output(transform='polars')` so that it returns a polars dataframe.
However, in scikit-learn < 1.4, the 'polars' option is not available, only 'pandas'.
So we have a few options:
- Require scikit-learn >= 1.4, but that is quite constraining for users because it is very recent.
- Basically re-implement `set_output` ourselves, which is not good because it duplicates code that is already in scikit-learn.
- Use `set_output(transform='pandas')` and convert the result to polars with `polars.from_pandas`. This lets scikit-learn deal with the conversion from numpy, getting the column names from `get_feature_names_out`, etc., and produce a pandas dataframe, which is easy to convert to polars. This is what this PR does. (For scikit-learn >= 1.4 it just uses `set_output(transform='polars')`.)
This approach does not complicate the code much, and avoids complicating the combinations of versions that we allow. It may also result in better error messages for custom estimators that don't support the `set_output` API.
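
For illustration, the version-dependent fallback could look roughly like the sketch below. It is not the code in this PR: `output_container` and `fit_transform_polars` are hypothetical names, and converting the input with `to_pandas()` on the older-scikit-learn path is an assumption made to keep the example simple (it requires pandas and pyarrow).

```python
import re


def output_container(sklearn_version):
    """Pick the set_output container to request when the input is polars."""
    match = re.match(r"(\d+)\.(\d+)", sklearn_version)
    if match is None:
        raise ValueError(f"Unrecognized version: {sklearn_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # The 'polars' container is only accepted by scikit-learn >= 1.4.
    return "polars" if (major, minor) >= (1, 4) else "pandas"


def fit_transform_polars(transformer, df):
    """Hypothetical helper: fit on a polars dataframe, return a polars dataframe."""
    # Imported lazily so the version check above can be used on its own.
    import polars as pl
    import sklearn

    container = output_container(sklearn.__version__)
    # Note: set_output configures the transformer in place.
    transformer.set_output(transform=container)
    if container == "polars":
        return transformer.fit_transform(df)
    # Older scikit-learn: let it build a pandas dataframe (column names
    # included), then convert. Assumption: the input goes through pandas too.
    return pl.from_pandas(transformer.fit_transform(df.to_pandas()))
```

With something like `fit_transform_polars(StandardScaler(), pl.DataFrame({"a": [1.0, 2.0]}))`, the caller would get a polars dataframe back on either side of the 1.4 boundary.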
In addition to the logic around scikit-learn 1.4 and the pandas -> polars conversion, this PR improves the checks of the transformer's output type.
I think this may be a better option than #940