sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in Python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

[ENH] dummy supervised regressor with polars support #440

Open julian-fong opened 1 month ago

julian-fong commented 1 month ago

Implement the DummyProbaRegressor with complete end-to-end polars support in skpro.

Some current limitations:

fit inside DummyProbaRegressor uses skpro.distributions, which only supports pandas DataFrames, so it needs a workaround

predict_proba also uses skpro.distributions, which leads to the same issue and will need a workaround as well

@fkiraly any suggestions on how to implement?

julian-fong commented 1 month ago

@fkiraly I've run into a problem with the current implementation of polars support in skpro.

If an estimator specifies

"X_inner_mtype": "polars_eager_table",
"y_inner_mtype": "polars_eager_table",

then during the tests, pandas DataFrames are converted into polars DataFrames via check_X in the boilerplate code in regression.base, but they lose their index.

Since the index is already lost in check_X, it cannot be recovered inside the private methods, which receive a polars DataFrame without the index. The subsequent index asserts in the test files then fail once the DataFrame is converted back into a pandas DataFrame via the convert function.

fkiraly commented 1 month ago

Interesting - I thought it saved the index as a variable __index__ if it was not a range index.

Or is that only in the sktime implementation by @pranavvp16?

julian-fong commented 1 month ago

I think that would be in the sktime implementation. We currently do not save the index anywhere in the boilerplate if the incoming mtype is in polars format.

fkiraly commented 1 month ago

May I suggest syncing the two implementations? I think the sktime type by @pranavvp16 stores a non-range index as a reserved variable.
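
For illustration, the reserved-variable idea discussed above could look roughly like the sketch below. It uses pandas only, so that it runs without polars installed. The helper names stash_index and restore_index are hypothetical and are not part of skpro or sktime. The column name __index__ is taken from the thread, and how sktime actually handles it is not shown here. The sketch assumes an unnamed, single-level index.

```python
import pandas as pd

# Reserved column name mentioned in the thread; treated here as an assumption.
INDEX_COL = "__index__"


def stash_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: copy a non-range index into a reserved column.

    The result has a plain RangeIndex, so nothing is lost when the frame
    is later converted to an index-free format such as a polars DataFrame.
    """
    if isinstance(df.index, pd.RangeIndex):
        # Range index: nothing is stashed (as described in the thread).
        return df.copy()
    out = df.copy()
    out.insert(0, INDEX_COL, df.index.to_numpy())
    return out.reset_index(drop=True)


def restore_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rebuild the index from the reserved column."""
    if INDEX_COL not in df.columns:
        return df.copy()
    out = df.set_index(INDEX_COL)
    out.index.name = None  # assumption: the original index was unnamed
    return out


# A frame with a non-range index, as the test suite might pass in.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

# 'flat' stands in for what would be handed to the polars conversion.
flat = stash_index(df)

# After converting back to pandas, the index is rebuilt from the column.
back = restore_index(flat)
```

In a real conversion, the polars step would sit between stash_index and restore_index. The private methods would then also need to ignore the reserved column when fitting and predicting.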