sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in Python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

[ENH] dummy supervised regressor with polars support #440

Open julian-fong opened 4 months ago

julian-fong commented 4 months ago

Implement the DummyProbaRegressor, but with complete end-to-end polars support in skpro.

Some current limitations:

fit inside DummyProbaRegressor uses skpro.distributions, which only supports pandas DataFrames, so it needs a workaround

predict_proba also uses skpro.distributions, which leads to the same issue and will need a workaround as well

@fkiraly any suggestions on how to implement?

julian-fong commented 3 months ago

@fkiraly I've run into a problem with the current implementation of polars support in skpro.

If an estimator specifies

"X_inner_mtype": "polars_eager_table",
"y_inner_mtype": "polars_eager_table",

then, during the tests, pandas DataFrames are converted into polars DataFrames via check_X in the boilerplate code in regression.base, and they lose their index in the process.

Because the index is already dropped by check_X, it cannot be retrieved inside the private methods: by the time they are called, the input is a polars DataFrame with no index. Once the output is converted back into a pandas DataFrame via the convert function, the subsequent index asserts in the test files fail.

fkiraly commented 3 months ago

Interesting - I thought it saved the index as a variable __index__ if it was not a range index.

Or is that only in the sktime implementation by @pranavvp16?

julian-fong commented 3 months ago

I think that would be in the sktime implementation. In skpro, the boilerplate currently does not save the index anywhere if the incoming mtype is in polars format.

fkiraly commented 3 months ago

May I suggest syncing the two implementations? I think the sktime type by @pranavvp16 stores a non-range index as a reserved variable.
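
For illustration, a minimal sketch of the reserved-column idea discussed above. It is not the sktime or skpro code: the helper names (stash_index, restore_index, to_indexless, from_indexless) are hypothetical, and a plain dict of column lists stands in for a polars table so that the snippet only needs pandas. The only name taken from this thread is the reserved column `__index__`.

```python
import pandas as pd

INDEX_COL = "__index__"  # reserved column name mentioned in this thread


def to_indexless(df):
    """Stand-in for pandas -> polars: keeps the columns, drops the index."""
    return {col: df[col].tolist() for col in df.columns}


def from_indexless(table):
    """Stand-in for polars -> pandas: rebuilds a frame with a default index."""
    return pd.DataFrame(table)


def stash_index(df):
    """Copy a non-default index into the reserved column before converting."""
    if df.index.equals(pd.RangeIndex(len(df))):
        return df  # default 0..n-1 index, nothing to preserve
    out = df.copy()
    out.insert(0, INDEX_COL, df.index)
    return out.reset_index(drop=True)


def restore_index(df):
    """Move the reserved column back into the index, if it is present."""
    if INDEX_COL not in df.columns:
        return df
    out = df.set_index(INDEX_COL)
    out.index.name = None  # assumption: the original index was unnamed
    return out


X = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

# Naive round trip: the index is lost, as described in this issue.
naive = from_indexless(to_indexless(X))

# Round trip with the index stashed in the reserved column.
kept = restore_index(from_indexless(to_indexless(stash_index(X))))

print(naive.index.tolist())  # [0, 1, 2]
print(kept.index.tolist())   # [10, 20, 30]
```

In this sketch the index name is not preserved, and MultiIndex inputs are not handled; a real implementation would need to decide how to treat both.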