[ENH] mtype for multi-indexed `pandas` and `polars` dataframes

fkiraly commented 2 months ago

Specification and API consolidation discussion related to polars support for probabilistic predictions.

PR https://github.com/sktime/skpro/pull/399 makes it clear that, for full polars support, we need an internal representation for polars returns of predict_interval and predict_quantiles, and some means to convert between pandas and polars representation.

The PR proposes to extend the Table mtype, though I think that is dangerous as it would redefine the abstract datatype to include multi-index columns, which then would either affect, by a "chain of architecture", all mtypes in the Table scitype, where it is unclear what would, for instance, have to happen to pd.Series or numpy based ones.

However, the best way to proceed does not seem clear to me - hence the discussion issue.

Personally, I see two options to maintain a clearer structure, both leaving Table mtypes unchanged:

extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.
build a new scitype, TableMI (or similarly named), which has as ADT tables that can have a column multi-index (and, potentially, a row multi-index, but perhaps only later as we do not need this now). For the start, it can have three mtype-s, polars eager and lazy, and pd.DataFrame based.

We would then use the new concrete data structure in a converter in predict_interval and predict_quantiles, after the output has been produced.

julian-fong commented 2 months ago

I'm thinking about a potential third option, but need some more background on the 'proba' scitype

MTYPE_REGISTER_PROBA = [
    ("pred_interval", "Proba", "predictive intervals"),
    ("pred_quantiles", "Proba", "quantile predictions"),
    ("pred_var", "Proba", "variance predictions"),
    # ("pred_dost", "Proba", "full distribution predictions, tensorflow-probability"),
]

What are the possible datatypes that can be used for 'Proba' scitypes? i see pred_interval, pred_quantiles and pred_var as mtypes, but those don't indicate any explicit 'dataframe' types like the 'table' scitype does.

fkiraly commented 2 months ago

It's there as the expected outputs of the probabilitstic predict_ methods, just exactly following the specification of each expected output - I think this was not a good idea, since they do not have a common abstract data type, but I just used this for checking.

julian-fong commented 2 months ago

I think a potential idea would just have proba scitype be the 'tablemi' scitype, and then define a specific common abstract data type for it (I think it would be pretty simple, just extend the table to include mi-index and mi-columns).

If we can map a common abstract type similar to table we can reduce the hassle of introducing a new scitype and purpose 'proba' for this.

During set_output we can infer the proper scitype based on the function that is specified: 1) User calls predict -> follow the table scitype 2) User calls predict_proba/interval/var -> follow the 'proba' scitype (in this case it would be the multi-index indices and multi-index columns dataframes)

extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.

Would it be possible to leave it as 3 but extend the scope for each to include pandas and polars implicitly?

Would like to know your thoughts or if you see any obvious issues with this

fkiraly commented 2 months ago

Would like to know your thoughts or if you see any obvious issues with this

I think that would not work like this, because there is no common abstract data type in the Proba type, hence the conversions are unclear.

To see this, one would really have to write out the conversions and examples in all formality.

sktime / skpro

[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460