Open fkiraly opened 2 months ago
I'm thinking about a potential third option, but need some more background on the 'proba' scitype
```python
MTYPE_REGISTER_PROBA = [
    ("pred_interval", "Proba", "predictive intervals"),
    ("pred_quantiles", "Proba", "quantile predictions"),
    ("pred_var", "Proba", "variance predictions"),
    # ("pred_dost", "Proba", "full distribution predictions, tensorflow-probability"),
]
```
What are the possible datatypes that can be used for 'Proba' scitypes? I see `pred_interval`, `pred_quantiles` and `pred_var` as mtypes, but those don't indicate any explicit 'dataframe' types like the 'table' scitype does.
It's there as the expected outputs of the probabilistic `predict_` methods, exactly following the specification of each expected output. I think this was not a good idea, since they do not have a common abstract data type, but I just used this for checking.
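For illustration, here is a minimal sketch of what the three `pandas` based return formats look like and why they share no common shape. The column conventions shown (variable, coverage, lower/upper for intervals; variable, quantile level for quantiles; flat variable columns for variance) follow the sktime/skpro convention as I understand it; the variable name `y` and all numbers are made up.

```python
import pandas as pd

# pred_interval: three column levels (variable, coverage, "lower"/"upper")
interval_cols = pd.MultiIndex.from_product([["y"], [0.9], ["lower", "upper"]])
pred_interval = pd.DataFrame([[1.0, 3.0], [2.0, 4.0]], columns=interval_cols)

# pred_quantiles: two column levels (variable, quantile level alpha)
quantile_cols = pd.MultiIndex.from_product([["y"], [0.05, 0.95]])
pred_quantiles = pd.DataFrame([[1.0, 3.0], [2.0, 4.0]], columns=quantile_cols)

# pred_var: flat columns, one per variable (illustrative values)
pred_var = pd.DataFrame({"y": [0.37, 0.37]})

# Each format has a different number of column levels.
for name, df in [
    ("pred_interval", pred_interval),
    ("pred_quantiles", pred_quantiles),
    ("pred_var", pred_var),
]:
    print(name, df.columns.nlevels)
```

The differing number of column levels is one concrete way to see that the three mtypes do not share a single abstract data type.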
I think one potential idea would be to have the `proba` scitype be the 'tablemi' scitype, and then define a specific common abstract data type for it (I think it would be pretty simple: just extend the `table` one to include a multi-index for rows and a multi-index for columns).
If we can map to a common abstract type similar to `table`, we can reduce the hassle of introducing a new scitype and repurpose 'proba' for this.
During `set_output`, we can infer the proper scitype based on the function that is called:
1) User calls `predict` -> follow the `table` scitype
2) User calls `predict_proba/interval/var` -> follow the 'proba' scitype (in this case, dataframes with multi-index rows and multi-index columns)
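The inference rule in the two points above could be sketched as a simple lookup. This is only an illustration: `METHOD_TO_SCITYPE` and `infer_output_scitype` are hypothetical names, not part of skpro or sktime, and the mapping merely restates the list above.

```python
# Hypothetical mapping from the called method to the scitype of its return.
METHOD_TO_SCITYPE = {
    "predict": "Table",
    "predict_proba": "Proba",
    "predict_interval": "Proba",
    "predict_var": "Proba",
}


def infer_output_scitype(method_name: str) -> str:
    """Return the scitype that the output of ``method_name`` should follow."""
    try:
        return METHOD_TO_SCITYPE[method_name]
    except KeyError:
        raise ValueError(f"no output scitype known for {method_name!r}") from None


print(infer_output_scitype("predict"))
print(infer_output_scitype("predict_interval"))
```

A lookup like this only selects the scitype; which concrete container is used within that scitype would still have to be decided separately.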
> - extend the existing `Proba` mtype with `polars` based ones. However, this would result in one `polars` based mtype per `predict` output, which now seems a less clean design than it originally seemed when there were only `pandas` based ones.
Would it be possible to leave it at three mtypes but extend the scope of each to include `pandas` and `polars` implicitly?
Would like to know your thoughts or if you see any obvious issues with this
I think that would not work like this, because there is no common abstract data type in the `Proba` type, hence the conversions are unclear.
To see this, one would really have to write out the conversions and examples in all formality.
Specification and API consolidation discussion related to `polars` support for probabilistic predictions.
PR https://github.com/sktime/skpro/pull/399 makes it clear that, for full `polars` support, we need an internal representation for `polars` returns of `predict_interval` and `predict_quantiles`, and some means to convert between the `pandas` and `polars` representations.
The PR proposes to extend the `Table` mtype, though I think that is dangerous as it would redefine the abstract data type to include multi-index columns. This would then affect, by a "chain of architecture", all mtypes in the `Table` scitype, and it is unclear what would have to happen, for instance, to `pd.Series` or `numpy` based ones.
However, the best way to proceed does not seem clear to me - hence the discussion issue.
Personally, I see two options to maintain a clearer structure, both leaving `Table` mtypes unchanged:
- extend the existing `Proba` mtype with `polars` based ones. However, this would result in one `polars` based mtype per `predict` output, which now seems a less clean design than it originally seemed when there were only `pandas` based ones.
- introduce a new scitype `TableMI` (or similarly named), which has as ADT tables that can have a column multi-index (and, potentially, a row multi-index, but perhaps only later as we do not need this now). For the start, it can have three mtypes: `polars` eager and lazy, and `pd.DataFrame` based.
based.We would then use the new concrete data structure in a converter in
predict_interval
andpredict_quantiles
, after the output has been produced.
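Since `polars` frames have plain string column names and no column multi-index, any such converter needs some encoding of multi-level column labels into flat names and back. The following is a rough sketch of one possible encoding, not the scheme used in the PR: the separator `__`, the names `flatten_columns`, `unflatten_columns` and `LEVEL_TYPES`, and the assumption that variable names are strings are all hypothetical.

```python
SEP = "__"  # hypothetical separator; breaks if a variable name contains it

# Level types per mtype, used to parse flat names back into tuples.
# Assumes variable names are strings.
LEVEL_TYPES = {
    "pred_interval": (str, float, str),  # (variable, coverage, lower/upper)
    "pred_quantiles": (str, float),      # (variable, quantile level)
}


def flatten_columns(columns):
    """Encode multi-level column labels (tuples) as flat string names."""
    return [SEP.join(str(level) for level in col) for col in columns]


def unflatten_columns(names, mtype):
    """Decode flat string names back into tuples, using the level types of mtype."""
    types = LEVEL_TYPES[mtype]
    decoded = []
    for name in names:
        parts = name.split(SEP)
        if len(parts) != len(types):
            raise ValueError(f"cannot parse {name!r} as a {mtype} column")
        decoded.append(tuple(t(p) for t, p in zip(types, parts)))
    return decoded


cols = [("y", 0.9, "lower"), ("y", 0.9, "upper")]
flat = flatten_columns(cols)
print(flat)
print(unflatten_columns(flat, "pred_interval") == cols)
```

With `pandas`, iterating over a `MultiIndex` yields tuples, so the flat names could be assigned to `df.columns` before handing the frame to `polars`, and `pd.MultiIndex.from_tuples` could rebuild the column index on the way back. Note that the decoder still needs to know which mtype it is parsing, which reflects the point made in the discussion that the three formats lack a common abstract data type.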