sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
250 stars 46 forks source link

[ENH] Improve efficiency of `Histogram Distribution` #405

Open ShreeshaM07 opened 5 months ago

ShreeshaM07 commented 5 months ago

Describe the maintenance issue

The Histogram distribution implemented in #382, #335 takes 2 parameters bins and bin_mass which consists of ragged arrays in the 2D case. These ragged arrays are stored in a list as it is not possible to vectorize the ragged inputs easily. One way to vectorize them and make all of them the same shape is to pad them with 0s in case of bin_mass and pad bins with -np.inf and np.inf on the left and right side to make them equal in length.

But the problem with this approach of vectorizing is that it takes longer time than the current approach, as although the running times of the methods mean,var per se are improved by a factor of 5 but the time to pad the inputs in the above mentioned way itself is taking a lot of time which is giving worse efficiency results overall than the current approach.

Refer here to know more about the benchmarking of the Histogram Distribution.

The idea of taking the input from the user itself in this vectorized way with all the inputs padded with 0s and infs does not seem to be a very good idea as this would be very inconvenient for the user to pad them manually in cases where the lengths of the inputs vary by a big number and this would also not allow for tuple inputs in cases where the bins are of equal widths.

The Histogram Distribution inherits from the _BaseArrayDistribution which inherits the BaseDistribution with some overriding of private functions to accomodate the array distribuitons. Thoughts on ways of merging this with BaseDistribution without having to create a separate base class for arrays is also appreciated.

fkiraly commented 5 months ago

Indeed - there is also the open question how the parameters should be encoded, as an "inefficient" representation may mask any benfits from vectorization.

This is an open question, I would guess that pd.DataFrame with nested row index might be a way to go, since we can make use of groupby etc.

Less of a priority for now though, so parking thoughts in this issue.