Open ShreeshaM07 opened 5 months ago
Indeed. There is also the open question of how the parameters should be encoded, as an "inefficient" representation may mask any benefits from vectorization.
I would guess that a `pd.DataFrame` with a nested row index might be a way to go, since we can make use of `groupby` etc.
Less of a priority for now though, so parking thoughts in this issue.
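To make the nested-row-index idea concrete, here is a minimal sketch, not a proposal for the actual encoding. It assumes a long format with one row per bin, a hypothetical `(dist, bin)` index, and hypothetical column names `left`, `right`, `mass`. It also assumes `bin_mass` holds the probability mass of each bin, with uniform density inside a bin.

```python
import pandas as pd

# Two histograms with different numbers of bins (ragged input).
bins = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0, 6.0, 8.0]]
bin_mass = [[0.25, 0.75], [0.1, 0.2, 0.3, 0.4]]

# One row per (distribution, bin); column names are illustrative only.
rows = [
    (i, j, b[j], b[j + 1], m[j])
    for i, (b, m) in enumerate(zip(bins, bin_mass))
    for j in range(len(m))
]
params = pd.DataFrame(
    rows, columns=["dist", "bin", "left", "right", "mass"]
).set_index(["dist", "bin"])

# Mean of each histogram: sum over its bins of mass * bin midpoint.
mid = (params["left"] + params["right"]) / 2
hist_mean = (mid * params["mass"]).groupby(level="dist").sum()
```

No padding is needed in this layout, since raggedness is absorbed by the row index. Whether building the frame is cheaper than padding would still have to be benchmarked.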
Describe the maintenance issue
The Histogram distribution implemented in #382 and #335 takes two parameters, `bins` and `bin_mass`, which consist of ragged arrays in the 2D case. These ragged arrays are stored in a `list`, as it is not possible to vectorize ragged inputs easily. One way to vectorize them is to bring them all to the same shape by padding `bin_mass` with `0`s and padding `bins` with `-np.inf` and `np.inf` on the left and right side.

The problem with this approach is that it is slower overall than the current approach. Although the running times of the methods `mean` and `var` improve by a factor of 5, the time spent padding the inputs in this way outweighs that gain. Refer here for more on the benchmarking of the Histogram distribution.
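For reference, the padding scheme described above could look roughly like the following sketch. The helper names `pad_histogram` and `padded_mean` are hypothetical and not part of the code base. How the padding is split between the left and right side is an assumption here, as is `bin_mass` being the probability mass of each bin with uniform density inside a bin.

```python
import numpy as np


def pad_histogram(bins, bin_mass):
    """Pad ragged bin edges and masses into rectangular arrays (sketch)."""
    n_edges = max(len(b) for b in bins)
    edges_out = np.empty((len(bins), n_edges))
    mass_out = np.zeros((len(bins), n_edges - 1))
    # This Python-level loop is the padding cost discussed above.
    for i, (b, m) in enumerate(zip(bins, bin_mass)):
        deficit = n_edges - len(b)
        left = deficit // 2      # assumption: split padding across both sides
        right = deficit - left
        edges_out[i] = np.concatenate(
            [np.full(left, -np.inf), b, np.full(right, np.inf)]
        )
        mass_out[i, left:left + len(m)] = m  # padded bins keep mass 0
    return edges_out, mass_out


def padded_mean(edges, mass):
    """Vectorized mean over all histograms at once."""
    mids = (edges[:, :-1] + edges[:, 1:]) / 2
    # Padded bins have infinite midpoints; zero them to avoid 0 * inf = nan.
    mids = np.where(mass > 0, mids, 0.0)
    return (mids * mass).sum(axis=1)


bins = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0, 6.0, 8.0]]
bin_mass = [[0.25, 0.75], [0.1, 0.2, 0.3, 0.4]]
edges, mass = pad_histogram(bins, bin_mass)
means = padded_mean(edges, mass)
```

The vectorized part is only `padded_mean`. `pad_histogram` still iterates over the ragged list row by row, which is where the overhead comes from.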
Taking the inputs from the user already in this vectorized form, padded with `0`s and `inf`s, does not seem like a good idea. It would be inconvenient for the user to pad them manually when the lengths of the inputs vary widely, and it would not allow `tuple` inputs in cases where the bins are of equal width.

The Histogram distribution inherits from `_BaseArrayDistribution`, which inherits from `BaseDistribution` and overrides some private functions to accommodate array distributions. Thoughts on ways of merging this with `BaseDistribution`, without having to create a separate base class for arrays, are also appreciated.