Empirical distribution refactor

YeungOnion commented 5 months ago

Was considering making a change to Empirical, albeit mostly out of interest (not because I've benched and find it lacking).

The current implementation of Empirical uses BTreeMap where entries have a float key and a observation count value. I thought it might be cool to leverage the structure of the tree with the fact that the CDF would have the same ordering as the keys (from $S$), i.e.

$$ \begin{align} \textrm{ if } F(y) - F(x) \geq \\, &0 \left\{ x,y \in S \right\} \\ \textrm{ then } (y-x) \geq \\,&0 \\ \end{align} $$

INLINE_EDIT: after thinking that map is a function (even if not one I could write down better than a table of sorts), I realize that this is a Tree Map with a monotone function for the map. This is much more clearly stated and easier to google about.

By making an entry's value to instead be a cumulative count of observations in lesser child nodes + observation count in the entry, then

CDF computations become accumulation over Tree search. Cost of search, could work without stack from recursion.
inverse CDF computations become a search where the termination condition is with the mapped value, but traversal works with key ordering. Nearly cost of search (assuming cost value comparison is same as cost of key comparison and memory for value comparison negligible)
PMF computations is difference of CDF, if search can emit traversal to found entry, then this could easily be optimized to better than two CDF calcuations.
key insertion will get more expensive, will have to modify all values that are ≥key in a BTree Node for each increasing depth branch step, rotations and node separation are trickier. If anyone has some good resources for writing proofs about B-trees, I need the support of better notation before I can get to a proof.

Other things that came to mind that make more visible changes

precompute mode at insertion as {count: u64, values: Vec<T>}
- note that our Mode trait allows only single mode distributions, which is infeasible to guarantee for Empirical, perhaps we should switch some of the statistics traits to use associated types for their returns.
use ordered_float::NotNan

YeungOnion commented 5 months ago

@dhardy would you happen to know of an implementation for prefix sums over a BTree? (I may also post this question to the rust lang user forum)

The motive beyond interest is in sampling from Empirical, the naive option to sample uniform through inverse cdf would be very expensive since inverse cdf is done via bisection and cdf requires visiting ≈1/2 the tree's nodes. (Improvement could certainly be made to memoize within a single call to inverse cdf).

Fenwick tree is great if I'd have a way to keep a positional index for data in the tree, but for the cost to update the prefix sum, $\mathcal O (\log N)$, I could also search for the key in the BTree and do the updates along larger values as nodes are traversed, $\mathcal O (k \log N)$, i.e. comparable. Would you be able to guess if this is a case where an implementation likely won't make those gains?

dhardy commented 5 months ago

@YeungOnion you can just iterate over a BTree. Efficient key-order iteration is one of its key properties.

statrs-dev / statrs

Empirical distribution refactor #245