Open YeungOnion opened 5 months ago
@dhardy would you happen to know of an implementation for prefix sums over a BTree? (I may also post this question to the rust lang user forum)
The motive beyond interest is in sampling from Empirical, the naive option to sample uniform through inverse cdf would be very expensive since inverse cdf is done via bisection and cdf requires visiting ≈1/2 the tree's nodes. (Improvement could certainly be made to memoize within a single call to inverse cdf).
Fenwick tree is great if I'd have a way to keep a positional index for data in the tree, but for the cost to update the prefix sum, $\mathcal O (\log N)$, I could also search for the key in the BTree and do the updates along larger values as nodes are traversed, $\mathcal O (k \log N)$, i.e. comparable. Would you be able to guess if this is a case where an implementation likely won't make those gains?
@YeungOnion you can just iterate over a BTree. Efficient key-order iteration is one of its key properties.
Was considering making a change to Empirical, albeit mostly out of interest (not because I've benched and find it lacking).
The current implementation of Empirical uses
BTreeMap
where entries have a float key and a observation count value. I thought it might be cool to leverage the structure of the tree with the fact that the CDF would have the same ordering as the keys (from $S$), i.e.$$ \begin{align} \textrm{ if } F(y) - F(x) \geq \\, &0 \left\{ x,y \in S \right\} \\ \textrm{ then } (y-x) \geq \\,&0 \\ \end{align} $$
INLINE_EDIT: after thinking that map is a function (even if not one I could write down better than a table of sorts), I realize that this is a Tree Map with a monotone function for the map. This is much more clearly stated and easier to google about.
By making an entry's value to instead be a cumulative count of observations in lesser child nodes + observation count in the entry, then
See also: prefix sums, Fenwick Tree Wiki
Other things that came to mind that make more visible changes
{count: u64, values: Vec<T>}
Mode
trait allows only single mode distributions, which is infeasible to guarantee forEmpirical
, perhaps we should switch some of the statistics traits to use associated types for their returns.ordered_float::NotNan