Closed erikerlandson closed 5 years ago
cc @willb @sophwats
this currently has a problem w/ the sbt publishing: https://github.com/sbt/sbt/issues/4453
I have a solution for publishing, see commit https://github.com/isarn/isarn-sketches/pull/12/commits/75e387fd7a04bd058c2024e7b3196f8a5c28606a above
@willb bump
A mutable version of the t-digest data structure, implemented in java.
The basic insight is that once a sufficient set of clusters is established, the frequency of cluster insertions drops off rapidly(*) Simply updating an existing cluster can be a very fast operation, if the clusters are stored in arrays. The trade-off is that inserting a cluster becomes O(n) instead of O(log n), however this trade pays off because simple updates are far more common than insertions. There is still the need to maintain a cumulative mass sum, however this also can be maintained in an array with O(log n) efficiency using a Fenwick Tree as long as clusters are not being inserted.
I am seeing 20 to 30x faster performance compared to my immutable version.
(*) This is assuming that the incoming distribution is stationary. If a distribution is drifting, or if data are arriving in some kind of non-random and nonmonotonic order, new clusters may be inserted much more frequently until and unless the distribution re stabilizes.