isarn / isarn-sketches

Sketching data structures for scala, including t-digest
Apache License 2.0
15 stars 5 forks source link

Fast tdigest #12

Closed erikerlandson closed 5 years ago

erikerlandson commented 6 years ago

A mutable version of the t-digest data structure, implemented in java.

The basic insight is that once a sufficient set of clusters is established, the frequency of cluster insertions drops off rapidly(*) Simply updating an existing cluster can be a very fast operation, if the clusters are stored in arrays. The trade-off is that inserting a cluster becomes O(n) instead of O(log n), however this trade pays off because simple updates are far more common than insertions. There is still the need to maintain a cumulative mass sum, however this also can be maintained in an array with O(log n) efficiency using a Fenwick Tree as long as clusters are not being inserted.

I am seeing 20 to 30x faster performance compared to my immutable version.

(*) This is assuming that the incoming distribution is stationary. If a distribution is drifting, or if data are arriving in some kind of non-random and nonmonotonic order, new clusters may be inserted much more frequently until and unless the distribution re stabilizes.

erikerlandson commented 6 years ago

cc @willb @sophwats

erikerlandson commented 5 years ago

this currently has a problem w/ the sbt publishing: https://github.com/sbt/sbt/issues/4453

erikerlandson commented 5 years ago

I have a solution for publishing, see commit https://github.com/isarn/isarn-sketches/pull/12/commits/75e387fd7a04bd058c2024e7b3196f8a5c28606a above

erikerlandson commented 5 years ago

@willb bump