twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 346 forks source link

Burst Detection Monoid #162

Open sritchie opened 11 years ago

sritchie commented 11 years ago

I'd like to see a monoid for detecting bursts in streams of data:

http://www.cs.cornell.edu/home/kleinber/bhs.pdf

/cc @johnynek

avibryant commented 11 years ago

Neat paper. Is this different from fitting a piecewise-constant model to the rate at which the emails come? (I guess, to the parameter of a poisson process generating the emails?) I've had good luck with that kind of model in the past, but not sure how to implement it as a monoid...

johnynek commented 11 years ago

Actually that's an excellent point, Avi. I bet you can equate this algorithm to fitting piecewise linear with a regularization parameter (since the exponential model gives a constant rate of arrival between each transition).

This probably can be done as a moniod in the same way that IO is a monad: you use the list Monoid along with two data types: BurstHistory, ArrivalTime. So you combine ArrivalTimes into a list, but when the list gets long enough, we fit a history to it and replace the fitted points with a history of rate changes.

I think something like this could work, but you don't really want to keep the whole history of slope changes, instead you want to output the ones that can't change the future anymore, and only keep the last point for further fitting.

Avi: have you seen (even approximate) methods for doing a streaming fit to the data to the piecewise linear fit?

avibryant commented 11 years ago

Re constant rate of arrival between each transition - that's why I said piecewise-constant, not piecewise-linear. Right? My experience is that this kind of piecewise-constant model is great on short time scales (like looking at a day or week of traffic), or sometimes on single keys (ie, all traffic from referrer X), but bad for long time scales of aggregate traffic (where you might see steady or exponential growth and this is going to model it as a staircase).

I've never seen a true streaming fit - it always involves buffering up a window, like you describe... once you have a window there are a million ways to do the fit.... this paper's cost function looks pretty plausible (in particular I like the asymmetry of it being expensive to change the rate upwards but free to change the rate downwards).

Somewhat related and maybe interesting: http://www2.research.att.com/~sen/pub/nad-imc03.pdf keeps a Count-Min-Sketch for each time period, and then shows how to combine a series of historical CMS to produce a CMS of the moving average, or the Holt-Winters forecast, etc; then you can use that as the basis for alerting on the current CMS.

(Actually they don't use min() as their estimator so it's not quite a CMS, they do something slightly fancier, but I think the data structure is identical).

sritchie commented 7 years ago

Here's a working link for that approach: https://www.cs.utexas.edu/~yzhang/papers/nad-imc03.pdf