influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Randomized batch evaluation time offset via Splay property #712

Open lpalm opened 8 years ago

lpalm commented 8 years ago

We have a bunch of alerts that get evaluated every X, where X is a multiple of 1 minute (could be as large as 24 hours). It looks like Kapacitor always evaluates the batch at the XX:XX:00.000 mark, so many alerts are evaluated simultaneously at the minute mark, followed by a long period of nothing. "Rounder" time intervals suffer more, e.g. at midnight every single alert is evaluated. This is not currently a problem, but in the interests of load balancing it would be great to have a way to randomly offset the initial start time of an alert that runs every X minutes by a value Y, where Y := random(0, 1.0) * X, so that the alert would be evaluated at times Y, X+Y, 2X+Y, etc. This way the read load against the database would be more smoothly distributed.

Is this currently possible?
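To make the request concrete, here is a minimal Python sketch of the offset scheme described above. It is only an illustration, not Kapacitor code: the function name `staggered_times` and its parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta


def staggered_times(start, every, count, rng=None):
    """Return `count` evaluation times Y, X+Y, 2X+Y, ... measured from `start`.

    Hypothetical helper: `every` plays the role of X, and the offset Y is
    drawn once as random[0, 1) * X.
    """
    rng = rng or random.Random()
    offset = every * rng.random()  # Y, chosen once and reused for every run
    return [start + offset + every * i for i in range(count)]


if __name__ == "__main__":
    midnight = datetime(2016, 1, 1)
    for t in staggered_times(midnight, timedelta(minutes=5), 3):
        print(t.isoformat())
```

Each alert would draw its own Y, so alerts sharing the same X no longer all fire at the same instant.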

nathanielc commented 8 years ago

Can you share an example TICKscript? By default the times are not aligned/rounded but can be made to be aligned via the align property.

lpalm commented 8 years ago

|query('''SELECT X FROM Y WHERE Z''')
    .period(5m)
    .every(5m)
    .align()

We have align() set in case we need to join multiple series. Removing align() should improve load distribution for simple alerts, and we'll do so. However, if we need to join data from multiple windows (we'll want to align them in this case, correct?), is there a way to stagger evaluation times?

nathanielc commented 8 years ago

You can use tolerance on the join node so you don't need to align the queries.

But really, as you point out, evaluation times and query time bounds do not need to be coupled.

Seems like adding a splay property that would spread out execution times over a given interval would be useful. For example:

|query('''SELECT X FROM Y WHERE Z''')
    .period(5m)
    .every(5m)
    .align()
    .splay(3m)

This would leave the time bounds on the queries unmodified, but the query would be run anywhere from 0-3m later. The splay would be chosen once per task start so that it's consistent but still random per query per task. How does that sound?
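A rough Python model of the proposed behaviour may help pin down the semantics. It assumes the delay is drawn uniformly from [0, splay) once at task start; the class `SplayedQuery` and its method are hypothetical and do not reflect how Kapacitor is or would be implemented.

```python
import random
from datetime import datetime, timedelta


class SplayedQuery:
    """Hypothetical model of a batch query with a splay; not Kapacitor code."""

    def __init__(self, period, splay, rng=None):
        self.period = period
        # Drawn once when the task starts, then reused for every evaluation.
        self.delay = splay * (rng or random.Random()).random()

    def evaluation(self, tick):
        """For an aligned tick, return (run_at, query_start, query_stop).

        The query bounds stay on the aligned tick; only the run time shifts.
        """
        return tick + self.delay, tick - self.period, tick


if __name__ == "__main__":
    q = SplayedQuery(period=timedelta(minutes=5), splay=timedelta(minutes=3))
    run_at, start, stop = q.evaluation(datetime(2016, 1, 1, 0, 5))
    print("run at", run_at, "querying", start, "to", stop)
```

In this model the query still covers 00:00 to 00:05, but it runs somewhere between 00:05 and 00:08, and by the same amount late on every subsequent tick.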

lpalm commented 8 years ago

That sounds like exactly what we want. Might make sense to validate that the splay amount is always smaller than the input to "every" too.

Thanks, Nathaniel!
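The validation suggested above could look something like the following Python sketch. The function name `validate_splay` is hypothetical, and rejecting negative values is an extra assumption not discussed in the thread.

```python
from datetime import timedelta


def validate_splay(every, splay):
    """Return `splay` if it is usable with `every`, else raise ValueError."""
    if splay < timedelta(0):
        # Assumption: a negative splay is treated as a configuration error.
        raise ValueError("splay must not be negative")
    if splay >= every:
        # The check proposed in the thread: splay strictly smaller than every.
        raise ValueError("splay must be smaller than every")
    return splay


if __name__ == "__main__":
    print(validate_splay(timedelta(minutes=5), timedelta(minutes=3)))
```

With `.every(5m)`, a splay of 3m passes, while a splay of 5m or more is rejected, which keeps a delayed run from spilling into the next interval.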