
Kapacitor #9

Open jjbuchan opened 3 years ago

jjbuchan commented 3 years ago

What is it and how does it work?

Open source framework for processing, monitoring, and alerting on time series data

https://github.com/influxdata/kapacitor https://docs.influxdata.com/kapacitor/v1.3/

Configuration

Interaction methods

Data sources

If we don't commit to using influx we probably wouldn't want to use Kapacitor. Influx will always be the optimal data source.

Tasks (Kapacitor's term for what we call alarms)

Alerting

Failure Scenarios

Debugging

Anomaly detection / Predictive modeling

Backups

# Create a backup.
curl http://localhost:9092/kapacitor/v1/storage/backup > kapacitor.db
# Restore a backup.
# The destination path is dependent on your configuration.
cp kapacitor.db ~/.kapacitor/kapacitor.db

Not sure if we'd use this since we'd probably want to reload tasks from Cassandra.

Known problems for us

Clustering

Unless we pay for the enterprise version, influxdb does not have clustering abilities. I performed a small test to see how telegraf/kapacitor handle a scenario with multiple dbs of the same name.

One telegraf instance was configured to send to two influxdb hosts. With the write consistency set to "any", it only ends up writing to one host at a time.

A streaming task was enabled in kapacitor to log the data it saw incoming:

stream
    |from()
        .measurement('logtailf')
    |log()

This would log a message on every telegraf write, regardless of which db host the data ended up in.

A batch task was then enabled to query influx for a count of how many metrics it had seen.

batch
    |query('SELECT * FROM "telegraf"."autogen"."logtailf"')
        .period(60m)
        .every(10s)
    |count('value')
    |log()

This task did not work as desired. On each run of the query it would use a different host and therefore return different results each time. The final count returned was about half of the expected value.

TODO: Determine whether streaming tasks are handled entirely in memory, or whether Kapacitor has to query influx after new data is received; i.e. do these tasks only get fed data from influx, or can they also reach back into influx when performing more complex evaluations?
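
As far as I understand, a stream task that aggregates over time buffers its own window of points in memory via a window node rather than re-querying influx. Something like this minimal sketch (reusing the logtailf measurement and value field from the test above; the threshold is arbitrary):

stream
    |from()
        .measurement('logtailf')
    |window()
        .period(5m)
        .every(1m)
    |mean('value')
    |alert()
        .crit(lambda: "mean" > 100)
        .log('/tmp/mean_alerts.log')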

Problem

Without clustering we are limited in how much we can make use of kapacitor tasks. It also means we have to develop our own methods for replication and failover, since we cannot configure clients to read/write with quorum consistency.

Other things to be aware of

jjbuchan commented 3 years ago

Evaluation

We must answer these questions to determine whether it is a viable option for the CEP in our v2 architecture.

Scaling

Alarm language

Alarming functionalities

Notification methods

Data generation

Interaction methods

Esper

jjbuchan commented 3 years ago

Option 1: Kafka Source -> Influxdb & Kapacitor

We first store metrics in kafka and Kapacitor streams directly from it. In this case we could partition on account (but use more than the 64 partitions we have right now). Kapacitor would stream from those partitions, while something sitting between kafka and influx would partition the data further.

This means kapacitor tasks would be able to handle cross-metric/check alarms for a single account with ease. However, if we had to generate new metrics and alert on them, we might have to push them back into the original kafka topic that we read from. That might not be ideal, especially since there isn't currently a kafkaOut node the way there is an httpOut and influxDBOut. Then again, I'm not sure the task would even need to read those derived metrics back from the feed, since it might store all its "windows" in memory when calculating averages over time.

For any cross-account checks we would have to consume from influxdb. We would probably have tasks configured per account that post averaged metrics, rather than running these checks on the raw data.
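
A rough sketch of what one of those per-account averaging tasks could look like, writing the derived metric with influxDBOut since there is no kafkaOut; the aggregates database, measurement name, and intervals are placeholders:

stream
    |from()
        .measurement('system')
    |window()
        .period(5m)
        .every(1m)
    |mean('load1')
        .as('load1_mean')
    |influxDBOut()
        .database('aggregates')
        .retentionPolicy('autogen')
        .measurement('system_averages')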

Assumptions for this to work

jjbuchan commented 3 years ago

Option 2: Kafka -> Influxdb -> Kapacitor

Kafka will only act as a buffer before the data enters influxdb. It may also feed data into the warehouser. Kapacitor will read directly from influx, which is more "natural" for that ecosystem.

In this case we would partition influxdbs by accountId:managedInputType. This would allow all existing alarms to be handled by tasks which query one of these dbs (since all current alarms are only relevant to a single check).
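
For example, a task for an existing single-check alarm could be scoped to one of those dbs directly in the from node (the account id, db name, and threshold below are made up):

stream
    |from()
        .database('12345:system')
        .retentionPolicy('autogen')
        .measurement('system')
    |alert()
        .crit(lambda: "load1" > 5)
        .log('/tmp/alerts.log')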

For cross-checkType alarms, we would add a task to the relevant db instance that filters out just the data needed for the alarm and forwards it to another db dedicated to "aggregated" alarms. The task with the main alarm logic would then only query the aggregated db instance.
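
A sketch of that filter-and-forward task, with made-up db names 12345:system and 12345:aggregated:

// Keep only the field the cross-checkType alarm needs
// and forward it to the aggregated db.
stream
    |from()
        .database('12345:system')
        .measurement('system')
    |eval(lambda: "load1")
        .as('load1')
    |influxDBOut()
        .database('12345:aggregated')
        .retentionPolicy('autogen')
        .measurement('system_filtered')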

Assumptions for this to work / potential problems

jjbuchan commented 3 years ago

Example: Cross Influxdb Task

We have two influxdb hosts: influxdb and influxdb2.

One telegraf instance is configured with the system input plugin to post to influxdb, and a second posts tail metrics to influxdb2.

Kapacitor configured to use both influxdb hosts:

[[influxdb]]
  urls = ["http://influxdb:8086","http://influxdb2:8086"]

A task can be created to combine both metrics into one alarm:

var influx1 = stream
    |from()
        .measurement('system')

var influx2 = stream
    |from()
        .measurement('logtailf')

influx1
    |join(influx2)
        .as('system', 'tailf')
        .tolerance(1s)
        .fill(0)
    |alert()
        .crit(lambda: TRUE)
        .log('/tmp/alerts.log')

In this case we would likely need to make use of isPresent() to ensure all relevant metrics are available before making any calculations/decisions.
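
For example, the alert above could guard on isPresent() before evaluating the joined fields (the field names and threshold are assumptions):

influx1
    |join(influx2)
        .as('system', 'tailf')
        .tolerance(1s)
        .fill(0)
    |alert()
        .crit(lambda: isPresent("system.load1") AND isPresent("tailf.value") AND "system.load1" > 5)
        .log('/tmp/alerts.log')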

As far as I'm aware, you cannot query a specific host within a task, only a specific database. If you do not specify the database in the from node, it picks one randomly. My hope for this simple case is that it is sensible enough to look through the hosts until it finds one with the required measurement. However, if we define database names like accountId:inputPluginName then we would be able to optimize things. (In this scenario each host uses the same telegraf db name.)

If we define database names like accountId:inputPluginName, it seems no more difficult a problem than having one db per account. In the task it means we additionally have to include the input plugin (measurement) name as part of the .database() value, and cross-check alarms will then always require joins. Everything else remains the same (I think).
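
A sketch of the earlier join under that naming scheme, with made-up db names for a single account:

var sys = stream
    |from()
        .database('12345:system')
        .measurement('system')

var tail = stream
    |from()
        .database('12345:tail')
        .measurement('logtailf')

sys
    |join(tail)
        .as('system', 'tailf')
        .tolerance(1s)
    |alert()
        .crit(lambda: TRUE)
        .log('/tmp/alerts.log')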