
Kapacitor #9

Open jjbuchan opened 3 years ago

jjbuchan commented 3 years ago

What is it and how does it work?

Open source framework for processing, monitoring, and alerting on time series data

https://github.com/influxdata/kapacitor https://docs.influxdata.com/kapacitor/v1.3/

Configuration

Interaction methods

Data sources

If we don't commit to using influx we probably wouldn't want to use Kapacitor. Influx will always be the optimal data source.

Tasks (Kapacitor's term for what we call alarms)

Alerting

Failure Scenarios

Debugging

Anomaly detection / Predictive modeling

Backups

# Create a backup.
curl http://localhost:9092/kapacitor/v1/storage/backup > kapacitor.db
# Restore a backup.
# The destination path is dependent on your configuration.
cp kapacitor.db ~/.kapacitor/kapacitor.db

Not sure if we'd use this since we'd probably want to reload tasks from Cassandra.

Known problems for us

Clustering

Unless we pay for the enterprise version, influxdb does not have clustering abilities. I performed a small test to see how telegraf/kapacitor handle a scenario with multiple dbs of the same name.

One telegraf instance was configured to send to two influxdb hosts. With the write consistency set to "any", it only ends up writing to one host at a time.

A streaming task was enabled in kapacitor to log the data it saw incoming:

stream
    |from()
        .measurement('logtailf')
    |log()

This would log a message on every telegraf write, regardless of which db host the data ended up in.

A batch task was then enabled to query influx for a count of how many metrics it had seen.

batch
    |query('SELECT * FROM "telegraf"."autogen"."logtailf"')
        .period(60m)
        .every(10s)
    |count('value')
    |log()

This task did not work as desired. On each run of the query it would use a different host and therefore return different results each time. The final count returned was about half of the expected value.

TODO: Determine whether streaming tasks are handled entirely in memory, or whether Kapacitor has to query influx after new data is received; i.e. do these tasks only get fed data from influx, or can they also reach back into influx when performing more complex evaluations?
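
As far as I understand, a stream task that aggregates over time buffers its own window of points in memory via a window node rather than re-querying influx. Something like this minimal sketch (reusing the logtailf measurement and value field from the test above; the threshold is arbitrary):

stream
    |from()
        .measurement('logtailf')
    |window()
        .period(5m)
        .every(1m)
    |mean('value')
    |alert()
        .crit(lambda: "mean" > 100)
        .log('/tmp/mean_alerts.log')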

Problem

Without clustering we are limited in how much we can make use of kapacitor tasks. It also means we have to develop our own methods for replication and failover, since we cannot configure clients to read/write with quorum consistency.

Other things to be aware of

jjbuchan commented 3 years ago

Evaluation

We must answer these questions to determine whether it is a viable option for the CEP in our v2 architecture.

Scaling

Alarm language

Alarming functionalities

Notification methods

Data generation

Interaction methods

Esper

jjbuchan commented 3 years ago

Option 1: Kafka Source -> Influxdb & Kapacitor

We first store metrics in kafka and Kapacitor streams directly from it. In this case we could partition on account (but use more than the 64 partitions we have right now). Kapacitor would stream from those partitions, while something sitting between kafka and influx would partition the data further.

This means kapacitor tasks would be able to handle cross-metric/check alarms for a single account with ease. However, if we had to generate new metrics and alert on them, we might have to push them back into the original kafka topic that we read from. That might not be ideal, especially since there isn't currently a kafkaOut node the way there is an httpOut and influxDBOut. Then again, I'm not sure the task would even need to read those derived metrics back from the feed, since it might store all its "windows" in memory when calculating averages over time.

For any cross-account checks we would have to consume from influxdb. We would probably have tasks configured per account that post averaged metrics, rather than running these checks on the raw data.
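
A rough sketch of what one of those per-account averaging tasks could look like, writing the derived metric with influxDBOut since there is no kafkaOut; the aggregates database, measurement name, and intervals are placeholders:

stream
    |from()
        .measurement('system')
    |window()
        .period(5m)
        .every(1m)
    |mean('load1')
        .as('load1_mean')
    |influxDBOut()
        .database('aggregates')
        .retentionPolicy('autogen')
        .measurement('system_averages')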

Assumptions for this to work

jjbuchan commented 3 years ago

Option 2: Kafka -> Influxdb -> Kapacitor

Kafka will only act as a buffer before the data enters influxdb. It may also feed data into the warehouser. Kapacitor will read directly from influx, which is more "natural" for that ecosystem.

In this case we would partition influxdbs by accountId:managedInputType. This would allow all existing alarms to be handled by tasks which query one of these dbs (since all current alarms are only relevant to a single check).
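
For example, a task for an existing single-check alarm could be scoped to one of those dbs directly in the from node (the account id, db name, and threshold below are made up):

stream
    |from()
        .database('12345:system')
        .retentionPolicy('autogen')
        .measurement('system')
    |alert()
        .crit(lambda: "load1" > 5)
        .log('/tmp/alerts.log')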

For cross-checkType alarms, we would add a task to the relevant db instance that filters out just the data needed for the alarm and forwards it to another db dedicated to "aggregated" alarms. The task with the main alarm logic would then only query the aggregated db instance.
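
A sketch of that filter-and-forward task, with made-up db names 12345:system and 12345:aggregated:

// Keep only the field the cross-checkType alarm needs
// and forward it to the aggregated db.
stream
    |from()
        .database('12345:system')
        .measurement('system')
    |eval(lambda: "load1")
        .as('load1')
    |influxDBOut()
        .database('12345:aggregated')
        .retentionPolicy('autogen')
        .measurement('system_filtered')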

Assumptions for this to work / potential problems

jjbuchan commented 3 years ago

Example: Cross Influxdb Task

We have two influxdb hosts: influxdb and influxdb2.

One telegraf instance is configured with the system input plugin to post to influxdb, and a second posts tail metrics to influxdb2.

Kapacitor configured to use both influxdb hosts:

[[influxdb]]
  urls = ["http://influxdb:8086","http://influxdb2:8086"]

A task can be created to combine both metrics into one alarm:

var influx1 = stream
    |from()
        .measurement('system')

var influx2 = stream
    |from()
        .measurement('logtailf')

influx1
    |join(influx2)
        .as('system', 'tailf')
        .tolerance(1s)
        .fill(0)
    |alert()
        .crit(lambda: TRUE)
        .log('/tmp/alerts.log')

In this case we would likely need to make use of isPresent() to ensure all relevant metrics are available before making any calculations/decisions.
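
For example, the alert above could guard on isPresent() before evaluating the joined fields (the field names and threshold are assumptions):

influx1
    |join(influx2)
        .as('system', 'tailf')
        .tolerance(1s)
        .fill(0)
    |alert()
        .crit(lambda: isPresent("system.load1") AND isPresent("tailf.value") AND "system.load1" > 5)
        .log('/tmp/alerts.log')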

As far as I'm aware, you cannot query a specific host within a task, only a specific database. If you do not specify the database in the from node, it picks one randomly. My hope for this simple case is that it is sensible enough to look through the hosts until it finds one with the required measurement. However, if we define database names like accountId:inputPluginName then we would be able to optimize things. (In this scenario each host uses the same telegraf db name.)

If we define database names like accountId:inputPluginName, it seems no more difficult a problem than having one db per account. In the task it means we additionally have to include the input plugin (measurement) name as part of the .database() value, and cross-check alarms will then always require joins. Everything else remains the same (I think).
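
A sketch of the earlier join under that naming scheme, with made-up db names for a single account:

var sys = stream
    |from()
        .database('12345:system')
        .measurement('system')

var tail = stream
    |from()
        .database('12345:tail')
        .measurement('logtailf')

sys
    |join(tail)
        .as('system', 'tailf')
        .tolerance(1s)
    |alert()
        .crit(lambda: TRUE)
        .log('/tmp/alerts.log')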