influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Kapacitor Errors #2344

Closed chitrarchitect closed 4 years ago

chitrarchitect commented 4 years ago

Alerts work when enabled, but Kapacitor simultaneously throws continuous errors, and the log drive is filling up!

lvl=error msg="failed to aggregate point in batch" service=kapacitor task_master=main task=CPU_IDLE_TIME_HIGH_WIN node=mean3 err="field Percent_Idle_Time missing from point cannot aggregate"

Another error: lvl=error msg="error evaluating expression" service=kapacitor task_master=main task=REPLICATION_SLOTS_NOT_ACTIVE node=eval3 err="missing value: \"pg_replication_slots_active\""

chitrarchitect commented 4 years ago

@russorat @nathanielc Need your help to resolve this tricky issue!

chitrarchitect commented 4 years ago

TICKscript for the Win CPU Idle alerts:

var db = 'telegraf'

var rp = 'autogen'

var measurement = 'win_cpu'

var groupBy = ['dc', 'env', 'host', 'instance']

var whereFilter = lambda: ("env" != 'prod') AND ("instance" == '_Total')

var period = 10m

var every = 30s

var name = 'cpu_idle_time_test_win'

var idVar = name + ':{{.Group}}'

var message = '{{.Level}} - {{ index .Tags "dc"}} {{ index .Tags "env"}} CPU Idle Time {{ index .Tags "host" }} is {{ index .Fields "value" | printf "%0.2f" }}%'

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var crit = 10

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(every)
        .align()
    |mean('Percent_Idle_Time')
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" < crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .slack()
        .channel('testAlarms')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

chitrarchitect commented 4 years ago

Can anyone look at this? I am stuck with these selective alert rules!!! :(

chitrarchitect commented 4 years ago

I fixed the error by adding the isPresent function to the lambda expression.

IsPresent: Returns a Boolean value based on whether the specified field or tag key is present. Useful for filtering out data that is missing the specified field or tag. This returns TRUE if myfield is a valid identifier and FALSE otherwise.

Here, myfield is pg_replication_slots_status

var whereFilter = lambda: ("dc" == 'test') AND ("env" == 'test') AND isPresent("pg_replication_slots_status")

And it worked like a charm!
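The first error (node=mean3, "field Percent_Idle_Time missing from point cannot aggregate") looks like the same kind of problem, so the same guard should apply to the Win CPU Idle task. A minimal sketch, assuming that points without Percent_Idle_Time can simply be dropped before they reach the window; only the whereFilter line of the script posted earlier changes:

```
// Sketch only: skip points that do not carry the field, so that
// |mean('Percent_Idle_Time') never sees a point it cannot aggregate.
// Assumes that dropping such points is acceptable for this alert.
var whereFilter = lambda: ("env" != 'prod') AND ("instance" == '_Total') AND isPresent("Percent_Idle_Time")
```

Because the filter runs in the from() node, the incomplete points never enter the 10m window, which should also stop the repeated error lines from filling the log drive.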