Closed chitrarchitect closed 4 years ago
@russorat @nathanielc Need your help to resolve this tricky issue!
Tickscript for Win CPU Idle alerts:
var db = 'telegraf'
var rp = 'autogen'
var measurement = 'win_cpu'
var groupBy = ['dc', 'env', 'host', 'instance']
var whereFilter = lambda: ("env" != 'prod') AND ("instance" == '_Total')
var period = 10m
var every = 30s
var name = 'cpu_idle_time_test_win'
var idVar = name + ':{{.Group}}'
var message = '{{.Level}} - {{ index .Tags "dc"}} {{ index .Tags "env"}} CPU Idle Time {{ index .Tags "host" }} is {{ index .Fields "value" | printf "%0.2f" }}%'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'
var crit = 10
var data = stream |from() .database(db) .retentionPolicy(rp) .measurement(measurement) .groupBy(groupBy) .where(whereFilter) |window() .period(period) .every(every) .align() |mean('Percent_Idle_Time') .as('value')
var trigger = data |alert() .crit(lambda: "value" < crit) .message(message) .id(idVar) .idTag(idTag) .levelTag(levelTag) .messageField(messageField) .durationField(durationField) .slack() .channel('testAlarms')
trigger |eval(lambda: float("value")) .as('value') .keep() |influxDBOut() .create() .database(outputDB) .retentionPolicy(outputRP) .measurement(outputMeasurement) .tag('alertName', name) .tag('triggerType', triggerType)
trigger |httpOut('output')
Can anyone look at this as I am stuck with this selective alert rules!!! :(
I fixed the error by adding Lambda expression isPresent
IsPresent: Returns a Boolean value based on whether the specified field or tag key is present. Useful for filtering out data this is missing the specified field or tag. This returns TRUE if myfield is a valid identifier and FALSE otherwise.
Here, myfield is pg_replication_slots_status
var whereFilter = lambda: ("dc" == 'test') AND ("env" == 'test') AND isPresent("pg_replication_slots_status")
And it worked like charm!
Alerts work when enabled, but simultaneously throws continuous errors and the logs drive is getting full!
lvl=error msg="failed to aggregate point in batch" service=kapacitor task_master=main task=CPU_IDLE_TIME_HIGH_WIN node=mean3 err="field Percent_Idle_Time missing from point cannot aggregate"
Another Error- lvl=error msg="error evaluating expression" service=kapacitor task_master=main task=REPLICATION_SLOTS_NOT_ACTIVE node=eval3 err="missing value: \"pg_replication_slots_active\""