influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Alerting misbehaviour #2431

Closed by m4ce 3 years ago

m4ce commented 3 years ago
    var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .where(whereFilter)
    |eval(lambda: "pod_name")
        .as('pod_name')
        .tags('pod_name')
        .keep()
    |groupBy(groupBy)
    |barrier()
        .idle(1h)
        .delete(TRUE)
    |difference('restarts_total')
        .as('restarts')
    |window()
        .period(1m)
        .every(1m)
    |sum('restarts')
        .as('restarts_last_min')
    |stateCount(lambda: "restarts_last_min" == 0)
        .as('warnResetCount')

    var trigger = data
        |alert()
            .warn(lambda: "restarts_last_min" > 1)
            .warnReset(lambda: "warnResetCount" >= 5)
            .message(message)
            .id(idVar)
            .idTag(idTag)
            .levelTag(levelTag)
            .messageField(messageField)
            .durationField(durationField)
            .log('/tmp/alerts.log')

The warn condition is quite clear: it should only send a warning if more than one restart has occurred in the last minute, and recover after 5m of no restarts.

Seems like the alert is generated even when restarts_last_min == 0.

{"id":"Infra-Kubernetes-ContainerRestarting1","message":"*[ec1|monitoring|testing-exit-code-139-5b998b8484-mvk54|testing-exit-code-139]* - Container has restarted 0 time(s) in the last minute","details":"{\u0026#34;Name\u0026#34;:\u0026#34;kubernetes_pod_container_status\u0026#34;,\u0026#34;TaskName\u0026#34;:\u0026#34;Infra-Kubernetes-ContainerRestarting1\u0026#34;,\u0026#34;Group\u0026#34;:\u0026#34;cluster_name=ec1,container_name=testing-exit-code-139,namespace=monitoring,pod_name=testing-exit-code-139-5b998b8484-mvk54\u0026#34;,\u0026#34;Tags\u0026#34;:{\u0026#34;cluster_name\u0026#34;:\u0026#34;ec1\u0026#34;,\u0026#34;container_name\u0026#34;:\u0026#34;testing-exit-code-139\u0026#34;,\u0026#34;namespace\u0026#34;:\u0026#34;monitoring\u0026#34;,\u0026#34;pod_name\u0026#34;:\u0026#34;testing-exit-code-139-5b998b8484-mvk54\u0026#34;},\u0026#34;ServerInfo\u0026#34;:{\u0026#34;Hostname\u0026#34;:\u0026#34;kapacitor\u0026#34;,\u0026#34;ClusterID\u0026#34;:\u0026#34;78c68c2a-9227-4cef-be5b-90eb514b4749\u0026#34;,\u0026#34;ServerID\u0026#34;:\u0026#34;26537c4f-1381-436e-b681-89e596ada339\u0026#34;},\u0026#34;ID\u0026#34;:\u0026#34;Infra-Kubernetes-ContainerRestarting1\u0026#34;,\u0026#34;Fields\u0026#34;:{\u0026#34;restarts_last_min\u0026#34;:0,\u0026#34;warnResetCount\u0026#34;:2},\u0026#34;Level\u0026#34;:\u0026#34;WARNING\u0026#34;,\u0026#34;Time\u0026#34;:\u0026#34;2020-11-13T12:59:40Z\u0026#34;,\u0026#34;Duration\u0026#34;:799000000000,\u0026#34;Message\u0026#34;:\u0026#34;*[ec1|monitoring|testing-exit-code-139-5b998b8484-mvk54|testing-exit-code-139]* - Container has restarted 0 time(s) in the last minute\u0026#34;}\n","time":"2020-11-13T12:59:40Z","duration":799000000000,"level":"WARNING","data":{"series":[{"name":"kubernetes_pod_container_status","tags":{"cluster_name":"ec1","container_name":"testing-exit-code-139","namespace":"monitoring","pod_name":"testing-exit-code-139-5b998b8484-mvk54"},"columns":["time","restarts_last_min","warnResetCount"],"values":[["2020-11-13T12:59:40Z",0,2]]}]},"previousLevel":"WARNING","recoverable":true}

Check the columns - restarts_last_min = 0 but the level is WARNING.

Not sure how this is possible?

m4ce commented 3 years ago

It seems that when the current level is WARNING and the warnReset condition is not met, the alert ignores the warn condition (even when it is not met) and triggers a new warning. If I remove the warnReset or add stateChangesOnly, it stops sending a new warning when the condition is not met, which is correct. However, I then have no way of defining the warnReset. This seems like a bug. I would expect the following behaviour:

Current level: OK
- warn condition is met -> send warning
- warnReset condition is NOT met -> do nothing

Current level: WARNING
- warn condition is NOT met -> do nothing
- warnReset condition is NOT met -> do nothing (-> this instead seems to trigger a new warning!)

Mind that I don't want to use stateChangesOnly. I want to be able to send a warning if a value is INCREASING between intervals and only reset when it's been stable for the past 5m. If the value does not increase, no warning should be sent out. However, kapacitor still sends it.

m4ce commented 3 years ago

@nathanielc - would you be able to confirm the above?

docmerlin commented 3 years ago

@m4ce, this is working as intended.

> The warn condition is quite clear: it should only send a warning if more than one restart has occurred in the last minute, and recover after 5m of no restarts.
>
> Seems like the alert is generated even when restarts_last_min == 0.

That is not how it works. .warn defines a condition that puts the alert into the warn state, and the alert keeps sending warnings for as long as it remains in that state. .warnReset clears the state. If you do not specify a .warnReset or a different state change, it stops sending warnings when the .warn lambda is no longer true.
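
To make the distinction above concrete, here is a minimal sketch of the two alert variants side by side. It assumes `data` is the pipeline defined in the original task; the remaining alert properties (message, id, and so on) are left out for brevity, and the comments only restate the behaviour described in this thread.

```
// Variant A: with .warnReset.
// Once "restarts_last_min" > 1 has matched, the alert stays in the warn state
// and keeps sending warnings for every point, until the .warnReset lambda
// ("warnResetCount" >= 5) evaluates to true.
data
    |alert()
        .warn(lambda: "restarts_last_min" > 1)
        .warnReset(lambda: "warnResetCount" >= 5)
        .log('/tmp/alerts.log')

// Variant B: without .warnReset.
// The warnings stop as soon as the .warn lambda is no longer true.
data
    |alert()
        .warn(lambda: "restarts_last_min" > 1)
        .log('/tmp/alerts.log')
```

In Variant A, a point with restarts_last_min = 0 and warnResetCount = 2 (as in the log line above) still produces a WARNING event, because the reset condition has not yet been met.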

> Mind that I don't want to use stateChangesOnly. I want to be able to send a warning if a value is INCREASING between intervals and only reset when it's been stable for the past 5m. If the value does not increase, no warning should be sent out. However, kapacitor still sends it.

I suggest using .nonNegativeDerivative to detect whether the value is increasing, and alerting on that.
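
A rough sketch of what that suggestion could look like follows. It assumes the suggestion refers to Kapacitor's `|derivative()` node with its `.nonNegative()` property, reuses the template variables from the original task (db, rp, measurement, whereFilter, groupBy), and introduces a hypothetical field name, `restart_rate`. It is not a tested replacement for the original task.

```
// Sketch only: restart_rate is a hypothetical field name.
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .where(whereFilter)
    |groupBy(groupBy)
    // Change in restarts_total per minute. With .nonNegative(), points where
    // the counter went down (for example after a counter reset) are dropped.
    |derivative('restarts_total')
        .unit(1m)
        .nonNegative()
        .as('restart_rate')

data
    |alert()
        // Warn only while the counter is increasing.
        .warn(lambda: "restart_rate" > 0.0)
        .log('/tmp/alerts.log')
```

Because there is no .warnReset here, the warnings stop as soon as the counter stops increasing. This sketch does not by itself implement the "recover only after 5m of stability" requirement from the original question.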