prometheus / alertmanager

Alertmanager pod msg="dropping messages because too many are queued" #2440

Open nmizeb opened 3 years ago

nmizeb commented 3 years ago

hello,

What did you do?

I'm using Alertmanager in a Kubernetes pod; it's connected to Prometheus, Karma, and Kthnxbye to acknowledge alerts.

What did you expect to see?

normal memory usage as before

What did you see instead?

Recently, the memory usage graph of Alertmanager has been increasing linearly. In the Alertmanager logs I see this message:

level=warn ts=2020-12-17T09:32:04.281Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4100 limit=4096

The code that emits this message:

// handleQueueDepth ensures that the queue doesn't grow unbounded by pruning
// older messages at regular interval.
func (d *delegate) handleQueueDepth() {
    for {
        select {
        case <-d.stopc:
            return
        case <-time.After(15 * time.Minute):
            n := d.bcast.NumQueued()
            if n > maxQueueSize {
                level.Warn(d.logger).Log("msg", "dropping messages because too many are queued", "current", n, "limit", maxQueueSize)
                d.bcast.Prune(maxQueueSize)
                d.messagesPruned.Add(float64(n - maxQueueSize))
            }
        }
    }
}

Please note that nothing changed on our side that would justify this increase.

Environment:
Alertmanager: v0.21.0
Prometheus: v2.18.2

simonpasquier commented 3 years ago

It would mean that your instance can't keep up with replicating data with its peers. The alertmanager_cluster_health_score metric would tell you about your cluster's health (the lower the better, 0 if everything's fine). You can look at the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics too. You may have to tune the --cluster.* CLI flags.
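
For reference, here are example PromQL queries for the metrics mentioned above (a minimal sketch; the 5m range is an arbitrary choice, not something from this thread):

# Gossip health per Alertmanager instance (0 means healthy, higher is worse)
alertmanager_cluster_health_score

# Current depth of the gossip message queue (a gauge; pruning kicks in above the 4096 limit)
alertmanager_cluster_messages_queued

# Rate at which queued messages are being pruned (a counter)
rate(alertmanager_cluster_messages_pruned_total[5m])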

nmizeb commented 3 years ago

Thank you @simonpasquier. The alertmanager_cluster_health_score value has been 0 the whole time since the cluster started. On the other hand, the alertmanager_cluster_messages_queued and alertmanager_cluster_messages_pruned_total metrics show a linear increase. Is this normal behavior?

baryluk commented 2 years ago

I am noticing the same issue on a single-instance Alertmanager (version 0.23.0):

alertmanager_cluster_alive_messages_total{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_enabled 1
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 1
alertmanager_cluster_messages_pruned_total 973
alertmanager_cluster_messages_queued 4100
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_size_total{msg_type="update"} 0
alertmanager_cluster_messages_received_total{msg_type="full_state"} 0
alertmanager_cluster_messages_received_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 0
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 0
alertmanager_cluster_messages_sent_total{msg_type="update"} 0
alertmanager_cluster_peer_info{peer="01FKNQEQMADHPQF9HNAWV169DP"} 1
alertmanager_cluster_peers_joined_total 1
alertmanager_cluster_peers_left_total 0
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 0
alertmanager_cluster_reconnections_total 0
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 0

Adding --cluster.listen-address= (an empty value) to the command line works as a workaround; a sketch follows.
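
This is a minimal sketch of that workaround for a single instance (the config and storage paths are placeholders, not taken from this thread); an empty --cluster.listen-address disables the cluster/gossip mode:

alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/alertmanager \
  --cluster.listen-address=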

KeyanatGiggso commented 1 year ago

Is there any fix for this? @simonpasquier