thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

ruler send alerts to all alertmanager #5275

Open Aaron199 opened 2 years ago

Aaron199 commented 2 years ago

Thanos, Prometheus and Golang version used: thanos:v0.24.0 prometheus:v2.26.0

Object Storage Provider: Huawei OBS

What happened: I use the config "--alertmanagers.url=dns+http://alertmanager-main.monitoring:9093". "alertmanager-main" has 3 pods, but only one Alertmanager receives the alerts. This differs from Prometheus, whose alerts show up in every Alertmanager replica.

image:  thanosio/thanos:v0.24.0
containers:
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/rules/*rules.yaml
        - --objstore.config-file=/etc/bucket.yml
        - --data-dir=/var/rule
        - --label=rule_replica="$(NAME)"
        - --alert.label-drop=rule_replica
        - --alertmanagers.url=dns+http://alertmanager-main.monitoring:9093
        - --query=http://thanos-querier-total.monitoring:10902

What you expected to happen: Thanos ruler should behave like Prometheus, with alerts delivered to every Alertmanager replica.

How to reproduce it (as minimally and precisely as possible): Using the following config reproduces it:

image:  thanosio/thanos:v0.24.0
containers:
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/rules/*rules.yaml
        - --objstore.config-file=/etc/bucket.yml
        - --data-dir=/var/rule
        - --label=rule_replica="$(NAME)"
        - --alert.label-drop=rule_replica
        - --alertmanagers.url=dns+http://alertmanager-main.monitoring:9093
        - --query=http://thanos-querier-total.monitoring:10902

Full logs to relevant components: I can't find any logs related to this.

Anything else we need to know:

wiardvanrij commented 2 years ago

Could you fill in the other topics;

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know:

? I'm not sure what you want to achieve at the moment. Thanks!

Aaron199 commented 2 years ago

Sorry about that, I have already updated my comment.

wiardvanrij commented 2 years ago

ty! :)

NissesSenap commented 2 years ago

I think I have the same issue:

  containers:
  - args:
    - rule
    - --data-dir=/thanos/data
    - --rule-file=/etc/thanos/rules/*/*.yaml
    - --query=dnssrv+_http._tcp.xks-query-frontend.monitor.svc.cluster.local
    - --alertmanagers.url=http://alertmanager.monitor.svc.cluster.local:9093
    - --remote-write.config-file=/tmp/config/rw-config.yaml

Using: quay.io/thanos/thanos:v0.25.2

For now I will lower the number of replicas to 1.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

rchien-atvi commented 2 years ago

I'm investigating a similar situation. Looking at the ruler page, specifically the --alertmanagers.url section, the behaviour described in this issue may be expected.

Alertmanager replica URLs to push firing alerts. Ruler claims success if push to at least one alertmanager from discovered succeeds

I would have thought the alertmanager cluster would reconcile the alerts amongst the replicas, but I'm guessing not.

leotomas837 commented 1 year ago

@Aaron199 Did you make the Alertmanager service headless, i.e. clusterIP: None?

I suspect this is due to how routing works for ClusterIP services: traffic always goes to the same pod (as long as it is available), and a DNS lookup on the service name returns only one DNS entry. (You could change the service type to LoadBalancer, or route by other means, for example with Istio.)

For Thanos' DNS lookup to work, it needs the pod IPs directly. That way it reaches each pod, not only one pod as explained above.

To make a service headless, set clusterIP: None on a service of type ClusterIP; see the doc.

Explanation of headless services with DNS lookup comparison here.
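
As a rough illustration of the headless setup described above, a Service manifest might look like the sketch below. The service name and namespace are taken from the original report; the selector label and the port name are assumptions and need to match the actual Alertmanager pods.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-main
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None        # headless: DNS returns one A record per ready pod
  selector:
    app.kubernetes.io/name: alertmanager   # hypothetical label, adjust to your pods
  ports:
    - name: web          # hypothetical port name
      port: 9093
      targetPort: 9093
```

With a headless service, the dns+ lookup in --alertmanagers.url=dns+http://alertmanager-main.monitoring:9093 should resolve to the individual pod IPs instead of a single virtual IP. Note that clusterIP cannot be changed on an existing Service, so an existing one would likely have to be deleted and recreated.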

thorker commented 5 months ago

The trick is to configure DNS service discovery in --alertmanagers.url, like "dnssrv+_http-web._tcp.alertmanager-operated:9093".
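
A sketch of how that could look in the ruler container args, in the same style as the configs earlier in this thread. The namespace "monitoring" and the port name "http-web" are assumptions, and the explicit http:// scheme after the dnssrv+ prefix follows my reading of the flag's help text, so please verify it against the ruler documentation for your version.

```yaml
containers:
  - args:
      - rule
      # SRV lookup against the headless service: one target per Alertmanager pod,
      # with the port taken from the SRV record.
      - --alertmanagers.url=dnssrv+http://_http-web._tcp.alertmanager-operated.monitoring.svc.cluster.local
```

This only returns one record per pod if the target service (here alertmanager-operated) is headless.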