grafana / oncall

Developer-friendly incident response with brilliant Slack integration
GNU Affero General Public License v3.0

Alertmanager integration not working #888

Closed unfeeling91 closed 1 year ago

unfeeling91 commented 1 year ago

There is a problem:

I configured the Alertmanager integration, and in the Alertmanager logs I see:

    ts=2022-11-22T06:12:07.052Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1
    ts=2022-11-22T06:12:11.264Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1
    ts=2022-11-22T06:12:11.264Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1
    ts=2022-11-22T06:12:11.330Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1
    ts=2022-11-22T06:12:23.351Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1

Some alerts are in the firing state.

But they are not showing up in alert groups on the OnCall plugin page. What could be the reason for it?
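
For context, the receiver=grafana_oncall in those logs corresponds to an Alertmanager webhook receiver pointing at the OnCall integration URL. A minimal sketch of what that receiver typically looks like; the host and <secret_integration_token> are placeholders, not values taken from this issue:

    route:
      receiver: grafana_oncall
    receivers:
      - name: grafana_oncall
        webhook_configs:
          # OnCall Alertmanager integration URL (host and token are placeholders)
          - url: http://<oncall-host>/integrations/v1/alertmanager/<secret_integration_token>/
            send_resolved: true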

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 Could you provide more details? Not a single group was generated from these alerts?

unfeeling91 commented 1 year ago

> @unfeeling91 Could you provide more details? Not a single group was generated from these alerts?

Hi, the alerts are in multiple groups, deployed via the alertmanager section of kube-prometheus-stack in k8s:

    additionalPrometheusRules:
      - name: rules
        groups:
          - name: meta
            rules:
              - alert: heartbeat
                expr: vector(1)
                labels:
                  severity: none
                annotations:
                  description: This is heartbeat alert
                  summary: Alerting Amixr
          - name: kubernetes.rules
            rules:
              - alert: KubePodCrashLooping
                expr: |
                  max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
                for: 15m
                labels:
                  severity: warning
                annotations:
                  description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").'
                  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
                  summary: Pod is crash looping.

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 Thanks. What's going on on the OnCall side? Was not a single alert group created in OnCall? (I'm asking to make sure that alerts are not being grouped on the OnCall side.) Also, if you are using the cloud version, please send a link to your integration.

unfeeling91 commented 1 year ago

> @unfeeling91 Thanks. What's going on on the OnCall side? Was not a single alert group created in OnCall? (I'm asking to make sure that alerts are not being grouped on the OnCall side.) Also, if you are using the cloud version, please send a link to your integration.

I am not using the cloud version. On the OnCall side I don't see anything strange in the logs:

    2022-11-24 05:59:54 source=engine:app google_trace_id=none logger=root inbound latency=0.226657 status=200 method=POST path=/integrations/v1/alertmanager

[screenshot]

but nothing is showing up in alert groups in the UI.

unfeeling91 commented 1 year ago

Resolved one - it is the alert that I sent via the UI by pressing the "Send test alert" button.

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 do you have your integration token in the path? I mean, in the logs it looks like path=/integrations/v1/alertmanager/<secret_integration_token>, right? If so, could you please send the celery container logs.
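
One way to pull those celery container logs, assuming a typical helm install where the celery workers run under a deployment named oncall-celery (the deployment name is an assumption, adjust to your release):

    # dump recent celery worker logs
    kubectl logs deploy/oncall-celery --since=1h
    # or follow a specific celery pod
    kubectl logs -f <oncall-celery-pod-name>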

unfeeling91 commented 1 year ago

@Konstantinov-Innokentii correct, the token is in place. In celery I see info like the following, no errors:

    2022-11-24 07:30:50,364 source=engine:celery task_id=a86c0f90-e88d-43b7-b656-79c6126440fa task_name=apps.schedules.tasks.refresh_ical_files.start_refresh_ical_files name=celery.app.trace level=INFO Task apps.schedules.tasks.refresh_ical_files.start_refresh_ical_files[a86c0f90-e88d-43b7-b656-79c6126440fa] succeeded in 0.01594165700225858s: None
    2022-11-24 07:30:50,364 source=engine:celery task_id=??? task_name=??? name=celery.worker.strategy level=INFO Task apps.schedules.tasks.refresh_ical_files.refresh_ical_file[2f07c05c-058d-4e25-8b2d-c130714c00f8] received
    2022-11-24 07:30:50,365 source=engine:celery task_id=??? task_name=??? name=celery.worker.strategy level=INFO Task apps.slack.tasks.start_update_slack_user_group_for_schedules[396bbc9f-302d-4311-8c9b-701d3e52c8a3] received
    2022-11-24 07:30:50,377 source=engine:celery task_id=1f123c78-63ab-4581-8a71-64fbf5af9504 task_name=apps.heartbeat.tasks.restore_heartbeat_tasks name=celery.app.trace level=INFO Task apps.heartbeat.tasks.restore_heartbeat_tasks[1f123c78-63ab-4581-8a71-64fbf5af9504] succeeded in 0.009998007000831421s: None
    2022-11-24 07:30:50,379 source=engine:celery task_id=2f07c05c-058d-4e25-8b2d-c130714c00f8 task_name=apps.schedules.tasks.refresh_ical_files.refresh_ical_file name=apps.schedules.tasks.refresh_ical_files level=INFO Refresh ical files for schedule 2
    2022-11-24 07:30:50,434 source=engine:celery task_id=2f07c05c-058d-4e25-8b2d-c130714c00f8 task_name=apps.schedules.tasks.refresh_ical_files.refresh_ical_file name=apps.schedules.tasks.refresh_ical_files level=INFO run_task_primary 2 False icals not equal
    2022-11-24 07:30:50,435 source=engine:celery task_id=2f07c05c-058d-4e25-8b2d-c130714c00f8 task_name=apps.schedules.tasks.refresh_ical_files.refresh_ical_file name=celery.app.trace level=INFO Task apps.schedules.tasks.refresh_ical_files.refresh_ical_file[2f07c05c-058d-4e25-8b2d-c130714c00f8] succeeded in 0.05589042399878963s: None
    2022-11-24 07:30:54,550 source=engine:celery task_id=??? task_name=??? name=celery.worker.strategy level=INFO Task apps.heartbeat.tasks.process_heartbeat_task[1b877165-b60d-4919-a38d-e35fdb87525e] received
    2022-11-24 07:30:54,563 source=engine:celery task_id=1b877165-b60d-4919-a38d-e35fdb87525e task_name=apps.heartbeat.tasks.process_heartbeat_task name=apps.heartbeat.tasks level=INFO IntegrationHeartBeat selected for alert_receive_channel 8 in 0.01161727399812662
    2022-11-24 07:30:54,564 source=engine:celery task_id=1b877165-b60d-4919-a38d-e35fdb87525e task_name=apps.heartbeat.tasks.process_heartbeat_task name=apps.heartbeat.tasks level=INFO heartbeat_checkup task started for alert_receive_channel 8 in 0.013139029997546459
    2022-11-24 07:30:54,564 source=engine:celery task_id=1b877165-b60d-4919-a38d-e35fdb87525e task_name=apps.heartbeat.tasks.process_heartbeat_task name=apps.heartbeat.tasks level=INFO state checked for alert_receive_channel 8 in 0.013271511998027563
    2022-11-24 07:30:54,566 source=engine:celery task_id=??? task_name=??? name=celery.worker.strategy level=INFO Task apps.heartbeat.tasks.integration_heartbeat_checkup[83ffd34b-4c2b-4aa3-bcff-68ccefde757e] received
    2022-11-24 07:30:54,569 source=engine:celery task_id=1b877165-b60d-4919-a38d-e35fdb87525e task_name=apps.heartbeat.tasks.process_heartbeat_task name=celery.app.trace level=INFO Task apps.heartbeat.tasks.process_heartbeat_task[1b877165-b60d-4919-a38d-e35fdb87525e] succeeded in 0.01825634499982698s: None
    2022-11-24 07:30:57,236 source=engine:celery task_id=6163de36-e9e0-4eea-b170-cd425b453f7e task_name=apps.heartbeat.tasks.integration_heartbeat_checkup name=apps.heartbeat.models level=INFO Heartbeat 7 is not actual 6163de36-e9e0-4eea-b170-cd425b453f7e
    2022-11-24 07:30:57,239 source=engine:celery task_id=6163de36-e9e0-4eea-b170-cd425b453f7e task_name=apps.heartbeat.tasks.integration_heartbeat_checkup name=celery.app.trace level=INFO Task apps.heartbeat.tasks.integration_heartbeat_checkup[6163de36-e9e0-4eea-b170-cd425b453f7e] succeeded in 0.014365891001943965s: None

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 do you see create_alertmanager_alerts tasks in the logs when you are receiving alerts?

unfeeling91 commented 1 year ago

Grepping for create_alertmanager_alerts:

    kubectl logs oncall-celery-6d678c8bf7-jhn6w | grep create_alertmanager_alerts

gives no output.

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 what you can do to further debug the problem:

  1. Make a curl request to the URL of your integration, with a payload emulating the Alertmanager payload (see the sketch below).
  2. Try to create an integration of another type (e.g. Webhook) and test whether it works via curl (it will accept any payload).
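
A sketch of step 1, posting an Alertmanager-style webhook payload to the integration URL; the host and <secret_integration_token> are placeholders, and the payload shape follows the standard Alertmanager webhook format (version 4), reusing the heartbeat alert from earlier in this thread:

    # emulate an Alertmanager webhook POST to the OnCall integration (placeholders in <>)
    curl -X POST "http://<oncall-host>/integrations/v1/alertmanager/<secret_integration_token>/" \
      -H "Content-Type: application/json" \
      -d '{
        "version": "4",
        "status": "firing",
        "receiver": "grafana_oncall",
        "groupKey": "{}:{alertname=\"heartbeat\"}",
        "groupLabels": {"alertname": "heartbeat"},
        "commonLabels": {"alertname": "heartbeat", "severity": "none"},
        "commonAnnotations": {"summary": "Alerting Amixr"},
        "externalURL": "http://alertmanager:9093",
        "alerts": [
          {
            "status": "firing",
            "labels": {"alertname": "heartbeat", "severity": "none"},
            "annotations": {"description": "This is heartbeat alert", "summary": "Alerting Amixr"},
            "startsAt": "2022-11-24T07:30:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": ""
          }
        ]
      }'
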
Konstantinov-Innokentii commented 1 year ago

@unfeeling91 And could you please share an example of the payload which AM sends to OnCall?

unfeeling91 commented 1 year ago

> @unfeeling91 And could you please share an example of the payload which AM sends to OnCall?

How can I see this payload in Alertmanager? For now, I only see entries like:

    level=debug component=dispatcher receiver=grafana_oncall integration=webhook[0] msg="Notify success" attempts=1
    ts=2022-11-22T06:12:11.264Z caller=notify.go:743 level=debug component=dispatcher receiver=grafana_oncall

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 You can use https://webhook.site/: send the alert to that site and check the payload there.
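
One way to do that is to temporarily point an Alertmanager webhook receiver at webhook.site; a sketch, where the receiver name and the unique URL are placeholders (webhook.site generates the URL for you):

    receivers:
      - name: debug_webhook
        webhook_configs:
          # temporary receiver used only to inspect the raw payload
          - url: https://webhook.site/<your-unique-id>
            send_resolved: true

Route the same alerts to this receiver temporarily and the raw payload will show up on the webhook.site page.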

unfeeling91 commented 1 year ago

@Konstantinov-Innokentii curl via the Webhook integration works; trying to debug the Alertmanager payload right now.

unfeeling91 commented 1 year ago

The issue is fixed. The problem was with the ingress object: I exposed OnCall via a separate ingress and it started working like a charm. I really appreciate @Konstantinov-Innokentii's effort and support. The issue can be closed, thanks!
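
For anyone hitting the same problem, a rough sketch of what a separate ingress for the OnCall engine might look like; the host, ingress class, service name, and port here are assumptions based on a typical helm install, not values taken from this issue:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: oncall
    spec:
      ingressClassName: nginx            # assumption: whichever ingress controller you run
      rules:
        - host: oncall.example.com       # placeholder host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: oncall-engine  # assumed engine service name from the helm chart
                    port:
                      number: 8080       # assumed engine service port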

Konstantinov-Innokentii commented 1 year ago

@unfeeling91 Thanks! Eager to hear more feedback/issues from you!