prometheus / alertmanager

[discord] Unexpected status code 400: {"embeds": ["0"]} bug #3460

Closed. stefkkkk closed this issue 1 year ago.

stefkkkk commented 1 year ago

What did you do? I am trying to set up alerts to a Discord channel, using a Prometheus metric from the nginx ingress controller:

- name: nginx-ingress-controller-checks
  rules:
  - alert: 4xxErrorsAppear
    expr: rate(nginx_ingress_controller_request_duration_seconds_count{status=~"4.."}[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "4xx erros appeared on domain: https://{{ $labels.host }}"
      description: "Warning! The rate of 4xx responses on domain https://{{ $labels.host }} is above the threshold. Investigate the cause as this may indicate a problem with the service."
      value: "{{ $labels.status }}"
      method: "{{ $labels.method }}"

What did you expect to see? I expected the alerts in the same format as the other alerts already set up:

Alerts Firing:
Labels:
alertname = k8sNodeHighRamUsageWarning
alertgroup = k8s-infra
cluster = eks-dev-eu-central-1
eks_amazonaws_com_instance_type = m5a.large
eks_amazonaws_com_nodegroup = shared-az-a-920230807080051364500000002
instance = ip-10-2-0-190.eu-central-1.compute.internal
job = node-exporter
severity = warning
Annotations:
description = Warning! Used RAM of Node ip-10-2-0-190.eu-central-1.compute.internal on eks-dev-eu-central-1 is more than 70%
summary = Node ip-10-2-0-190.eu-central-1.compute.internal on eks-dev-eu-central-1 cluster has high RAM usage
value = 15.05%

What did you see instead? Under which circumstances? Instead, Alertmanager logs the following error:

    ts=2023-08-15T17:50:14.274Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="discord/discord[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: {\"embeds\": [\"0\"]}"

So far this reproduces only for the nginx ingress controller metric; alerts based on blackbox-exporter or node-exporter metrics, for example, work fine.

Environment

receivers:

I also attached the templated alert that I see and that fires normally; something goes wrong exactly when the alert is sent to the Discord API (screenshot 2023-08-15_22-16).
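
For reference, here is a minimal sketch of the kind of Discord receiver involved, assuming the native discord_configs integration available since Alertmanager v0.25; the webhook URL is a placeholder:

    receivers:
    - name: discord
      discord_configs:
      # Placeholder webhook URL. Discord answers HTTP 400 and points at embed
      # index 0 ({"embeds": ["0"]}) when the rendered embed is invalid, which
      # typically means a field such as the description exceeds its length limit.
      - webhook_url: https://discord.com/api/webhooks/<id>/<token>
        send_resolved: true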

stefkkkk commented 1 year ago

It was fixed by the following refactor. The root cause was the huge set of labels overflowing the notification, since everything gets merged into the single "description" field of the Discord JSON body:

  1. Alertmanager

     route:
       group_by:
       - alertname
       group_wait: 1m
       group_interval: 1m
       repeat_interval: 10m
       receiver: discord

receivers:

How it looks now is shown in the attached screenshot 2023-08-16_12-09.

So the main issue is that Alertmanager's default Discord template is not good.
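
To make the refactor concrete, here is a sketch (illustrative values only, not the exact config from this thread) of keeping the grouping above while overriding the default Discord message, so that only a couple of fields are rendered per alert instead of every label:

    receivers:
    - name: discord
      discord_configs:
      - webhook_url: https://discord.com/api/webhooks/<id>/<token>
        # Keep the stock title, but render a compact body: one short line per
        # alert, which keeps the embed description far below Discord's limits.
        title: '{{ template "discord.default.title" . }}'
        message: |-
          {{ range .Alerts }}
          **{{ .Labels.alertname }}** ({{ .Labels.severity }}): {{ .Annotations.summary }}
          {{ end }}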

stefkkkk commented 1 year ago

For debugging it is also useful to use this site: https://webhook.site/. If you see errors like mine, simply replace the Discord webhook URL with a webhook URL from that site so you can inspect the payload Alertmanager sends.
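
In config terms that just means temporarily swapping the webhook URL, for example (the path is whatever webhook.site generates for you):

    receivers:
    - name: discord
      discord_configs:
      # Point the receiver at webhook.site for a while to capture the exact JSON
      # body Alertmanager posts, then switch back to the real Discord webhook.
      - webhook_url: https://webhook.site/<generated-id>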

ajaydevtron commented 8 months ago

Hello @stefkkkk ,

I applied the same solution to address the issue with Discord and received the alert. However, for the resolved alert, metadata is not displayed on the channel.

[FIRING:1] CPUThrottlingHigh Summary: 30.62% throttling of CPU in the namespace default for container container-name in the pod pod-name.

[RESOLVED] CPUThrottlingHigh

Can you help me here ?
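
One possible cause, though only a guess and not confirmed in this thread, is a custom message template that iterates only over firing alerts; the notification template data also exposes resolved alerts separately, so a sketch like this renders annotations for both:

    message: |-
      {{ if .Alerts.Firing }}Firing:
      {{ range .Alerts.Firing }}- {{ .Labels.alertname }}: {{ .Annotations.summary }}
      {{ end }}{{ end }}{{ if .Alerts.Resolved }}Resolved:
      {{ range .Alerts.Resolved }}- {{ .Labels.alertname }}: {{ .Annotations.summary }}
      {{ end }}{{ end }}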

stefkkkk commented 8 months ago

> Hello @stefkkkk ,
>
> I applied the same solution to address the issue with Discord and received the alert. However, for the resolved alert, metadata is not displayed on the channel.
>
> [FIRING:1] CPUThrottlingHigh Summary: 30.62% throttling of CPU in the namespace default for container container-name in the pod pod-name.
>
> [RESOLVED] CPUThrottlingHigh
>
> Can you help me here ?

I see the same thing; in my case I don't care about it.

ajaydevtron commented 8 months ago

Okay @stefkkkk. Do we have a workaround to get that as well?

stefkkkk commented 8 months ago

> Okay @stefkkkk. Do we have a workaround to get that as well?

No, I don't.

ajaydevtron commented 8 months ago

@stefkkkk My issue was resolved by https://github.com/prometheus/alertmanager/pull/3597.

Thanks.