prometheus / alertmanager

[discord] Unexpected status code 400: {"embeds": ["0"]} bug #3460

Closed. stefkkkk closed this issue 1 year ago.

stefkkkk commented 1 year ago

What did you do? I am trying to set up alerts to a Discord channel, using a Prometheus metric from the nginx ingress controller:

- name: nginx-ingress-controller-checks
  rules:
  - alert: 4xxErrorsAppear
    expr: rate(nginx_ingress_controller_request_duration_seconds_count{status=~"4.."}[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "4xx erros appeared on domain: https://{{ $labels.host }}"
      description: "Warning! The rate of 4xx responses on domain https://{{ $labels.host }} is above the threshold. Investigate the cause as this may indicate a problem with the service."
      value: "{{ $labels.status }}"
      method: "{{ $labels.method }}"

What did you expect to see? I expected the alerts in the same format as the other alerts already set up:

Alerts Firing:
Labels:
alertname = k8sNodeHighRamUsageWarning
alertgroup = k8s-infra
cluster = eks-dev-eu-central-1
eks_amazonaws_com_instance_type = m5a.large
eks_amazonaws_com_nodegroup = shared-az-a-920230807080051364500000002
instance = ip-10-2-0-190.eu-central-1.compute.internal
job = node-exporter
severity = warning
Annotations:
description = Warning! Used RAM of Node ip-10-2-0-190.eu-central-1.compute.internal on eks-dev-eu-central-1 is more than 70%
summary = Node ip-10-2-0-190.eu-central-1.compute.internal on eks-dev-eu-central-1 cluster has high RAM usage
value = 15.05%

What did you see instead? Under which circumstances? Instead, Alertmanager logs the following error:

    ts=2023-08-15T17:50:14.274Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="discord/discord[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: {\"embeds\": [\"0\"]}"

So far this reproduces only for the nginx ingress controller metric; alerts based on blackbox-exporter or node-exporter metrics, for example, work fine.

Environment

receivers:

I also attached the templated alert that I see and that fires normally; something goes wrong exactly when the alert is sent to the Discord API (screenshot 2023-08-15_22-16).
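
For reference, here is a minimal sketch of the kind of Discord receiver involved, assuming the native discord_configs integration available since Alertmanager v0.25; the webhook URL is a placeholder:

    receivers:
    - name: discord
      discord_configs:
      # Placeholder webhook URL. Discord answers HTTP 400 and points at embed
      # index 0 ({"embeds": ["0"]}) when the rendered embed is invalid, which
      # typically means a field such as the description exceeds its length limit.
      - webhook_url: https://discord.com/api/webhooks/<id>/<token>
        send_resolved: true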

stefkkkk commented 1 year ago

It was fixed by the following refactor. The root cause was the huge set of labels overflowing the notification, since everything gets merged into the single "description" field of the Discord JSON body:

  1. Alertmanager

     route:
       group_by:
       - alertname
       group_wait: 1m
       group_interval: 1m
       repeat_interval: 10m
       receiver: discord

receivers:

How it looks now is shown in the attached screenshot 2023-08-16_12-09.

So the main issue is that Alertmanager's default Discord template is not good.
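
To make the refactor concrete, here is a sketch (illustrative values only, not the exact config from this thread) of keeping the grouping above while overriding the default Discord message, so that only a couple of fields are rendered per alert instead of every label:

    receivers:
    - name: discord
      discord_configs:
      - webhook_url: https://discord.com/api/webhooks/<id>/<token>
        # Keep the stock title, but render a compact body: one short line per
        # alert, which keeps the embed description far below Discord's limits.
        title: '{{ template "discord.default.title" . }}'
        message: |-
          {{ range .Alerts }}
          **{{ .Labels.alertname }}** ({{ .Labels.severity }}): {{ .Annotations.summary }}
          {{ end }}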

stefkkkk commented 1 year ago

For debugging it is also useful to use this site: https://webhook.site/. If you see errors like mine, simply replace the Discord webhook URL with a webhook URL from that site so you can inspect the payload Alertmanager sends.
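
In config terms that just means temporarily swapping the webhook URL, for example (the path is whatever webhook.site generates for you):

    receivers:
    - name: discord
      discord_configs:
      # Point the receiver at webhook.site for a while to capture the exact JSON
      # body Alertmanager posts, then switch back to the real Discord webhook.
      - webhook_url: https://webhook.site/<generated-id>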

ajaydevtron commented 8 months ago

Hello @stefkkkk ,

I applied the same solution to address the issue with Discord and received the alert. However, for the resolved alert, metadata is not displayed on the channel.

[FIRING:1] CPUThrottlingHigh Summary: 30.62% throttling of CPU in the namespace default for container container-name in the pod pod-name.

[RESOLVED] CPUThrottlingHigh

Can you help me here ?
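
One possible cause, though only a guess and not confirmed in this thread, is a custom message template that iterates only over firing alerts; the notification template data also exposes resolved alerts separately, so a sketch like this renders annotations for both:

    message: |-
      {{ if .Alerts.Firing }}Firing:
      {{ range .Alerts.Firing }}- {{ .Labels.alertname }}: {{ .Annotations.summary }}
      {{ end }}{{ end }}{{ if .Alerts.Resolved }}Resolved:
      {{ range .Alerts.Resolved }}- {{ .Labels.alertname }}: {{ .Annotations.summary }}
      {{ end }}{{ end }}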

stefkkkk commented 8 months ago

> Hello @stefkkkk ,
>
> I applied the same solution to address the issue with Discord and received the alert. However, for the resolved alert, metadata is not displayed on the channel.
>
> [FIRING:1] CPUThrottlingHigh Summary: 30.62% throttling of CPU in the namespace default for container container-name in the pod pod-name.
>
> [RESOLVED] CPUThrottlingHigh
>
> Can you help me here ?

I see the same thing; in my case I don't care about it.

ajaydevtron commented 8 months ago

Okay @stefkkkk. Do we have a workaround to get that as well?

stefkkkk commented 8 months ago

> Okay @stefkkkk. Do we have a workaround to get that as well?

No, I don't.

ajaydevtron commented 8 months ago

@stefkkkk My issue was resolved by https://github.com/prometheus/alertmanager/pull/3597.

Thanks.