prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

OpsGenie: Empty description causes status code 400 and thus blocks notification even for later alert with non-empty description #270

Closed: m-o-e closed this issue 6 years ago

m-o-e commented 8 years ago

Whenever I restart alertmanager, my log is flooded with the following messages:

time="2016-03-09T01:14:14+01:00" level=warning msg="Notify attempt 11 failed: unexpected status code 400" source="notify.go:193"

Alertmanager quickly backs off but continues to spam the log with these warnings. This happens without any active alerts; I think they are leftovers from past OpsGenie alerts.

Proposed fix: Alertmanager should cancel notifications when (or shortly after) the alert that triggered them is closed.

grobie commented 8 years ago

I'm not sure canceling notifications when the alert is gone is the right fix. Scenario: an alert starts, the alert is sent to OpsGenie, the connection to OpsGenie becomes unavailable, the alert closes. We still want to let OpsGenie know that the alert has been closed once we can reach OpsGenie again.

I'd rather propose to find out why OpsGenie responds with a 400 here. The status code indicates the request sent by Alertmanager is invalid, and we should fix that.

m-o-e commented 8 years ago

I think the 400s may indeed happen when the alert is closed on the OpsGenie side, and alertmanager ends up trying to 're-close' an already closed alert.

Edit: Can confirm that the retrying calls go out to api.opsgenie.com. Can't inspect the content because they run over SSL, but since there is currently no active alert in alertmanager, I guess the above theory makes the most sense.

grobie commented 8 years ago

In that case Alertmanager's duty to close the alert would already be done, and it should stop retrying. I propose to:

  1. Add a debug log line to actually log the body of an unexpected response
  2. Validate the theory
  3. Parse the body of 400 responses and handle "already closed" as success (see the sketch below)
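
A minimal sketch of what step 3 could look like, assuming a hypothetical helper around the close-alert call; the endpoint, payload handling, and the "already closed" string match are placeholders rather than Alertmanager's actual code or OpsGenie's documented response format:

package notify

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

// closeAlert posts a close request and treats an "already closed" 400
// response as success so the retry loop stops; any other failure is
// returned with the response body included, which also covers step 1.
func closeAlert(client *http.Client, url string, payload io.Reader) error {
    resp, err := client.Post(url, "application/json", payload)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode/100 == 2 {
        return nil
    }
    body, _ := io.ReadAll(resp.Body)
    if resp.StatusCode == http.StatusBadRequest &&
        strings.Contains(strings.ToLower(string(body)), "already closed") {
        // Nothing left to do: the alert is already gone on the OpsGenie side.
        return nil
    }
    return fmt.Errorf("unexpected status code %d: %s", resp.StatusCode, body)
}
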
m-o-e commented 8 years ago

Agree +1.

grobie commented 8 years ago

@m-o-e Can you test #271? Let me know if I should build a binary for you.

m-o-e commented 8 years ago

Now that's what I call quick response!

@grobie Yes a linux-amd64 binary would be nice (building is probably easy, but it's late here ;) )

grobie commented 8 years ago

@m-o-e alertmanager.zip

m-o-e commented 8 years ago

@grobie Thanks. I haven't gotten around to testing it yet. Will report back when I've had a chance to test (hopefully this week).

amorken commented 8 years ago

Incidentally, I think I've stumbled on the same problem, and my proposed fix is in PR #277. @m-o-e, if you feel like testing that branch I'd appreciate it. =)

m-o-e commented 8 years ago

Sorry for the long latency, busy times here.

@grobie I have now switched our Alertmanager to your build and will report back in a few days (once we've had at least one alert) whether the failed Notify attempts still happen.

Build info
branch  log-opsgenie-error-response
buildDate   20160308-21:36:01
buildUser   grobie@gropad
goVersion   1.6
revision    bdb2cd4
version 0.1.0

@amorken Sorry, I don't currently have time for multiple tests. I will try your PR if @grobie's doesn't work. Maybe you two can compare diffs; I imagine your fixes are probably very similar?

lukaf commented 8 years ago

I've seen similar errors.

Checking the OpsGenie logs shows the message field is empty, and dumping the request from Alertmanager to OpsGenie shows that all fields that are supposed to come from Prometheus are indeed empty:

{
  "apiKey": "...",
  "alias": 7433040972645057000,
  "message": "",
  "details": {
    "summary": "",
    "value": ""
  },
  "source": "prometheus",
  "teams": "all",
  "tags": "none"
}

I'm using Prometheus 1.6.2 if that helps.

lukaf commented 8 years ago

A bit more info.

Empty content seems to be sent only when using templates.

receivers:
- name: 'engineering'
  opsgenie_configs:
  - api_key: '...'
    description: '{{ index .CommonAnnotations "description" }}'
    source: 'prometheus'
    send_resolved: true
    details:
      summary: '{{ index .CommonAnnotations "summary" }}'
      value: '{{ index .CommonAnnotations "value" }}'

After a while Alertmanager would start sending empty messages (as in the previous post). A workaround seems to be to add some static content outside the template. This produces nonsense notifications to OpsGenie, but it also lets all the real notifications through.

I've modified the description to description: 'A {{ index .CommonAnnotations "description" }}' and am suppressing notifications on messages containing only "A ".

alertmanager: 0.1.1 prometheus: 0.17.0

brian-brazil commented 8 years ago

It sounds like we should send a space if the description is empty.
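
A sketch of that suggestion, assuming a hypothetical helper in the notifier; whether OpsGenie would accept a whitespace-only message is not verified in this thread, so the sketch uses a visible placeholder instead of a literal space:

package notify

import "strings"

// fallbackIfEmpty substitutes a placeholder when the rendered description is
// empty or whitespace-only, so the OpsGenie message field is never blank.
// The placeholder text is illustrative, not something Alertmanager sends today.
func fallbackIfEmpty(rendered string) string {
    if strings.TrimSpace(rendered) == "" {
        return "(no description)"
    }
    return rendered
}
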

lukaf commented 8 years ago

Hi @brian-brazil

The description is only empty in one message.

I currently have 3 alerts. Two of them were sent by Alertmanager to OpsGenie as expected. When the third alert was created, an empty message was sent to OpsGenie, which then complains that the message is empty (probably even if it contains only spaces). When I changed the description as explained above, I get an alert with the message "A " as well as the third alert.

So in the end, Prometheus shows 3 alerts and Alertmanager shows 3 alerts, but OpsGenie shows 4: three of them are the expected ones and the fourth is "A ".

Update: This is from OpsGenie: Alert " A " created on 12-May-2016 13:50. So it seems the template part renders as empty rather than as a space.

brian-brazil commented 8 years ago

So OpsGenie is accepting the alert with the empty message, but returning a 400? This sounds like a bug on the OpsGenie side.

lukaf commented 8 years ago

I guess I'm not being clear enough.

A configuration that uses only a template to define the description value can cause Alertmanager to send empty descriptions to OpsGenie. When that happens, OpsGenie refuses the notification and returns a 400.

While Alertmanager keeps retrying those empty descriptions, it seems to block other alerts that would otherwise have their description filled in as expected.

So putting "A" in front of the template part makes OpsGenie accept the message, and Alertmanager is then able to send the other alerts as well.

brian-brazil commented 8 years ago

While Alertmanager keeps retrying those empty descriptions, it seems to block other alerts that would otherwise have their description filled in as expected.

Ah, a head-of-line blocking issue. This is the first report we've had of that, and that's a concerning problem.
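
To make the head-of-line blocking concrete, here is a deliberately simplified illustration (not Alertmanager's actual notification pipeline): if notifications are sent sequentially from one queue and each one is retried until it succeeds, a permanently failing item starves everything queued behind it.

package notify

import "time"

// drainQueue is a naive sequential sender: each notification is retried until
// it succeeds before the next one is attempted. A notification that can never
// succeed (e.g. a request that always gets a 400) blocks every notification
// behind it in the queue.
func drainQueue(queue []string, send func(string) error) {
    for _, n := range queue {
        for {
            if err := send(n); err == nil {
                break
            }
            time.Sleep(time.Second) // back off, then retry the same item
        }
    }
}
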

ryancurrah commented 8 years ago

I'm experiencing the same issue as above. I only get 400 errors with an empty message field in OpsGenie when alerts are grouped.

These are the settings and resulting alerts for testing...

Prometheus...

# Alert for when any service registered in consul is down
ALERT ConsulServiceDownCrit
  IF sum by (node, service)(consul_catalog_service_node_healthy == 0)
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Service on {{ $labels.node }} down",
    description = "{{ $labels.node }} with service {{ $labels.service }} is down.",
    value = "{{ $value }}"
  }

[screenshot: prom_alerts]

Alertmanager...

global: {}

inhibit_rules:
  - equal:
      - alertname
      - cluster
      - service
    source_match:
        severity: critical
    target_match:
        severity: warning

route:
  group_by:
    - alertname
    - cluster
    - service
  group_interval: 5m
  group_wait: 30s
  receiver: eis-engineering-pager
  repeat_interval: 3h

receivers:
  - name: eis-engineering-pager
    hipchat_configs:
      - api_url: REDACTED
        auth_token: REDACTED
        message_format: html
        notify: true
        room_id: 33

    opsgenie_configs:
      - api_key: REDACTED
        description: '{{ index .CommonAnnotations "description" }}'
        send_resolved: true
        source: prometheus
        teams: engineering_team
        details:
          summary: '{{ index .CommonAnnotations "summary" }}'
          value: '{{ index .CommonAnnotations "value" }}'

[screenshot: alertmanager_alerts]

Alertmanager log errors

{"level":"warning","msg":"Notify attempt 12 failed: unexpected status code 400","source":"notify.go:193","time":"2016-06-23T19:35:40-05:00"}
{"level":"warning","msg":"Notify attempt 12 failed: unexpected status code 400","source":"notify.go:193","time":"2016-06-23T19:35:54-05:00"}
{"level":"warning","msg":"Notify attempt 13 failed: unexpected status code 400","source":"notify.go:193","time":"2016-06-23T19:36:15-05:00"}
{"level":"warning","msg":"Notify attempt 13 failed: unexpected status code 400","source":"notify.go:193","time":"2016-06-23T19:36:24-05:00"}

OpsGenie log error


{
  "_result": {
    "integrationName": "Default API",
    "integrationActionName": "Create Alert",
    "alertActionType": "create"
  },
  "error": "com.ifountain.naas.util.DiscardableValidationException: Can not validate alert. Reason: Property [message] cannot be empty.",
  "_incomingData": {
    "_parsedData": {
      "-originalTags": [],
      "note": "null",
      "_extra_properties": {
        "summary": "",
        "value": "0"
      },
      "_originalExtraProperties": {
        "summary": "",
        "value": "0"
      },
      "apiKey": "REDACTED",
      "alertSource": "API[Default API-Create Alert] with customSource[prometheus]",
      "_target_integration_action_": "create",
      "source": "prometheus",
      "_details": {
        "summary": "",
        "value": "0"
      },
      "message": "",
      "_extraProperties": {
        "summary": "",
        "value": "0"
      },
      "triggeringAction": "null",
      "-teams": [
        "engineering_team"
      ],
      "alertNotificationPreference": "null",
      "delayIfDoesNotExists": "true",
      "alias": "REDACTED",
      "-originalActions": [],
      "user": "System"
    },
    "integrationType": "API",
    "_httpBodyJson": {
      "apiKey": "REDACTED",
      "teams": "engineering_team",
      "alias": "REDACTED",
      "_details": {
        "summary": "",
        "value": "0"
      },
      "source": "prometheus",
      "message": ""
    },
    "_httpParams": {},
    "integrationName": "Default API",
    "httpUrl": "/v1/json/alert",
    "incomingDataId": "REDACTED",
    "_httpHeaders": {
      "X-ELB-IP": "REDACTED",
      "Connection": "close",
      "User-Agent": "Go-http-client/1.1",
      "X-Forwarded-Proto": "https",
      "X-Forwarded-For": "REDACTED",
      "Host": "localhost:9001",
      "Accept-Encoding": "gzip",
      "Content-Length": "178",
      "X-Forwarded-Port": "443",
      "Content-Type": "application/json"
    }
  }
}

So what happens when these alerts fire is that only the alerts that are not grouped make it through to OpsGenie without a 400 response. I checked the OpsGenie logs and saw the 400 errors complaining about the message field being empty; the working alerts had the message field filled in.

I thought maybe it was a template issue, with a template variable being unavailable when alerts are grouped. I removed the $labels.node variable from the rule's annotations and restarted Prometheus.

When the alerts went from pending to firing, it worked this time! The grouped alerts are now showing up in OpsGenie!

It seems to be an issue with how templates render when alerts are grouped. Now I can't put the node/hostname in the alert description, which limits the contextual information I can include in the summary. I would love to be able to include host names.

Not sure how to handle this. I hope this is similar to the issue reported.

beorn7 commented 8 years ago

@fabxc Do you understand what's going on here? Sounds worrying…

thenayr commented 8 years ago

I'm running into this same problem. I was trying to have Prometheus send my notifications to Slack with alerts that contained $labels.instance, and all of my messages were coming in blank.

My solution was to add instance to the group_by list; then the label becomes available in the grouped alert.

Reference:

ALERT kubeMinionMemoryFree
  IF node_memory_MemFree+node_memory_Buffers+node_memory_Cached < 6000000000
  ANNOTATIONS {
    summary = "Low memory on node",
    description = "{{ $labels.instance }} has low memory (current value: {{ $value }})"
  }

route:
  group_by: ['alertname', 'cluster', 'service', 'instance']

receivers:
- name: 'slack-op-notifications'
  slack_configs: 
  - send_resolved: true
    api_url: xxx
    channel: '#xxx'
    text: '{{ .CommonAnnotations.summary }}, {{ .CommonAnnotations.description }}'
brian-brazil commented 8 years ago

That's unrelated: without breaking out by instance, the description would be different for each alert in the group and thus not common.
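
For illustration, here is a small standalone program (not Alertmanager code) showing why the rendered description ends up empty: .CommonAnnotations only contains annotations shared by every alert in the group, and indexing a missing key in a Go template renders as an empty string. The annotation values are made up for the example.

package main

import (
    "os"
    "text/template"
)

func main() {
    // Only the annotations identical across all alerts in the group survive
    // into CommonAnnotations; a per-instance "description" is therefore absent.
    data := map[string]interface{}{
        "CommonAnnotations": map[string]string{"summary": "Low memory on node"},
    }
    tmpl := template.Must(template.New("msg").Parse(
        `description: "{{ index .CommonAnnotations "description" }}"`))
    // Prints: description: "" -- the missing key renders as the empty string.
    if err := tmpl.Execute(os.Stdout, data); err != nil {
        panic(err)
    }
}
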

thenayr commented 8 years ago

@brian-brazil Thanks for clearing that up. Not to detract from this issue, but what would the solution be for having a grouped alert show a list of all instances that are currently firing that alert? It seems to be the problem we are both trying to solve.

amorken commented 8 years ago

@thenayr - You can iterate over the individual alert instances using Go templating - here's an example that creates ...rather verbose output:

text: >-
  {{ range .Alerts }}
      Summary: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Playbook: {{ .Annotations.playbook }}
      Graph: {{ .GeneratorURL }}
      Details:
      {{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
      {{ end }}
  {{ end }}

I've left this response here since this issue seems to attract questions about this topic, but I think further discussion should happen on https://groups.google.com/forum/#!forum/prometheus-developers or perhaps the IRC channel, see https://prometheus.io/community/. =)

beorn7 commented 8 years ago

Updated the title so we don't lose track of what this is actually about.

m0untains commented 7 years ago

FWIW, I have observed similar issues with HipChat as the backend. version=0.4.2

time="2017-01-17T17:39:53Z" level=warning msg="Notify attempt 10 failed: unexpected status code 400" source="notify.go:193"
time="2017-01-17T17:40:11Z" level=warning msg="Notify attempt 11 failed: unexpected status code 400" source="notify.go:193"
time="2017-01-17T17:40:32Z" level=warning msg="Notify attempt 12 failed: unexpected status code 400" source="notify.go:193"
time="2017-01-17T17:41:32Z" level=warning msg="Notify attempt 13 failed: unexpected status code 400" source="notify.go:193"
time="2017-01-17T17:42:54Z" level=warning msg="Notify attempt 14 failed: unexpected status code 400" source="notify.go:193"
m0untains commented 7 years ago

Regarding my previous post (failed HipChat notifications with a 400 response code): we finally figured out the issue. I will post it here in case others stumble upon the same thing.

Basically, HipChat will reject (with a 400 response code) any message field that is 0 bytes or greater than 10000 bytes. In our case we hit the upper limit. Here are a few ideas for improvement in the notification handler:

  1. Make sure the message field is never empty before sending.
  2. Truncate messages that exceed the provider's size limit (10000 bytes for HipChat).
  3. Log the response body when a notification fails with an unexpected status code.

That last point is probably the most important (and how we discovered our issue). When HipChat returns a 400, the response body contains the exact reason why it was rejected. Had this information been logged by Alertmanager, it would have made understanding the issue immensely easier.
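
A sketch of the size guard described above, using the 10000-byte limit reported for HipChat; the truncation that was actually added later (see the follow-up below) may differ in its details:

package notify

import "unicode/utf8"

// truncate caps a message at maxBytes (HipChat reportedly rejects anything
// over 10000 bytes with a 400) and cuts on a rune boundary so multi-byte
// characters are not split in half.
func truncate(s string, maxBytes int) string {
    if len(s) <= maxBytes {
        return s
    }
    cut := maxBytes
    for cut > 0 && !utf8.RuneStart(s[cut]) {
        cut-- // step back off a UTF-8 continuation byte
    }
    return s[:cut]
}
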

stuartnelson3 commented 6 years ago

Truncation was added in 877d2780.