sensu-plugins / sensu-plugins-pagerduty

This plugin provides a Sensu handler for PagerDuty.
http://sensu-plugins.io
MIT License
20 stars 32 forks source link

Pagerduty alert text is not updated by handler #14

Open nicholascapo opened 8 years ago

nicholascapo commented 8 years ago

The alert text in Pagerduty does not change even when the output from the check has changed. This means alerts can get suddenly worse, without me being aware of it.

Example:

Suppose I have a disk usage check which runs every ten minutes.

The disk is filling up so the check begins alerting (as seen in Uchiwa): CheckDisk WARNING: / 71%

I get a Pagerduty alert with text: CheckDisk WARNING: / 71%

Ten minutes later, the disk is filling up really fast so the alert changes (again as seen in Uchiwa): CheckDisk CRITICAL: / 95%

But the alert in Pagerduty still says: CheckDisk WARNING: / 71%

analytically commented 8 years ago

The event_summary field is never touched in the handler. Can you show us a way to replicate this?

analytically commented 8 years ago

Are you setting the incident_key field?

https://developer.pagerduty.com/documentation/integration/events/trigger -> if incident_key is set, this event will be appended to that incident's log, but the text might remain the same

nicholascapo commented 8 years ago

@analytically No I'm not setting the incident_key field.

Example configuration:

The PageryDuty plugin was installed using /opt/sensu/embedded/bin/gem install sensu-plugins-pagerduty

This example uses the date command, but the above example of disk usage still works.

/etc/sensu/plugins/check-example:

#!/bin/bash
echo $(date)
exit 1

/etc/sensu/conf.d/check-example.json:

{
  "checks": {
    "disk": {
      "command": "/etc/sensu/plugins/check-example",
      "handlers": [
        "pagerduty"
      ],
      "interval": 10,
      "subscribers": [
        "all"
      ]
    }
  }
}

/etc/sensu/conf.d/handler-pagerduty.json:

{
  "handlers": {
    "pagerduty": {
      "command": "/opt/sensu/embedded/bin/handler-pagerduty.rb",
      "type": "pipe"
    }
  },
  "pagerduty": {
    "api_key": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  }
}
nicholascapo commented 8 years ago

I think what I'm asking for is that the description be updated every time the alert is triggered: https://developer.pagerduty.com/documentation/integration/events/trigger

zbintliff commented 8 years ago

What I just wrote was wrong. But for this it doesn't make sense because then Sensu will have to keep track of outputs instead of exit code.

zbintliff commented 8 years ago

Also, what you link makes it look like the description of the Alert is immutable

majormoses commented 7 years ago

I wonder if now that there is the concept of an alert and an incident we create an alert and attach it to an incident and then close and open new incidents with the same alert to keep track of history? BTW I still don't like this I just can't think of a way to solve it nicely given the constraints of sensu and pagerduty.

eliten00b commented 7 years ago

I would be still interessted in a long term solution. We have the exact same problem. A check by default alert a critical issue on pagerduty, but for each client we override the pd_override for warnings to a lower issue on pagerduty.

The problem is that only the last issue get resolved.

Examples: ok -> warning -> ok : the low issue in pagerduty is resolved ok -> warning -> critical -> ok : the high issue in pagerduty is resolved, the low issue stay ok -> warning -> critical -> warning -> ok : the lower issue is resolved, the high issue stay

zbintliff commented 7 years ago

I would argue this isn't an issue with the handler but with how sensu handles state transitions. Additionally, its a hard problem to solve since Sensu only keeps a finite history you can always think of a case where one will be unsolved. (ok -> w -> C (10000 times) -> ok)

majormoses commented 7 years ago

I will ping some internal sensu folks and get their take on it but I agree there is really not much that I see we can do given how sensu handles state transition and also how the pagerduty api works.

majormoses commented 7 years ago

Also I had a conversation with pagerduty about this a while ago and so we can't change the actual severity of the event once it has been sent to pagerduty.

majormoses commented 7 years ago

The only thing I can think of is the hack outlined here: https://github.com/sensu-plugins/sensu-plugins-pagerduty/issues/30#issuecomment-289813445

portertech commented 7 years ago

@eliten00b to properly manage a PagerDuty incident for each Sensu check severity transition, this handler plugin could use an incident key that includes severity. This handler plugin could then check for the existence of a PagerDity incident for the previous severity (Sensu event check history -1), and resolve the incident if it exists.

portertech commented 7 years ago

@zbintliff I don't see "how sensu handles state transitions" has something to do with this issue, can you elaborate and provide an example?

portertech commented 7 years ago

@majormoses I see this as a PagerDuty API limitation, being able to update the incident description would be rad, and make things considerably simpler.

zbintliff commented 7 years ago

@portertech Sensu itself doesn't issue a RESOLVE event on state transition (2 -> 1), only when it hits 0. Having a handler take that logic in itself is a bit out of scope for its responsibility.

If the handler takes responsibility of it then we need to know the check occurrences required to alert and check that many back in history. Then we also need to check if it happened enough to acctually alert. For example, if I have occurrences set to 5 and my history is 0,1,2,2,2,2,2 We will trigger a critical. We go back 5 and see a 1 but it didn't happen enough to actually Alert with a warning.

zbintliff commented 7 years ago

btw I agree updating description on an alert would be nice.

portertech commented 7 years ago

@zbintliff I would leverage PagerDuty's API and its use of composed incident keys and the existence of a matching open incident, instead of having to consider Sensu event filtering methods (which could also be customized and added to by a user).

eliten00b commented 7 years ago

Thanks for all the response. I fully agree that everything would be a workaround because of pagerduty api limitation.

I will try to solve the problem with a filter.

mbbroberg commented 6 years ago

What would you say the right move is here? I think documenting this as a known limitation to PagerDuty's API with a workaround of using filtering would be handy. Could be in a callout in the main README.md. Thoughts?

majormoses commented 6 years ago

Definitely we should document this behavior as the text will not update and there is no workaround. There is overlap with this in #30 which there are workarounds for even in they are hacky. Going forward on #30 I would guess that we filter events in the handler (I know we are supposed to be trending away from that but I see no way around this) and use occurrences_watermark to resolve issues to handle the transition from crit <-> warn. This would be something that we would not enable by default (to avoid breaking changes) and new config would be needed to enable and tune how it works.

ashleyabrooks commented 6 years ago

Hello from the PagerDuty Team! Following up to confirm this is a limitation in the PagerDuty API rather than in the Sensu integration itself. At the moment, incidents are immutable. I've submitted a feature request from the maintainer of this integration so our product team knows mutable incidents are important to our customers. If you have any questions, please feel free to reach out to support@pagerduty.com.

majormoses commented 6 years ago

Thanks for confirming this.