mozilla-services / pagerstatus

A service to automatically update Statuspage.io based on Pagerduty incidents
Apache License 2.0

Unexpectedly dying when checking incident for component #11

Closed by sciurus 5 years ago

sciurus commented 5 years ago
08:24:25 START RequestId: 3f284b58-bfd3-11e8-ab0c-a785deacc4c4 Version: $LATEST
08:24:25 Found 3 pagerduty incidents
08:24:25 For pagerduty incident P2X6TZ2
08:24:25 END RequestId: 3f284b58-bfd3-11e8-ab0c-a785deacc4c4

The above shows pagerstatus dying at the point where it should be running https://github.com/mozilla-services/pagerstatus/blob/master/chalicelib/pagerduty.py#L36-L49. None of the print calls in that block are ever logged, which makes me think an exception I'm not handling is raised while inspecting the response from Pagerduty. I tried rerunning the relevant code with the same Pagerduty incident as input but could not reproduce the problem.

Since I couldn't reproduce the problem, I put pagerstatus in debug mode (which should cause it to return stack traces for unhandled exceptions) and enabled request/response logging in API Gateway (so I can see the stack trace).
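For illustration, one way to keep a single bad incident from silently ending the whole run is to catch the exception per incident and log the stack trace. This is only a sketch: `process_incidents` and `handle_incident` are hypothetical names, not functions from pagerstatus, and the incident dictionaries are assumed to carry an `id` key.

```python
import logging

logger = logging.getLogger("pagerstatus_sketch")  # hypothetical logger name


def process_incidents(incidents, handle_incident):
    """Call handle_incident on each incident; log failures instead of dying.

    Returns the ids of the incidents that raised, so the caller can report them.
    """
    failed = []
    for incident in incidents:
        # Assumption: each incident is a dict with an "id" key.
        if isinstance(incident, dict):
            incident_id = incident.get("id", "<unknown>")
        else:
            incident_id = "<unknown>"
        try:
            handle_incident(incident)
        except Exception:
            # logger.exception records the message plus the full stack trace.
            logger.exception("Failed to process pagerduty incident %s", incident_id)
            failed.append(incident_id)
    return failed
```

With something like this in place, the log would show which incident failed and why, and the remaining incidents would still be processed.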

sciurus commented 5 years ago

FYI @bqbn this prevented your AMO alerts from creating a statuspage entry last night.

bqbn commented 5 years ago

Yeah, I was wondering why the status page didn't change last night. Thanks for looking into it. :)

sciurus commented 5 years ago

This affected @bqbn again yesterday for Sentry.

I'm optimistic that at this point I've fixed the issue. The basic problem is that the shape of the data that Pagerduty returns about an incident can change, and I didn't handle enough variations of it.

While testing I inadvertently created an incident where details was a string instead of a dictionary. I don't think this was what caused the failures we saw, but I improved handling of that in 4e3af29258735bcba7d8e8855ec7e59a545540ee.
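A minimal sketch of the kind of guard this implies, assuming the goal is to always end up with a dictionary. `coerce_details` and the `description` key are hypothetical and are not taken from the commit referenced above.

```python
def coerce_details(details):
    """Return incident details as a dict, whatever shape was received."""
    if isinstance(details, dict):
        return details
    if isinstance(details, str):
        # Hypothetical convention: keep the raw text under a "description" key.
        return {"description": details}
    # Anything else (None, numbers, lists) is treated as "no details".
    return {}
```

Callers can then use `coerce_details(raw).get("tags")` without first checking the type of `raw`.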

The debugging I enabled yesterday revealed that for Sentry we were dying at https://github.com/mozilla-services/pagerstatus/blob/367e422aeb14132e6065b3e65aaa6e42c77c968c/chalicelib/pagerduty.py#L42.

It turns out that for Datadog tags Pagerduty returns a comma-delimited string, but for Pingdom Pagerduty returns a list. This is true regardless of the number of tags. I'm unsure how I tested Pingdom before; perhaps this is a new change on Pagerduty's part, or perhaps I was mistaken when I thought I had successfully tested Pingdom. Regardless, I've added code to handle this in 96bbe8896dfaa8c4f92b80a118c26ab4e03c985b.
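The string-versus-list difference can be smoothed over by normalizing tags to a list before using them. The following is a sketch under that assumption; `normalize_tags` is a hypothetical helper and may differ from what the commit above actually does.

```python
def normalize_tags(tags):
    """Return tags as a list of non-empty, stripped strings.

    Accepts a comma-delimited string (as described for Datadog), a list
    (as described for Pingdom), or None.
    """
    if tags is None:
        return []
    if isinstance(tags, str):
        # Split on commas and drop empty pieces such as those in "a,,b".
        return [part.strip() for part in tags.split(",") if part.strip()]
    if isinstance(tags, (list, tuple)):
        return [str(tag).strip() for tag in tags if str(tag).strip()]
    # Unknown shape: treat as having no tags rather than raising.
    return []
```

Either input shape then yields the same result, for example `normalize_tags("amo, sentry")` and `normalize_tags(["amo", "sentry"])` both give a two-element list.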