m-lab / alertmanager-github-receiver

Prometheus Alertmanager webhook receiver that creates GitHub issues from alerts
Apache License 2.0
48 stars 23 forks

relabel resolved alerts #37

Closed cjyar closed 4 years ago

cjyar commented 4 years ago

When an alert stops firing, it would be nice to relabel the corresponding issue rather than closing it. This would allow people to process the alert even if it "fixed itself."

What would a PR for this look like? I can write one if we find a design we agree on.

I'm thinking of a flag like stopped-label that would take the name of the "stopped" label to apply. When an alert fires, we would search for issues regardless of the stopped label. If the issue already exists, and it has the stopped label, we would remove that label.
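To make the proposal concrete, here is a minimal sketch of that logic in Go. It is not the receiver's actual code: `Issue`, `Tracker`, `OnFiring`, and `OnResolved` are hypothetical names, and an in-memory map stands in for the GitHub API.

```go
package main

import "fmt"

// Issue is a hypothetical in-memory stand-in for a GitHub issue.
type Issue struct {
	Title  string
	Labels map[string]bool
}

// Tracker is a hypothetical stand-in for the GitHub client.
// Issues are keyed by title, which is an assumption of this sketch.
type Tracker struct {
	StoppedLabel string
	Issues       map[string]*Issue
}

// OnFiring looks up the issue regardless of the stopped label, creates it
// if it does not exist, and removes the stopped label if it is present.
func (t *Tracker) OnFiring(title string) *Issue {
	issue, ok := t.Issues[title]
	if !ok {
		issue = &Issue{Title: title, Labels: map[string]bool{}}
		t.Issues[title] = issue
	}
	delete(issue.Labels, t.StoppedLabel)
	return issue
}

// OnResolved applies the stopped label instead of closing the issue.
func (t *Tracker) OnResolved(title string) {
	if issue, ok := t.Issues[title]; ok {
		issue.Labels[t.StoppedLabel] = true
	}
}

func main() {
	t := &Tracker{StoppedLabel: "stopped", Issues: map[string]*Issue{}}
	t.OnFiring("DiskFull")
	t.OnResolved("DiskFull")
	fmt.Println("labeled after resolve:", t.Issues["DiskFull"].Labels["stopped"])
	t.OnFiring("DiskFull")
	fmt.Println("labeled after refire:", t.Issues["DiskFull"].Labels["stopped"])
}
```

The issue stays open through the whole cycle; only the label changes, so someone can still triage an alert that "fixed itself."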

stephen-soltesz commented 4 years ago

Hello, @cjyar -- I like this idea! There is a flag for -enable-auto-close but we leave that off for exactly the reason you describe.

Where does the term "stopped" come from? Prometheus has a notion of ALERTS with an alertstate label that is either "pending", "firing", or non-existent.

Could the github-receiver apply the 'firing' label to issues for firing alerts and remove it for resolved alerts? Or possibly add a label called "resolved"? (I prefer that to the word "stopped".)

Perhaps a flag like: -label-on-resolved?

cjyar commented 4 years ago

Hi @stephen-soltesz. I see you're right; looking at the alertmanager docs, the proper term is resolved. So -label-on-resolved seems like a good name.
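For illustration, the flag could be declared with Go's standard `flag` package roughly like this. The helper `parseFlags`, the help text, and the convention that an empty value disables relabeling are assumptions of this sketch, not something already in the receiver.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags is a hypothetical helper that parses only the proposed flag.
// An empty value is assumed to mean "do not relabel resolved alerts".
func parseFlags(args []string) (string, error) {
	fs := flag.NewFlagSet("github-receiver", flag.ContinueOnError)
	label := fs.String("label-on-resolved", "",
		"Label to apply to an issue when its alert resolves (empty disables relabeling).")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *label, nil
}

func main() {
	label, err := parseFlags([]string{"-label-on-resolved=resolved"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("label on resolved:", label)
}
```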

You're suggesting that there be an additional -label-on-firing? I don't think we have a need for that, but it should be straightforward to add it along with the other.

stephen-soltesz commented 4 years ago

Hi, @cjyar I was thinking out loud re: the "firing" label. Your original idea of adding a new label when the alert is "resolved" sounds like the right place to start. :+1:

cjyar commented 4 years ago

This little PR is growing. The code no longer compiles against the latest github.com/google/go-github because their API has diverged. I've worked around the problem by adding a go.mod to the repository, which I realize may not be popular with you. I'm open to ideas on a better way to handle the problem; maybe a separate PR?

Edit: In my local copy it's pinned against go-github v17.0.0.

stephen-soltesz commented 4 years ago

Our team's thinking about go mod has evolved. It's good to add :+1:

cjyar commented 4 years ago

Are you waiting on me to close this?

BTW, I've been running with this change for a few days now, and it's good but not perfect. It's using edge-triggering, meaning it tries to relabel resolved alerts when it receives the resolved message from alertmanager. If that relabel fails (e.g., hitting GitHub's API rate limit), then it'll never try again.

It would be better to have some kind of level-triggering logic to eventually sweep up these missed issues. Maybe a 4-hour timer like alertmanager has?

stephen-soltesz commented 4 years ago

@cjyar my preference is for the github-receiver to remain stateless, which would mean inheriting the edge-triggering semantics of the alertmanager for now.

How often are you seeing missed events?

stephen-soltesz commented 4 years ago

I think this is resolved. Let me know if not.

cjyar commented 4 years ago

It is. Sorry for not replying sooner; I saw a bunch of dropped events when I rolled out this version, but it seems well behaved now. Probably it was bumping up against the API rate limits before, and the new version is being more reasonable.