cjyar closed this issue 4 years ago
Hello, @cjyar -- I like this idea! There is a flag for `-enable-auto-close` but we leave that off for exactly the reason you describe.
Where does the term "stopped" come from? Prometheus has a notion of `ALERTS` with an `alertstate` label that is either "pending", "firing", or non-existent.
Could the github-receiver apply the "firing" label to issues for firing alerts and remove it for resolved alerts? Or possibly add a label called "resolved"? (I seem to prefer that to the word "stopped".)
Perhaps a flag like: `-label-on-resolved`?
Hi @stephen-soltesz. I see you're right; looking at the alertmanager docs, the proper term is resolved. So `-label-on-resolved` seems like a good name.
You're suggesting that there be an additional `-label-on-firing`? I don't think we have a need for that, but it should be straightforward to add it along with the other.
Hi, @cjyar I was thinking out loud re: the "firing" label. Your original idea of adding a new label when the alert is "resolved" sounds like the right place to start. :+1:
This little PR is growing. The code no longer compiles against the latest `github.com/google/go-github` because their API has diverged. I've worked around the problem by adding a `go.mod` to the repository, which I realize may not be popular with you. I'm open to ideas on a better way to handle the problem; maybe a separate PR?
Edit: In my local copy it's pinned against go-github v17.0.0.
Our team's thinking about go mod has evolved. It's good to add :+1:
Are you waiting on me to close this?
BTW, I've been running with this change for a few days now, and it's good but not perfect. It's using edge-triggering, meaning it tries to relabel resolved alerts when it receives the resolved message from alertmanager. If that relabel fails (e.g., hitting GitHub's API rate limit), then it'll never try again.
It would be better to have some kind of level-triggering logic to eventually sweep up these missed issues. Maybe a 4 hour timer like alertmanager has?
@cjyar my preference is for the github-receiver to remain stateless, which would mean inheriting the edge-triggering semantics of the alertmanager for now.
How often are you seeing missed events?
I think this is resolved. Let me know if not.
It is. Sorry for not replying sooner; I saw a bunch of dropped events when I rolled out this version, but it seems well behaved now. Probably it was bumping up against the API rate limits before, and the new version is being more reasonable.
When an alert stops firing, it would be nice to relabel the corresponding issue rather than closing it. This would allow people to process the alert even if it "fixed itself."
What would a PR for this look like? I can write one if we find a design we agree on.
I'm thinking of a flag like `stopped-label` that would take the name of the "stopped" label to apply. When an alert fires, we would search for issues regardless of the stopped label. If the issue already exists, and it has the stopped label, we would remove that label.