inCaller / prometheus_bot

Telegram bot for prometheus alerting
MIT License

thank you / hint for http 400 response on windows #1

Closed man-at-home closed 7 years ago

man-at-home commented 8 years ago

Thank you for your work. I am now using this bot in my Prometheus installation.

I had to change the deserialization of alerts. Although it worked, it always had a 400 HTTP error set, and at least in my Windows environment the subsequent "c.AbortWithStatus(http.StatusOK)" would not work, so Alertmanager would get a 400 back and retry the message endlessly. So I changed:

        var alerts Alerts
        //      c.BindJSON(&alerts)
        binding.JSON.Bind(c.Request, &alerts)

hryamzik commented 8 years ago

@man-at-home sorry for the late response, I just got back from my vacation. Thanks for this note, I'll check the HTTP codes.

AndreaGreco commented 8 years ago

I have the same problem. Here is the Alertmanager log, taken from journalctl on my server.

Oct 07 14:23:38 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:38Z" level=warning msg="Notify attempt 1 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:38 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:38Z" level=warning msg="Notify attempt 2 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:39 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:39Z" level=warning msg="Notify attempt 3 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:40 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:40Z" level=warning msg="Notify attempt 4 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:42 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:42Z" level=warning msg="Notify attempt 5 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:44 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:44Z" level=warning msg="Notify attempt 6 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:49 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:49Z" level=warning msg="Notify attempt 7 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:52 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:52Z" level=warning msg="Notify attempt 8 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:58 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:58Z" level=warning msg="Notify attempt 9 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:05 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:05Z" level=warning msg="Notify attempt 10 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 11 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 12 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 13 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 14 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 15 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"

Log of the Telegram bot:

Alert: {"alerts":[{"annotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"sendsAt":"","generatorURL":"http://mylocalmachinedomain:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D","labels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"startsAt":"2016-10-07T14:22:54.803Z"}],"commonAnnotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"commonLabels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"externalURL":"http://alert.greco.cf/alert-manager","groupKey":946614883222831012,"groupLabels":{"alertname":"Temperature"},"receiver":"Telegram","status":"firing","version":0}
message: %!(EXTRA string=<a href='http://alert.greco.cf/alert-manager/#/alerts?receiver=Telegram'>[FIRING:1]</a>
grouped by: alertname=<pre>Temperature</pre>
labels: job=<pre>StatsD</pre>, severity=<pre>Critical</pre>, instance=<pre>192.168.60.19</pre>
description: <pre>192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)</pre>
summary: <pre>Istance 192.168.60.19</pre>
<a href='http://mylocalmachinedomain:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D'>192.168.60.19[StatsD]</a>)
[GIN] 2016/10/07 - 14:23:37 | 200 |  119.322156ms | 127.0.0.1 |   POST    /alert/-154461500

I also disabled gin's DEBUG option by exporting the appropriate variable. Here I see many retries from Alertmanager. I think the bug is in this program, because I can't find any major issue about it in the main Prometheus or Alertmanager repos, and man-at-home has the same problem. I tried his code, but the Go compiler returns an error. Is there a fix for this?

hryamzik commented 7 years ago

@AndreaGreco I didn't have time to check @man-at-home's code, what error did you get?

AndreaGreco commented 7 years ago

Thanks for the reply, I will try to explain better:

I think this is the chain: Alertmanager sends the alert to telegram_bot; telegram_bot receives the alert and sends the Telegram message; I receive the message in the Telegram chat; telegram_bot returns 400 to Alertmanager; Alertmanager logs an error and retries sending the message.

Result: the Telegram chat receives 100,000 messages, and Alertmanager reports the alert as not sent.

Here is the log again, better formatted. Sorry, last time it was terribly formatted:

host_name alertmanager: time="..." level=error msg="Error on notify: context deadline exceeded" source="notify.go:152"
host_name alertmanager: time="..." level=error msg="Notify for 1 alerts failed: context deadline exceeded" source="dispatch.go:238"
host_name alertmanager: time="..." level=warning msg="Notify attempt 1 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 2 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 3 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 4 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 5 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 6 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 7 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 8 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 9 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 10 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 11 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 12 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 13 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 14 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 15 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"

telegram_bot logs a 200, but Alertmanager gets a 400. Maybe the problem is in gin; take a look at this gin bug: #133. The Telegram bot shows the same log message. Here is my log from the Telegram bot:

Oct 10 10:50:09 host_name prometheus_bot[31505]: Alert: {"alerts":[{"annotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"sendsAt":"","generatorURL":"http://host_name:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D","labels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"startsAt":"2016-10-07T14:22:54.803Z"}],"commonAnnotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"commonLabels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"externalURL":"http://alert.host_name.cf/alert-manager","groupKey":946614883222831012,"groupLabels":{"alertname":"Temperature"},"receiver":"Telegram","status":"firing","version":0}
Oct 10 10:50:09 host_name prometheus_bot[31505]: message: %!(EXTRA string=<a href='http://alert.host_name.cf/alert-manager/#/alerts?receiver=Telegram'>[FIRING:1]</a>
Oct 10 10:50:09 host_name prometheus_bot[31505]: grouped by: alertname=<pre>Temperature</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: labels: job=<pre>StatsD</pre>, severity=<pre>Critical</pre>, instance=<pre>192.168.60.19</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: description: <pre>192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: summary: <pre>Istance 192.168.60.19</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: <a href='http://host_name:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D'>192.168.60.19[StatsD]</a>)
Oct 10 10:50:09 host_name prometheus_bot[31505]: [GIN] 2016/10/10 - 10:50:09 | 200 |  158.262881ms | 127.0.0.1 |   POST    /alert/-154461500
Oct 10 10:50:09 host_name prometheus_bot[31505]: [GIN-debug] [WARNING] Headers were already written. Wanted to override status code 400 with 200

Thank you for the help

Andrea

man-at-home commented 7 years ago

Hi, yes, this is the behavior I had too: the bot tries to set the return code to 200 at the end, but that does not work, and the 400 code set by c.BindJSON(&alerts) is returned instead. Alertmanager keeps retrying on the 400 response again and again.
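
The status-code behavior described here can be reproduced without gin. The sketch below is only an illustration using Go's standard library, not the bot's code: `alertHandler` and `finalStatus` are hypothetical names, and the first `WriteHeader` call stands in for what a failed bind would send. It shows the general `net/http` rule that the first status written wins, which is what gin's "Headers were already written" warning reports.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// alertHandler is a hypothetical stand-in for the bot's handler: it mimics a
// binding step that has already written 400, then tries to answer 200.
func alertHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusBadRequest) // stands in for the status a failed bind sends
	w.WriteHeader(http.StatusOK)         // too late: the status is already fixed
}

// finalStatus runs the handler against an in-memory recorder and returns the
// status code a client such as Alertmanager would see.
func finalStatus() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/alert/-1", strings.NewReader("{}"))
	alertHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(finalStatus()) // prints 400
}
```

The client only ever sees the 400, so from Alertmanager's point of view the notification failed even though the Telegram message went out.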

I hacked the fix in 2 lines (avoided c.BindJSON() so I have the bot working on my installation), but the fix is kind of ugly, so I did not submit this as a patch.
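
The idea behind this workaround is to decode the request body without letting the decoding step send a response, so that the status code is chosen in exactly one place. A rough standard-library analogue is sketched below. It is not the bot's gin code: `handleAlert` and `statusFor` are hypothetical names, and the `Alerts` struct is a cut-down, assumed version that keeps only two fields seen in the logged payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Alerts is a hypothetical, cut-down payload type; the real bot defines more fields.
type Alerts struct {
	Receiver string `json:"receiver"`
	Status   string `json:"status"`
}

// handleAlert decodes the body itself, so the only status written is the one
// chosen here: 400 for malformed JSON, 200 otherwise.
func handleAlert(w http.ResponseWriter, r *http.Request) {
	var alerts Alerts
	if err := json.NewDecoder(r.Body).Decode(&alerts); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// The Telegram message would be sent here.
	w.WriteHeader(http.StatusOK)
}

// statusFor posts body to handleAlert in memory and returns the resulting status.
func statusFor(body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/alert/-1", strings.NewReader(body))
	handleAlert(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"receiver":"Telegram","status":"firing"}`)) // prints 200
	fmt.Println(statusFor(`not json`))                                  // prints 400
}
```

With this shape a successfully handled alert returns 200, so a retrying sender has no reason to resend it.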

AndreaGreco commented 7 years ago

@man-at-home I tried to copy and paste your code, but it does not compile. I don't know Go, so I am probably missing something.

man-at-home commented 7 years ago

I am at work now. I will try to put it into a small patch after work, but it should be a one-line change only (ah, possibly plus one more import for the binding package...)

AndreaGreco commented 7 years ago

@man-at-home The compiler reports that the symbol is not defined:

# command-line-arguments
./main.go:106: undefined: binding in binding.JSON

man-at-home commented 7 years ago

Yes. Below the line "github.com/gin-gonic/gin" (line 7), add "github.com/gin-gonic/gin/binding".

hryamzik commented 7 years ago

@man-at-home why don't you fire a PR?

AndreaGreco commented 7 years ago

I am still testing my Prometheus configuration, but it has just stopped sending 100,000 messages over Telegram.

We are waiting for your PR.

Thank you all for the help.

Andrea

man-at-home commented 7 years ago

ok, try https://github.com/inCaller/prometheus_bot/pull/2 , hope it helps.

hryamzik commented 7 years ago

Closing.