Closed man-at-home closed 7 years ago
@man-at-home sorry for the late response, I just got back from vacation. Thanks for the note, I'll check the HTTP codes.
I have the same problem. Here is the Alertmanager log, taken from journalctl on my server:
Oct 07 14:23:38 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:38Z" level=warning msg="Notify attempt 1 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:38 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:38Z" level=warning msg="Notify attempt 2 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:39 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:39Z" level=warning msg="Notify attempt 3 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:40 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:40Z" level=warning msg="Notify attempt 4 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:42 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:42Z" level=warning msg="Notify attempt 5 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:44 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:44Z" level=warning msg="Notify attempt 6 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:49 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:49Z" level=warning msg="Notify attempt 7 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:52 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:52Z" level=warning msg="Notify attempt 8 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:23:58 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:23:58Z" level=warning msg="Notify attempt 9 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:05 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:05Z" level=warning msg="Notify attempt 10 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 11 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 12 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 13 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 14 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Oct 07 14:24:21 myMachineName.mydomain alertmanager[18803]: time="2016-10-07T14:24:21Z" level=warning msg="Notify attempt 15 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-mycharID" source="notify.go:193"
Log of the Telegram bot:
Alert: {"alerts":[{"annotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"sendsAt":"","generatorURL":"http://mylocalmachinedomain:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D","labels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"startsAt":"2016-10-07T14:22:54.803Z"}],"commonAnnotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"commonLabels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"externalURL":"http://alert.greco.cf/alert-manager","groupKey":946614883222831012,"groupLabels":{"alertname":"Temperature"},"receiver":"Telegram","status":"firing","version":0}
message: %!(EXTRA string=<a href='http://alert.greco.cf/alert-manager/#/alerts?receiver=Telegram'>[FIRING:1]</a>
grouped by: alertname=<pre>Temperature</pre>
labels: job=<pre>StatsD</pre>, severity=<pre>Critical</pre>, instance=<pre>192.168.60.19</pre>
description: <pre>192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)</pre>
summary: <pre>Istance 192.168.60.19</pre>
<a href='http://mylocalmachinedomain:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D'>192.168.60.19[StatsD]</a>)
[GIN] 2016/10/07 - 14:23:37 | 200 | 119.322156ms | 127.0.0.1 | POST /alert/-154461500
I also disabled the gin DEBUG option by exporting the appropriate variable. Here I see many retries from Alertmanager. I think the bug is in this program, because I can't find any related issue in the main Prometheus or Alertmanager repos, and man-at-home has the same issue. I tried his code, but the Go compiler returns an error. Is there a fix for this?
@AndreaGreco I didn't have time to check @man-at-home's code, what error did you get?
Thanks for the reply, I'll try to explain better:
I think this is the chain: alertmanager sends the alert to telegram_bot; telegram_bot receives the alert and sends the Telegram message; I receive the message in the Telegram chat; telegram_bot returns 400 to alertmanager; alertmanager logs an error and retries sending the message.
Result: the Telegram chat receives 100.000 messages, and the alert is never recorded as sent.
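The chain above can be reproduced in miniature. The following is only a sketch using the Go standard library, not Alertmanager's or the bot's actual code: `notifyWithRetry`, `simulateBuggyReceiver`, and the `/alert/-1` path are hypothetical names, and the real Alertmanager retries with backoff until a deadline rather than for a fixed number of attempts. It shows why a receiver that delivers the message but answers 400 produces one duplicate per retry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// notifyWithRetry calls the receiver handler until it answers with a 2xx
// status or maxAttempts is reached. It returns the number of attempts made.
// (Hypothetical stand-in for a webhook sender; no real network is used.)
func notifyWithRetry(h http.Handler, payload string, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req := httptest.NewRequest(http.MethodPost, "/alert/-1", strings.NewReader(payload))
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		if rec.Code/100 == 2 {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts", maxAttempts)
}

// simulateBuggyReceiver models a receiver that forwards the message
// successfully but still reports 400 to the sender.
func simulateBuggyReceiver(maxAttempts int) (delivered, attempts int, err error) {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		delivered++ // stands in for "message sent to the Telegram chat"
		w.WriteHeader(http.StatusBadRequest)
	})
	attempts, err = notifyWithRetry(h, `{"status":"firing"}`, maxAttempts)
	return delivered, attempts, err
}

func main() {
	delivered, attempts, err := simulateBuggyReceiver(5)
	// prints: 5 5 gave up after 5 attempts
	fmt.Println(delivered, attempts, err)
}
```

Every attempt counts as a delivery, so the chat fills up while the sender still considers the notification failed.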
Here is the log with better formatting; sorry, last time it was terribly formatted:
host_name alertmanager: time="..." level=error msg="Error on notify: context deadline exceeded" source="notify.go:152"
host_name alertmanager: time="..." level=error msg="Notify for 1 alerts failed: context deadline exceeded" source="dispatch.go:238"
host_name alertmanager: time="..." level=warning msg="Notify attempt 1 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 2 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 3 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 4 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 5 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 6 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 7 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 8 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 9 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 10 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 11 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 12 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 13 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 14 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
host_name alertmanager: time="..." level=warning msg="Notify attempt 15 failed: unexpected status code 400 from http://127.0.0.1:9087/alert/-154461500" source="notify.go:193"
telegram_bot logs 200, but alertmanager gets 400. Maybe the problem is in GIN; take a look at this GIN bug: #133. The telegram bot shows the same log message. Here is my log from the telegram bot:
Oct 10 10:50:09 host_name prometheus_bot[31505]: Alert: {"alerts":[{"annotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"sendsAt":"","generatorURL":"http://host_name:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D","labels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"startsAt":"2016-10-07T14:22:54.803Z"}],"commonAnnotations":{"description":"192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)","summary":"Istance 192.168.60.19"},"commonLabels":{"alertname":"Temperature","instance":"192.168.60.19","job":"StatsD","severity":"Critical"},"externalURL":"http://alert.host_name.cf/alert-manager","groupKey":946614883222831012,"groupLabels":{"alertname":"Temperature"},"receiver":"Telegram","status":"firing","version":0}
Oct 10 10:50:09 host_name prometheus_bot[31505]: message: %!(EXTRA string=<a href='http://alert.host_name.cf/alert-manager/#/alerts?receiver=Telegram'>[FIRING:1]</a>
Oct 10 10:50:09 host_name prometheus_bot[31505]: grouped by: alertname=<pre>Temperature</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: labels: job=<pre>StatsD</pre>, severity=<pre>Critical</pre>, instance=<pre>192.168.60.19</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: description: <pre>192.168.60.19 Temperature of CPU : (current value: 160.596264867°C)</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: summary: <pre>Istance 192.168.60.19</pre>
Oct 10 10:50:09 host_name prometheus_bot[31505]: <a href='http://host_name:9090/graph#%5B%7B%22expr%22%3A%22raspb_temperatura%20%3E%2050%22%2C%22tab%22%3A0%7D%5D'>192.168.60.19[StatsD]</a>)
Oct 10 10:50:09 host_name prometheus_bot[31505]: [GIN] 2016/10/10 - 10:50:09 | 200 | 158.262881ms | 127.0.0.1 | POST /alert/-154461500
Oct 10 10:50:09 host_name prometheus_bot[31505]: [GIN-debug] [WARNING] Headers were already written. Wanted to override status code 400 with 200
Thank you for your help
Andrea
Hi, yes, this is the behavior I had too: the bot tries to set the return code to 200 at the end, but that does not work, and the 400 set by c.BindJSON(&alerts) is returned instead. Alertmanager then keeps retrying on the 400 response again and again.
I hacked the fix in 2 lines (avoided c.BindJSON() so I have the bot working on my installation), but the fix is kind of ugly, so I did not submit this as a patch.
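To illustrate the behavior described above, here is a minimal sketch using only the Go standard library. It is not the bot's code and does not use gin: `alertPayload`, `firstStatusWins`, `decodeThenRespond`, `statusOf`, and the `/alert/-1` path are hypothetical names chosen for the example. The first handler writes 400 and then 200 and shows that the first status wins; the second decodes the body itself, so the handler decides on exactly one status code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// alertPayload is a hypothetical, trimmed-down view of the webhook body.
type alertPayload struct {
	Receiver string `json:"receiver"`
	Status   string `json:"status"`
}

// firstStatusWins writes 400 and then tries to overwrite it with 200.
func firstStatusWins(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusBadRequest) // stands in for a bind helper that already answered 400
	w.WriteHeader(http.StatusOK)         // too late: the status was already written
}

// decodeThenRespond decodes the JSON body itself and writes one status only.
func decodeThenRespond(w http.ResponseWriter, r *http.Request) {
	var p alertPayload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// Forwarding the alert would happen here.
	w.WriteHeader(http.StatusOK)
}

// statusOf runs handler h against body and returns the recorded status code.
func statusOf(h http.HandlerFunc, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/alert/-1", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h(rec, req)
	return rec.Code
}

func main() {
	body := `{"receiver":"Telegram","status":"firing"}`
	fmt.Println(statusOf(firstStatusWins, body))   // 400
	fmt.Println(statusOf(decodeThenRespond, body)) // 200
	fmt.Println(statusOf(decodeThenRespond, "{"))  // 400
}
```

The point of the second handler is that the error path returns early, so a later success status can never collide with an earlier failure status.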
@man-at-home I tried to copy and paste your code, but it does not compile. I don't know Go, so I am probably missing something.
I am at work now; I'll try to put it into a small patch after work, but it should be a one-line change only (ah, possibly one more import for the binding namespace...)
@man-at-home It reports that the symbol is not defined:
# command-line-arguments
./main.go:106: undefined: binding in binding.JSON
Yes. Add "github.com/gin-gonic/gin/binding" below the line "github.com/gin-gonic/gin" (line 7).
@man-at-home why don't you fire a PR?
I'm going to test my Prometheus configuration, but it has already stopped sending 100.000 messages over Telegram.
We are waiting for your PR.
Thank you all for your help.
Andrea
ok, try https://github.com/inCaller/prometheus_bot/pull/2 , hope it helps.
Closing.
thank you for your work. I am using this bot now in my prometheus installation.
I had to change the deserialization of alerts for my setup. Although it worked, it always had a 400 HTTP error set, and at least in my Windows environment the subsequent "c.AbortWithStatus(http.StatusOK)" would not work, so alertmanager would get a 400 back and retry the message endlessly. So I changed: