loansindi / ps1rfid

Golang RFID authentication for ps1auth
GNU General Public License v3.0

Have an alerting system for things like member lookup failures #34

Open codersquid opened 7 years ago

codersquid commented 7 years ago

It would be good for the go server to fire off an alert when it hits errors when trying to contact the member database. The alerts could be aggregated by something on the network, but should also show up on a physical display or blinkenlight in case the network is down.

loansindi commented 7 years ago

I don't know if a blinkenlight would be useful; it'd probably just generate emails saying "the light is blinking," which wouldn't be all that helpful.

So many of our network outages are brief but catastrophic that it's hard to plan around them. Shooting off an email is at least a good first step. Are there still free email providers like Mandrill? (I think they eliminated their free service recently.)

codersquid commented 7 years ago

I am just brainstorming here.

I am not sure about the blinkenlight, but some hardware thing would be good? What about an e-ink display that lists some status stuff? Because you are right about temporary failures: things that alert all the time are annoying as hell. So some transient thing that just shows a status description is probably good.

For email, I think maybe let something else handle it. The go server can write to syslog, like other things do, and there are tools that take syslog messages and ship them off to be collected. We can have something that sits around aggregating messages. The go server can, for example, log non-2xx responses from ps1auth, and the aggregator can decide at some threshold that it's important enough to send a message.

I think at some point kuroishi or someone set up a Nagios thing too, which has some plugins for doing HTTP checks, and we can have it hit some endpoint on the go server (and other things, like an endpoint on the ps1auth site).

For email, I've used the one that comes free with Rackspace. It has a freemium model. You have to watch out that things don't get classified as spam (I seem to remember that happening). I hate email.
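Something like the following could serve as the endpoint an HTTP check hits. It is only a sketch: the `health` type, the `/healthz` path, and the choice of 503 for "last lookup failed" are assumptions, and the go server would have to call `set` after each ps1auth lookup. `main` exercises the handler with `net/http/httptest` rather than opening a real port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// health (hypothetical) records whether the most recent member lookup worked.
type health struct {
	mu     sync.Mutex
	lastOK bool
	detail string
}

// set stores the outcome of the latest lookup.
func (h *health) set(ok bool, detail string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastOK = ok
	h.detail = detail
}

// ServeHTTP answers 200 when the last lookup succeeded and 503 otherwise,
// so an external HTTP check can tell the two states apart by status code.
func (h *health) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	ok, detail := h.lastOK, h.detail
	h.mu.Unlock()
	if !ok {
		http.Error(w, "unhealthy: "+detail, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probe sends one in-memory GET to the handler and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	h := &health{}
	h.set(true, "")
	fmt.Println("after a good lookup:", probe(h))
	h.set(false, "ps1auth returned 502")
	fmt.Println("after a bad lookup:", probe(h))
}
```

In the real server this would be registered with something like `http.Handle("/healthz", h)`. Note that a fresh `health` value reports unhealthy until the first successful lookup, which may or may not be what we want at startup.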

codersquid commented 7 years ago

Sorry for handwaving. Anyway, when the bbb is off the network, a physically connected display/something/counter might be helpful, but maybe with a label next to it giving a legend for what the stuff means.

For when the bbb is back on the network... handwavy again. I've used collectd to send messages to graphite. collectd has a plugin that can tail logs and send off messages. It has been a while, so I don't remember how things worked. Maybe there is a setting to have it retry until the network is back. That way it has a reasonable chance of collecting data (and if data gets lost, oh well. This isn't a pacemaker, right?). But it might give us enough to go on when trying to figure stuff out.

loansindi commented 7 years ago

Some kind of display wouldn't be bad, for sure. I like the ideas, just spitballing
