Feature: Dead Man's Switch

ansuz07 commented 3 years ago

Have DB ping an external status monitor periodically (e.g. hourly) to inform that the bot is still alive.

I run a status monitor that we can use, so it would just need to send a GET request to this address: https://redacted/ping/82929bbc-1a19-49ef-86d4-950664a54192

Example code below for C#:

using (var client = new System.Net.WebClient())
{
    // A plain GET is enough; the response body is ignored.
    client.DownloadString("https://redacted/ping/82929bbc-1a19-49ef-86d4-950664a54192");
}

hacksoncode commented 3 years ago

Interesting idea!

I believe it already restarts itself periodically, which is why we rarely have to restart it by hand these days.

Not sure how helpful this will be, both for that reason and because almost all of our "outages" have actually been bugs where it just stops processing a certain subset of things, like the PM handler. It's still "alive", just not working right.

Also... since it's running on an Azure VM... I would expect the bot itself to be far more reliable than the status monitor machine.

None of which is necessarily a reason not to do it at some point.

ansuz07 commented 3 years ago

I've already got a monitor hacked together to try to combat that issue - I monitor the DB RSS feed from Reddit and if it doesn't see activity for ~6 hours it will alert me (that's how I knew about the outage this morning).

Just wanted to see if there were any other options to keep track of the activity. You are probably right about the status monitor being less reliable (since I run it) but I don't see the harm in a false positive every now and then.

I see your point about each of the individual modules getting hung - spitballing here - could it send the GET request at the start of the module? This way, we know if any individual module hangs and can restart the bot. (I can gen up as many of those links as we require).
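
A minimal sketch of the per-module idea, assuming one check-in link per module. The class name `ModulePinger`, the module names, and the URLs are all hypothetical; nothing here is taken from the bot's code. Each module would call `PingAsync` at the top of its work loop, so a module that hangs simply stops checking in on its own link.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: one check-in URL per module.
public sealed class ModulePinger
{
    private readonly IReadOnlyDictionary<string, Uri> _urls;
    private readonly HttpClient _client;

    public ModulePinger(IReadOnlyDictionary<string, Uri> urls, HttpClient client)
    {
        _urls = urls;
        _client = client;
    }

    // True if a check-in URL was configured for this module.
    public bool TryGetUrl(string moduleName, out Uri url) =>
        _urls.TryGetValue(moduleName, out url);

    // Call at the start of a module's pass. Returns false instead of
    // throwing, so a failed ping cannot stop the module itself.
    public async Task<bool> PingAsync(string moduleName)
    {
        if (!_urls.TryGetValue(moduleName, out var url)) return false;
        try
        {
            using var response = await _client.GetAsync(url);
            return response.IsSuccessStatusCode;
        }
        catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
        {
            return false; // monitor unreachable or request timed out
        }
    }
}
```

With this shape, the monitor side needs one link per module, and an alert names the module that went quiet rather than the bot as a whole.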

hacksoncode commented 3 years ago

Has the outage this morning been reported to Hallidev? My not noticing it is a good argument for your proposal if it would have actually caught it...

On Mon, May 3, 2021 at 8:27 AM ansuz07 @.***> wrote:

I've already got a monitor hacked together to try to combat that issue - I monitor the DB RSS feed from Reddit and if it doesn't see activity for ~6 hours it will alert me (that's how I knew about the outage this morning).

Just wanted to see if there were any other options to keep track of the activity. You are probably right about the status monitor being less reliable (since I run it) but I don't see the harm in a false positive every now and then.


-- Ray

ansuz07 commented 3 years ago

Yeah - I talked to Hallidev this morning and got it sorted first thing.

I don't know exactly what went wrong or if my initial proposal would have caught it - I only knew because my RSS feed monitor pinged me. It works, but it is a hacky solution - I have it set to 6 hours so I don't get false positives due to lack of delta activity. I just wanted to explore the option of having DB ping directly so we'd know sooner.

hacksoncode commented 3 years ago

Yeah... I guess my main concern is that we'd want to be sure that it wouldn't itself have a chance of bringing down (or massively slowing down) DB if the status monitor went down or went berserk.

My first thought was: Well just fire up another thread to do it asynchronously so it can't get "blocked" by some issue with the request to the status server... but... that has the obvious flaw that this thread will probably keep running regardless of whether DB is doing its job or not.

But maybe we could think of a method that has better risk/reward.
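
One possible way to get a better risk/reward, sketched under assumptions: the worker records a heartbeat each time it finishes a pass, and the separate pinger thread only checks in with the monitor while that heartbeat is fresh. A stuck worker then stops the pings even though the pinger thread itself is still running. The `Heartbeat` class and the 10-minute window in the test are hypothetical and are not something the thread settled on.

```csharp
using System;
using System.Threading;

// Hypothetical heartbeat shared between the worker and the pinger thread.
public sealed class Heartbeat
{
    // UTC ticks of the most recent Beat(); 0 means "never".
    private long _lastBeatTicks;

    // Called by the worker after each completed pass.
    public void Beat(DateTime utcNow) =>
        Interlocked.Exchange(ref _lastBeatTicks, utcNow.Ticks);

    // Called by the pinger: only ping the monitor when this returns true.
    public bool IsFresh(DateTime utcNow, TimeSpan maxAge)
    {
        long last = Interlocked.Read(ref _lastBeatTicks);
        return last != 0 && utcNow.Ticks - last <= maxAge.Ticks;
    }
}
```

The pinger can still run asynchronously, so a slow or dead status server never blocks the worker, while the freshness check ties the ping to real progress.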

ansuz07 commented 3 years ago

I use this solution to monitor the various apps in my personal stack. It just uses a simple GET request to the web address (I use curl or wget on my Linux boxes) - if the address gets the request every period, it doesn't ping me, but if it fails to get the request for a period of time then it does. All the heavy lifting is done by the status monitor and the monitor doesn't send or request anything from the VMs themselves. So long as the bit of code we insert can deal with a standard HTTP 404 error and not crash it shouldn't be a big issue.
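
A rough sketch of a ping that copes with a 404 (or an unreachable monitor) without crashing, using `HttpClient` with a per-request timeout. `HealthPing.TrySendAsync` and `StubHandler` are hypothetical names; the stub stands in for the status monitor so the failure cases can be tried without touching the network.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class HealthPing
{
    // Sends one GET and reports success. Network and HTTP failures
    // come back as false rather than as exceptions.
    public static async Task<bool> TrySendAsync(HttpClient client, Uri url, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            using var response = await client.GetAsync(url, cts.Token);
            return response.IsSuccessStatusCode; // a 404 page just yields false
        }
        catch (HttpRequestException) { return false; }       // DNS failure, refused connection, TLS error
        catch (OperationCanceledException) { return false; } // request timed out
    }
}

// Test double standing in for the status monitor (hypothetical).
public sealed class StubHandler : HttpMessageHandler
{
    private readonly Func<HttpResponseMessage> _respond;

    public StubHandler(Func<HttpResponseMessage> respond) => _respond = respond;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        try { return Task.FromResult(_respond()); }
        catch (Exception ex) { return Task.FromException<HttpResponseMessage>(ex); }
    }
}
```

Passing `new HttpClient(new StubHandler(() => new HttpResponseMessage(HttpStatusCode.NotFound)))` simulates the 404 case; a stub that throws `HttpRequestException` simulates a monitor that is down.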

That said, it was just one idea that I thought might be easy to implement since I'm already running the status server for other uses.

hallidev commented 3 years ago

I think the idea is that it would be a dead man's switch of sorts. It would run on a separate thread like @hacksoncode says and if it didn't periodically check in, we'd at least know that the bot wasn't running. It wouldn't say anything about whether the bot was running properly. It would just give us a preemptive heads up that it's not running at all.

It's not a big lift, so I think it's worth it. Once I get a few minutes I'll get this in there.

ansuz07 commented 3 years ago

Cool. It's all set up on my end - all you need to do is send a GET request.

Let me know if/when you get it set up and what the interval is.

hallidev commented 3 years ago

Will do. I have the link - you should edit it out of your comment

ansuz07 commented 3 years ago

Done - thanks.

ansuz07 commented 3 years ago

As an FYI, I'm starting to see pings on my end

hallidev commented 3 years ago

Yeah I just deployed a change with health pings every 5 minutes. Not sure how the warning is set up on your side, but if the bot doesn't check in for 10 minutes, it's probably worth taking a look.
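
For illustration only, the 5-minute ping and the 10-minute rule of thumb could look roughly like the following. The deployed code is not shown in this thread, so `PingLoop`, `RunAsync`, and `IsOverdue` are hypothetical names, and the loop shape is an assumption.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PingLoop
{
    // Bot side: call sendPing, wait one interval, repeat until cancelled.
    // Any error from sendPing is swallowed so the loop keeps going.
    public static async Task RunAsync(Func<Task> sendPing, TimeSpan interval, CancellationToken stop)
    {
        while (!stop.IsCancellationRequested)
        {
            try { await sendPing(); }
            catch (Exception) { /* health reporting must never take the bot down */ }

            try { await Task.Delay(interval, stop); }
            catch (OperationCanceledException) { break; }
        }
    }

    // Monitor side: true once the last check-in is older than the grace period.
    public static bool IsOverdue(DateTime lastPingUtc, DateTime nowUtc, TimeSpan grace) =>
        nowUtc - lastPingUtc > grace;
}
```

With `interval` at 5 minutes and `grace` at 10 minutes, a single missed or delayed ping does not raise an alert, but two in a row do.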

ansuz07 commented 3 years ago

Great. I'll finish configuring on my end.

hacksoncode commented 3 years ago

Hate to be an SQA pest... and maybe this already happened... but could we test the case where the status server goes down and make sure we don't have any obvious major issues like large memory leaks, too many queued requests, deadlocks/thread issues, etc.?

I think it's reasonably safe to ignore the case that the status server goes berserk and does something unexpectedly malicious... at least for now.

Probably the easiest way to test that is just change the address to some known nonexistent server and maybe some server that exists but has a big complex 404 page (to simulate forgetting to renew DDNS or something).
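
On the "too many queued requests" worry specifically, one guard that could be exercised in such a test is to allow at most one ping in flight and skip the next one if the previous request has not finished. This is a sketch of that idea only; `SingleFlight` is a hypothetical name, and nothing in the thread says the bot works this way.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical guard: at most one action in flight at a time.
public sealed class SingleFlight
{
    private int _busy; // 0 = idle, 1 = an action is running

    // Runs the action unless a previous run is still in progress.
    // Returns false when the call was skipped.
    public async Task<bool> TryRunAsync(Func<Task> action)
    {
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0) return false;
        try
        {
            await action();
            return true;
        }
        finally
        {
            Interlocked.Exchange(ref _busy, 0);
        }
    }
}
```

If the status server hangs, later pings are dropped instead of piling up behind the stuck request, which bounds both memory use and the number of open connections.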

hallidev commented 3 years ago

I purposefully wrote it to swallow any errors coming from the ping process. If you look in appsettings.json, you'll see that it's set to a fake URL in the repo. That's what I used to test with. Any problems with the health check URL definitely won't interrupt the bot.

hallidev / delta-bot-four

Feature: Dead Man's Switch #21