Implement health checks

p2r3 / epochtal

Portal 2 tournament framework

https://epochtal.p2r3.com/

GNU General Public License v3.0


Implement health checks #56

Open soni801 opened 4 months ago

soni801 commented 4 months ago

Having a healthcheck is not strictly necessary, but it can be a nice thing to have for uptime monitoring and similar stuff. The idea is that we can call a command, and it reports whether the running epochtal instance is in a working state or not.

I'd really like to have this for the docker container (#43), which means we need to write it as a command that can be run from the terminal (for example with bun run) and that reports the health state of the container through the process exit code:

0: success - the container is healthy and ready for use
1: unhealthy - the container isn't working correctly

Here's the docker reference to this. Don't worry about the docker implementation, I'll do that, but it'd be nice if someone could "quickly" throw together a small script that reports the health as specified.

Edit: I'll do this too if no one else wants to, but I'm notoriously slow at figuring out the best way to implement stuff like this.
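
For illustration, the exit-code contract described in this comment (0 for healthy, 1 for unhealthy) could be sketched roughly as below. This is only a sketch under assumptions: `Check` and `healthExitCode` are hypothetical names, not part of epochtal, and the actual checks are left to the caller.

```typescript
// Hypothetical helper, not epochtal's API: a check is any async function
// that resolves when its part of the system is healthy and throws otherwise.
type Check = () => Promise<void>;

/** Runs every check; resolves to 0 if all pass, 1 if any throws or rejects. */
async function healthExitCode(checks: Check[]): Promise<number> {
  // The async wrapper turns a synchronous throw into a rejection as well.
  const results = await Promise.allSettled(checks.map(async (check) => check()));
  for (const result of results) {
    if (result.status === "rejected") {
      console.error("health check failed:", result.reason);
    }
  }
  return results.some((result) => result.status === "rejected") ? 1 : 0;
}
```

A script started with bun run could then end with `process.exit(await healthExitCode(checks))`, so the container's health command only has to look at the exit code.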

soni801 commented 4 months ago

@PancakeTAS would you be willing to look at this?

PancakeTAS commented 4 months ago

What would a healthcheck do other than test if the epochtal process is running? Epochtal doesn't even have a state machine at its core or anything of that sort so how can it not be running?

soni801 commented 4 months ago

Well, I don't know the specifics of all the parts of epochtal that need to be working at the same time, but I'd imagine running the following checks:

Try to fetch something from the web server
Try to download the spplice package
Maybe check that no critical routines or anything have errored out?
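
The first two checks in this list could look something like the sketch below. It assumes both the web server and the spplice package are reachable over HTTP. `checkUrl`, `webChecks` and the `packageUrl` parameter are hypothetical; the real package location isn't given in this thread, so it is left as an argument.

```typescript
// Minimal fetch shape so a fake can be injected; the global fetch satisfies it.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

/** Resolves if the URL answers with a 2xx status in time, throws otherwise. */
async function checkUrl(
  url: string,
  timeoutMs = 5000,
  fetchImpl: FetchLike = fetch,
): Promise<void> {
  // AbortSignal.timeout aborts the request if the server hangs.
  const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok) throw new Error(`${url} responded with HTTP ${res.status}`);
}

/** Builds the two HTTP checks; packageUrl is an assumption supplied by the caller. */
function webChecks(
  siteUrl: string,
  packageUrl: string,
  fetchImpl: FetchLike = fetch,
): Array<() => Promise<void>> {
  return [
    () => checkUrl(siteUrl, 5000, fetchImpl), // "fetch something from the web server"
    () => checkUrl(packageUrl, 5000, fetchImpl), // "download the spplice package"
  ];
}
```

The third item (critical routines erroring out) depends on how epochtal tracks routine failures, which isn't described here, so it is left out of the sketch.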

soni801 commented 4 months ago

It also occurred to me now that this could be integrated with #54, but I really am not going to fuck with that. That's gonna be a huge MAYBE sometime in the future.

p2r3 commented 4 months ago

I fully support the idea of a health check, I've wanted to tackle that for a while. The way I see it, you could have a script kind of "simulate" running through the concludeWeek and releaseMap routines to check if there are any potential issues to be encountered. Same could go for run submission - just a script that checks if everything required for submission is working.

A quick and dirty (but accurate!) implementation of this could be running said routines on temporary contexts which mimic the currently active epochtal context and seeing if that runs into any issues.

soni801 commented 4 months ago

I like that approach! However, I'm a bit worried that it'll be more resource intensive than it's worth - on a project that gets as much traffic throughout the entire week as epochtal does, I'd recommend running a health check at least every 10-30 minutes. This way we can get a somewhat immediate notification if anything goes wrong.

For your idea, maybe adding an optional dry-run parameter to some/all utils is a good idea. If this parameter is true, it doesn't actually modify anything but still reports whether it's successful or not. I think that'd be a really clean approach - then we should also be able to just directly call the routine without messing around with new contexts(?)

p2r3 commented 4 months ago

You already can do a dry run! Most if not all utils just won't write or read files if you don't have the respective context.file. entry. They'll just write to the object and report success.

This was done for this very reason of creating temporary contexts and handling them with standard utils.
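
As a rough illustration of that idea: copy the active context's in-memory state, leave out the file entries, and run a routine against the copy. Everything in this sketch is an assumption. The `Context` shape, `makeDryRunContext` and `dryRun` are hypothetical and don't reflect epochtal's actual context layout, which isn't shown in this thread.

```typescript
// Hypothetical shape: epochtal's real context layout is not shown in this thread.
interface Context {
  file?: Record<string, string>; // paths the utils would read from or write to
  data: Record<string, unknown>; // in-memory state
}

/** Deep-copies the in-memory state and omits `file`, so utils that skip
 *  disk access when the entry is missing only ever touch the copy. */
function makeDryRunContext(active: Context): Context {
  return { data: structuredClone(active.data) };
}

/** Runs a routine on a throwaway context; true if it finished without throwing. */
async function dryRun(
  active: Context,
  routine: (ctx: Context) => unknown,
): Promise<boolean> {
  try {
    await routine(makeDryRunContext(active));
    return true;
  } catch (err) {
    console.error("dry run failed:", err);
    return false;
  }
}
```

A health check could then call something like `dryRun(activeContext, someRoutine)` for each routine it cares about and treat a `false` result as unhealthy.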