Healthchecks options for monitoring

brandtkeller commented 3 years ago

Less of a request and more of general knowledge gathering (if available).

I am writing my own orchestration for this image on kubernetes following some design standards that I feel are important/different from other implementations.

As such, I was curious if there are any exposed options for monitoring the health of the server?

These could be checks that answer questions such as "Has the game server fully started?" and "Is the game server still running?"

Any help would be appreciated! Thanks

sisu4u commented 3 years ago

Not sure that's easily possible. k8s does not support UDP health checks: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#tcp-probes. An alternative would be a script inside the container that checks the server connection when called and returns a non-zero exit code if and only if the check fails. K8s can then use this script as a health check.

lloesche commented 3 years ago

@brandtkeller I just added something for you in #173 that might be helpful. The process supervisor I'm using inside the container provides an XML-RPC API. This API can provide status information about the container's different services, the most interesting one being valheim-server I suppose. Details here: https://github.com/lloesche/valheim-server-docker#supervisor

I know, XML-RPC, likely not something you can plug natively into a livenessProbe but I figured baby steps. We'll see if it is useful for this issue and if not we can always look for other options.

A simpler readinessProbe could also be to just check if /var/run/valheim-server.pid exists. If that's the case the valheim_server is running. That does not mean however it is currently accepting connections. It could be in the process of starting up. It could also be constantly crashing and in a restart loop.

Then for pretty high reliability that Valheim server is running and ready to accept connections there is /proc/net/udp6 and /proc/net/udp.

You could check if anything is bound on $SERVER_PORT. To find out, take the second field of each line (the local address), split it at the colon, convert the part after the colon from hex to decimal, and compare it to $SERVER_PORT. A one-liner could look like this:

awk -v search_port=2456 'BEGIN {ec=1} {if ($1 ~ /^[0-9]/) {split($2, local_bind, ":"); port=sprintf("%d", "0x" local_bind[2]); if (port == search_port){ec=0}}} END {exit ec}' /proc/net/udp*

If this exits 0, a process is listening on search_port; if it exits 1, no process bound to this port was found.
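For anyone who would rather not depend on a particular awk implementation's hex handling, the same check can be sketched in Python, which is already available in the container. This is only an illustration of the idea above: the function names `udp_port_bound` and `check` are made up for this example and are not part of the image.

```python
def udp_port_bound(port, table_text):
    """Return True if any socket in a /proc/net/udp-style table is bound to port."""
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "344:"; the header row does not.
        if len(fields) < 2 or not fields[0][0].isdigit():
            continue
        # local_address looks like "00000000:0998" (hex address, hex port).
        _, _, hex_port = fields[1].rpartition(":")
        if int(hex_port, 16) == port:
            return True
    return False


def check(port, paths=("/proc/net/udp", "/proc/net/udp6")):
    """Exit-code style result: 0 if the port is bound, 1 otherwise."""
    for path in paths:
        try:
            with open(path) as handle:
                if udp_port_bound(port, handle.read()):
                    return 0
        except OSError:
            continue  # e.g. /proc/net/udp6 is missing when IPv6 is disabled
    return 1
```

A probe script could then end with `sys.exit(check(2456))`, so it behaves like the awk one-liner.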

lloesche commented 3 years ago

Thinking about it, I could just integrate that last check into the valheim-server start script and produce a more easily consumable status file for you.

What would be better for your use case? If I created e.g. a file /opt/valheim/status with content like starting, ready, stopping or if there were separate files to check for like /opt/valheim/status/starting, /opt/valheim/status/ready, /opt/valheim/status/stopping etc.?

Also I wonder if I should create them in /opt/valheim or /var/run, since some users might start multiple copies of the server and share a single /opt/valheim volume between them (which is completely unsupported and might lead to race conditions during update checks, but in all likelihood will work regardless).

brandtkeller commented 3 years ago

Appreciate the time you invested in considering this enhancement. This is one of the few major items I have left for a concise orchestration of the image.

With regards to:

> What would be better for your use case? If I created e.g. a file /opt/valheim/status with content like starting, ready, stopping or if there were separate files to check for like /opt/valheim/status/starting, /opt/valheim/status/ready, /opt/valheim/status/stopping etc.?

I can implement a very straight-forward readiness/liveness probe that reads from the single /opt/valheim/status (or other location) but can adapt if a multi-file strategy is adopted.

lloesche commented 3 years ago

@brandtkeller just merged #185

Adds an undocumented env variable SERVER_STATUS_FILE which defaults to /var/run/valheim-server.status

The file contains one of the following

bootstrapping
starting
running
stopping
stopped
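A readiness or liveness check built on this file could be as small as the sketch below. It assumes the default path mentioned above and treats only `running` as healthy; the helper names `read_status` and `probe_exit_code` are hypothetical and chosen for this example.

```python
# Default of SERVER_STATUS_FILE as described above.
STATUS_FILE = "/var/run/valheim-server.status"


def read_status(path=STATUS_FILE):
    """Return the stripped status word, or None if the file cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def probe_exit_code(status, healthy=("running",)):
    """Map a status word to a probe exit code: 0 = healthy, 1 = not healthy."""
    return 0 if status in healthy else 1
```

A probe script would call `sys.exit(probe_exit_code(read_status()))`. A liveness probe might want to pass a wider `healthy` tuple (for example including `starting`) so the container is not restarted while the server is still coming up; that choice depends on the orchestration.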

sisu4u commented 3 years ago

Because the Valheim server uses Valve's Server Query Protocol, a simple probe could be sent to port 2457. There is, for example, a Python library that makes this really easy:

import a2s
a2s.info(("localhost", 2457))
# output
SourceInfo(protocol=17, server_name='xxx', map_name='xxx', folder='valheim', game='', app_id=0, player_count=0, max_players=64, bot_count=0, server_type='d', platform='l', password_protected=True, vac_enabled=False, version='1.0.0.0', edf=177, port=2456, steam_id=xxx, stv_port=None, stv_name=None, keywords='0.147.3', game_id=xxx, ping=0.10999999998603016)

This would not only ensure that the server process is running, but also that the server is online and can respond to queries. Not sure @lloesche if you'd want additional packages (e.g., Python) in your container for this or would rather use something else.
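Wrapped as a health check, that query could look roughly like the sketch below. It assumes the third-party python-a2s package is installed; `query_port_healthy` and its `query` parameter are names invented for this example, and the parameter exists only so the logic can be exercised without a live server.

```python
import socket


def query_port_healthy(address=("localhost", 2457), timeout=3.0, query=None):
    """Return True if the server answers an A2S_INFO query within the timeout."""
    if query is None:
        import a2s  # third-party: python-a2s (assumed to be installed)
        query = a2s.info
    try:
        query(address, timeout=timeout)
    except Exception:
        # For a probe, a timeout or a malformed reply both count as "not healthy".
        return False
    return True
```

A script used as a container health check would exit with `0 if query_port_healthy() else 1`.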

lloesche commented 3 years ago

@sisu4u that's pretty neat. Python is already installed inside the container. Although any health check could just run from outside the container since 2457 is publicly available.

sisu4u commented 3 years ago

@lloesche yes, however for docker-compose/docker swarm there are no external health checks: https://docs.docker.com/compose/compose-file/compose-file-v3/#healthcheck Therefore, the health-check must be executed inside the container.

Kubernetes can do external health checks (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#tcp-probes), but only over HTTP or TCP, not UDP. Alternatively, as with docker-compose, commands can be executed inside the container and their exit code is evaluated.

If you like I can create a PR using that script.

lloesche commented 3 years ago

@sisu4u ok but those can just check the status file, right? That file has information that the query port can't provide, like when the server is bootstrapping/starting/stopping/etc.

I just added a script in #188 that allows you to run a status webserver when STATUS_HTTP=true and creates a status.json with all the useful information from that query port. Some fields contain nonsense (like maxplayers: 64) so I left them out and some are just empty, like the player names. The idea is that the json has relatively up-to-date information (it updates every 10s) and the user can add whatever html/css/js to the STATUS_HTTP_HTDOCS directory that reads that json. Or simply read the json only. It contains only info that's publicly available on the query port but optionally a STATUS_HTTP_CONF could contain a path to a busybox httpd.conf and limit access by ip network or login/password.

lloesche commented 3 years ago

I added a section https://github.com/lloesche/valheim-server-docker#status-web-server to the README with detailed information.

sisu4u commented 3 years ago

Looks good! The only thing a proper health check (at least for Kubernetes) would require is that, in case of an error, an HTTP error status code such as 503 is returned. The JSON output can stay as it is.

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure. Source: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

For liveness probes inside the container (e.g., using curl) the exit code is being used which can be easily achieved using curl --fail. This is relevant for docker-compose / docker-swarm setups.

Some containers with services have dedicated /health endpoints where the framework checks itself and its components for healthiness (e.g., connections to the database, mailserver, redis, ... are working - or not). Therefore, I like the idea of checking the 2457 port and whether the valheim server is able to respond as it might indicate configuration problems or other things which would not occur by only checking that the server process is running. Of course this does not check for problems with the FW or NAT or anything related but this is out of scope here anyways.

lloesche commented 3 years ago

> 503 must be returned

That's not gonna happen. It's serving static content and the busybox httpd has no custom status code support. But like I mentioned above, those liveness probes can just check that the content of /var/run/valheim-server.status is "running". Plus that's low cost open()/read() operations. Compared to the whole http get chain.

Or if they really wanted to use the status.json (which I don't know why you would) a test "$(jq '.error' /opt/valheim/htdocs/status.json)" = null would do the same.
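The same test on status.json can be sketched in Python for setups that don't have jq at hand. The only thing it relies on is the `error` field mentioned above; the function name `status_json_ok` is made up for this example.

```python
import json


def status_json_ok(text):
    """Return True if the document parses and its "error" field is null or absent."""
    try:
        data = json.loads(text)
    except ValueError:
        return False  # empty or half-written file counts as unhealthy
    return isinstance(data, dict) and data.get("error") is None
```

A probe would read /opt/valheim/htdocs/status.json, pass its content to this function, and exit non-zero when it returns False.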

lloesche commented 3 years ago

> which would not occur by only checking that the server process is running

If /var/run/valheim-server.status contains running it means that it is bound to UDP port 2456. Not just that the server process is running.

deedoubledub commented 3 years ago

It appears that /usr/local/bin/valheim-status always returns "error": "timeout('timed out')" unless SERVER_PUBLIC is set to true.

lloesche commented 3 years ago

> It appears that /usr/local/bin/valheim-status always returns "error": "timeout('timed out')" unless SERVER_PUBLIC is set to true.

Interesting. I'll add that to the documentation.

lloesche commented 3 years ago

@deedoubledub again thanks for the heads up. I added that info in https://github.com/lloesche/valheim-server-docker/commit/94178f7d6e3dcd5a6c720633e5da6fb9f46b3757 and brought back the old way of detecting connected players for private servers.

I think we can build something for private servers though even without the server answering to Steam server queries. Sort of like we already do for /var/run/valheim-server.status where we get information from the OS to find out more about the server's readiness. It won't be as extensive as the public server status but I feel we can do better than "timeout" 🙂

Addyvan commented 3 years ago

As for running on kubernetes, a simple sidecar with the status file mounted in using a shared emptyDir volume could work?

In a k8s context, Prometheus node exporters can provide info on resource utilization out of the box.

edit: a server inside the main container works fine as well though actually.

lloesche commented 3 years ago

In PR #205 I've added a simple log filter. It replaces the current grep -v based one.

I figured with every line of log output passing this filter we can get a pretty good picture of the server's status. I'm generally not a fan of log parsing as it'll just break when the devs decide to change the format. But I feel like we currently have to track their development closely anyways and given the lack of a proper API this might be the most reliable method of getting an accurate server status for both public and private servers.

So while the first version in this PR is only for removing unwanted log lines, the plan is to use it to actually understand what's going on in the server and create a status from it.

lloesche commented 3 years ago

So turns out the Valheim server logs are pretty bad for the purpose of parsing connected player status. We could get the number of connected players from it but not their names even though the name is in the log.

On connect we're getting

03/12/2021 13:42:48: Got connection SteamID 76561197987220805
03/12/2021 13:42:49: Got handshake from client 76561197987220805

And related but not in any way identifiably connected

03/12/2021 13:43:17: Got character ZDOID from Luu : 1245460760:1

And on disconnect we get

03/12/2021 13:43:57: Closing socket 76561197987220805

So we could track the Steam ID for connecting/disconnecting. But there's no way to track the character name as none of the lines in the log connect the Steam ID with the character name or ID. With our human minds we of course know that when the two log lines are close to each other they are probably related. But assuming that there could be multiple players connecting at the same time this fuzzy logic would easily break. So there's just no reliable way of figuring out which players are currently connected from the logs. Only how many and what their Steam IDs are 🤔
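To illustrate what is recoverable, here is a minimal sketch that tracks connected Steam IDs from the two line types shown above. It only assumes the log format quoted in this comment; the function name `connected_steam_ids` is invented for the example, and it makes no attempt to map IDs to character names.

```python
import re

# Patterns taken from the log lines quoted above.
CONNECT = re.compile(r"Got connection SteamID (\d+)")
DISCONNECT = re.compile(r"Closing socket (\d+)")


def connected_steam_ids(lines):
    """Replay log lines and return the set of Steam IDs still connected."""
    connected = set()
    for line in lines:
        match = CONNECT.search(line)
        if match:
            connected.add(match.group(1))
            continue
        match = DISCONNECT.search(line)
        if match:
            connected.discard(match.group(1))
    return connected
```

The number of connected players is then just `len(connected_steam_ids(log_lines))`.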

lloesche commented 3 years ago

With https://github.com/lloesche/valheim-server-docker/commit/f7e207da8c4618d1ddd464322380fad9c35d9a34 we now have event hooks for log filters. So you can now implement arbitrary status information whenever somebody connects or disconnects or for anything else that happens in the server log.

setagana commented 3 years ago

> > 503 must be returned
>
> That's not gonna happen. It's serving static content and the busybox httpd has no custom status code support. But like I mentioned above, those liveness probes can just check that the content of /var/run/valheim-server.status is "running". Plus that's low cost open()/read() operations. Compared to the whole http get chain.
>
> Or if they really wanted to use the status.json (which I don't know why you would) a test "$(jq '.error' /opt/valheim/htdocs/status.json)" = null would do the same.

Heya, does the httpd status server return a non-200 code in the case of an error? Or do you have to inspect the payload to determine if there was an error?

I'm asking because I'm writing a terraform script to get this up and running in Azure Container Instances. To get this going I need to be able to make an http request to the container IP and get a 200 back if the service is healthy (can explain in more detail if you're interested).

lloesche commented 3 years ago

The httpd status server is busybox's httpd. It only serves static content, so to it status.json is just a text file. It has no application awareness and will always return 200, or 404 if status.json hasn't been generated yet.
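For platforms that can only probe over HTTP, one possible workaround is a tiny separate endpoint that translates the status file into a status code. The sketch below is not part of the image and is not what the built-in status server does; it assumes the /var/run/valheim-server.status file described earlier, and the names `make_handler` and `serve` as well as port 8080 are arbitrary choices for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(status_file):
    """Build a handler that answers 200 when status_file says "running", else 503."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                with open(status_file) as handle:
                    status = handle.read().strip()
            except OSError:
                status = "unknown"  # file missing or unreadable
            body = (status + "\n").encode()
            self.send_response(200 if status == "running" else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep probe traffic out of the container log

    return HealthHandler


def serve(status_file="/var/run/valheim-server.status", port=8080):
    """Serve the health endpoint until interrupted (port is a hypothetical choice)."""
    HTTPServer(("", port), make_handler(status_file)).serve_forever()
```

Calling `serve()` from a small script started alongside the server would give an HTTP probe something to hit: 200 while the file says `running`, 503 otherwise.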

lloesche / valheim-server-docker

Healthchecks options for monitoring #170