Closed brandtkeller closed 3 years ago
Not sure if that's easily possible. k8s does not support UDP health checks: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#tcp-probes. An alternative would be some sort of script inside the container that checks the server connection when it's called and returns a non-zero exit code if the check fails. K8s can then use this script as a health check.
@brandtkeller I just added something for you in #173 that might be helpful. The process supervisor I'm using inside the container provides an XML-RPC API. This API can provide status information about the container's different services. The most interesting one being valheim-server I suppose. Details here https://github.com/lloesche/valheim-server-docker#supervisor
I know, XML-RPC, likely not something you can plug natively into a livenessProbe but I figured baby steps. We'll see if it is useful for this issue and if not we can always look for other options.
A simpler readinessProbe could also be to just check if /var/run/valheim-server.pid exists. If that's the case the valheim_server is running. That does not mean however it is currently accepting connections. It could be in the process of starting up. It could also be constantly crashing and in a restart loop.
Then, for pretty high confidence that the Valheim server is running and ready to accept connections, there are /proc/net/udp6 and /proc/net/udp. You could check if anything is bound on $SERVER_PORT. To find out, take the second field of each line (the local address), split it at the colon, convert the second part from hex to decimal, and compare it to $SERVER_PORT.
A one-liner could look like this
awk -v search_port=2456 'BEGIN {ec=1} {if ($1 ~ /^[0-9]/) {split($2, local_bind, ":"); port=sprintf("%d", "0x" local_bind[2]); if (port == search_port){ec=0}}} END {exit ec}' /proc/net/udp*
If this exits 0 a process is listening on search_port, and if it exits 1 no process was found that is bound to this port.
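Since Python is reported later in this thread to be available inside the container, the same check can be sketched in Python. This is only an illustration of the parsing steps described above, not something shipped with the image; the function names udp_port_is_bound and check_proc are hypothetical.

```python
from typing import Iterable


def udp_port_is_bound(lines: Iterable[str], search_port: int) -> bool:
    """Return True if any socket row of /proc/net/udp-style text is bound to search_port."""
    for line in lines:
        fields = line.split()
        # Socket rows start with a slot number such as "123:"; the header row does not.
        if len(fields) < 2 or not fields[0][:1].isdigit():
            continue
        # The local address looks like "00000000:0998": hex IP, colon, hex port.
        hex_port = fields[1].rpartition(":")[2]
        try:
            if int(hex_port, 16) == search_port:
                return True
        except ValueError:
            continue  # Not a hex port, skip the row.
    return False


def check_proc(search_port: int, paths=("/proc/net/udp", "/proc/net/udp6")) -> int:
    """Return 0 if the port is bound in any readable table, else 1 (probe-style exit code)."""
    for path in paths:
        try:
            with open(path) as table:
                if udp_port_is_bound(table, search_port):
                    return 0
        except OSError:
            continue  # For example, /proc/net/udp6 is missing when IPv6 is disabled.
    return 1
```

A probe script could end with sys.exit(check_proc(2456)) so that the exit code mirrors the awk one-liner: 0 when something is bound to the port, 1 otherwise.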
Thinking about it, I could just integrate that last check into the valheim-server start script and produce a more easily consumable status file for you.
What would be better for your use case? If I created e.g. a file /opt/valheim/status with content like starting, ready, stopping or if there were separate files to check for like /opt/valheim/status/starting, /opt/valheim/status/ready, /opt/valheim/status/stopping etc.?
Also I wonder if I should create them in /opt/valheim or /var/run, since some users might start multiple copies of the server and share a single /opt/valheim volume between them (which is completely unsupported and might lead to race conditions during update checks, but in all likelihood will work regardless).
Appreciate the time you invested in considering this enhancement. This is one of the few major items I have left for a concise orchestration of the image.
With regards to:
What would be better for your use case? If I created e.g. a file /opt/valheim/status with content like starting, ready, stopping or if there were separate files to check for like /opt/valheim/status/starting, /opt/valheim/status/ready, /opt/valheim/status/stopping etc.?
I can implement a very straightforward readiness/liveness probe that reads from the single /opt/valheim/status (or other location) but can adapt if a multi-file strategy is adopted.
@brandtkeller just merged #185
It adds an undocumented env variable SERVER_STATUS_FILE which defaults to /var/run/valheim-server.status. The file contains one of the following:
bootstrapping
starting
running
stopping
stopped
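A probe that reads this status file could be sketched as follows. This is a minimal example assuming the file holds exactly one of the words listed above, possibly followed by a newline; the function name server_is_running is hypothetical and not part of the image.

```python
from pathlib import Path

# Default location as stated above; override it if SERVER_STATUS_FILE is set differently.
DEFAULT_STATUS_FILE = "/var/run/valheim-server.status"


def server_is_running(status_file: str = DEFAULT_STATUS_FILE) -> bool:
    """Return True only when the status file exists and contains "running"."""
    try:
        return Path(status_file).read_text().strip() == "running"
    except OSError:
        # A missing or unreadable file is treated as "not ready".
        return False
```

A liveness or readiness command could then call sys.exit(0 if server_is_running() else 1), which gives Kubernetes exec probes and docker-compose healthchecks the exit code they evaluate.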
Because the Valheim server uses Valve's server query protocol, a simple probe could be sent to port 2457. There is, for example, a Python library that makes this really easy:
import a2s
a2s.info(("localhost", 2457))
# output
SourceInfo(protocol=17, server_name='xxx', map_name='xxx', folder='valheim', game='', app_id=0, player_count=0, max_players=64, bot_count=0, server_type='d', platform='l', password_protected=True, vac_enabled=False, version='1.0.0.0', edf=177, port=2456, steam_id=xxx, stv_port=None, stv_name=None, keywords='0.147.3', game_id=xxx, ping=0.10999999998603016)
This would not only ensure that the server process is running, but that the server is online and can respond to queries. Not sure @lloesche if you'd want another package(s) (i.e., python) in your container for this or use something else.
@sisu4u that's pretty neat. Python is already installed inside the container. Although any health check could just run from outside the container since 2457 is publicly available.
@lloesche yes, however for docker-compose/docker swarm there are no external health checks: https://docs.docker.com/compose/compose-file/compose-file-v3/#healthcheck Therefore, the health-check must be executed inside the container.
Kubernetes can do external health checks (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#tcp-probes), however only HTTP / TCP and not UDP. Alternatively, as with docker-compose, commands can be executed inside the container and their exit code is evaluated.
If you like I can create a PR using that script.
@sisu4u ok but those can just check the status file, right? That file has information that the query port can't provide, like when the server is bootstrapping/starting/stopping/etc.
I just added a script in #188 that allows you to run a status webserver when STATUS_HTTP=true and creates a status.json with all the useful information from that query port. Some fields contain nonsense (like maxplayers: 64) so I left them out and some are just empty, like the player names. The idea is that the json has relatively up to date information (it updates every 10s) and the user can add whatever html/css/js to the STATUS_HTTP_HTDOCS directory that reads that json. Or simply read the json only. It contains only info that's publicly available on the query port but optionally a STATUS_HTTP_CONF could contain a path to a busybox httpd.conf and limit access by ip network or login/password.
I added a section https://github.com/lloesche/valheim-server-docker#status-web-server to the README with detailed information.
Looks good! The only thing a proper health check (at least for Kubernetes) would require is that in case of an error an HTTP error status code, e.g. 503, is returned. The JSON output can stay as it is.
Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure. Source: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
For liveness probes run inside the container (e.g., using curl) the exit code is used instead, which is easy to get with curl --fail. This is relevant for docker-compose / docker-swarm setups.
Some containers with services have dedicated /health endpoints where the framework checks itself and its components for healthiness (e.g., whether connections to the database, mailserver, redis, ... are working). Therefore, I like the idea of checking port 2457 and whether the Valheim server is able to respond, as it might reveal configuration problems or other things that would not show up by only checking that the server process is running. Of course this does not check for problems with the firewall or NAT or anything related, but this is out of scope here anyways.
503 must be returned
That's not gonna happen. It's serving static content and the busybox httpd has no custom status code support. But like I mentioned above, those liveness probes can just check that the content of /var/run/valheim-server.status is "running". Plus those are low-cost open()/read() operations compared to the whole HTTP GET chain.
Or if they really wanted to use the status.json (which I don't know why you would) a test "$(jq '.error' /opt/valheim/htdocs/status.json)" = null would do the same.
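For setups without jq, the same test on status.json could be sketched in Python. This assumes, as the jq expression above implies, that the file is a JSON object whose error field is null (or absent) when the query succeeded; the function name status_json_is_healthy is hypothetical.

```python
import json


def status_json_is_healthy(path: str = "/opt/valheim/htdocs/status.json") -> bool:
    """Return True when status.json is readable and its "error" field is null or absent."""
    try:
        with open(path) as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        # Missing file or invalid JSON: treat as unhealthy.
        return False
    return isinstance(data, dict) and data.get("error") is None
```

As with the status file, a probe command would map the boolean result to exit code 0 or 1.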
which would not occur by only checking that the server process is running
If /var/run/valheim-server.status contains running it means that it is bound to UDP port 2456. Not just that the server process is running.
It appears that /usr/local/bin/valheim-status always returns "error": "timeout('timed out')" unless SERVER_PUBLIC is set to true.
Interesting. I'll add that to the documentation.
@deedoubledub again thanks for the heads up. I added that info in https://github.com/lloesche/valheim-server-docker/commit/94178f7d6e3dcd5a6c720633e5da6fb9f46b3757 and brought back the old way of detecting connected players for private servers.
I think we can build something for private servers though even without the server answering to Steam server queries. Sort of like we already do for /var/run/valheim-server.status where we get information from the OS to find out more about the server's readiness. It won't be as extensive as the public server status but I feel we can do better than "timeout" 🙂
As for running on kubernetes, a simple sidecar with the status file mounted in using a shared emptyDir volume could work?
In a k8s context, prometheus node exporters can provide info on resource util out of the box.
edit: a server inside the main container works fine as well though actually.
In PR #205 I've added a simple log filter. It replaces the current grep -v based one.
I figured with every line of log output passing this filter we can get a pretty good picture of the server's status. I'm generally not a fan of log parsing as it'll just break when the devs decide to change the format. But I feel like we currently have to track their development closely anyways and given the lack of a proper API this might be the most reliable method of getting an accurate server status for both public and private servers.
So while the first version in this PR is only for removing unwanted log lines, the plan is to use it to actually understand what's going on in the server and create a status from it.
So turns out the Valheim server logs are pretty bad for the purpose of parsing connected player status. We could get the number of connected players from it but not their names even though the name is in the log.
On connect we're getting
03/12/2021 13:42:48: Got connection SteamID 76561197987220805
03/12/2021 13:42:49: Got handshake from client 76561197987220805
And related but not in any way identifiably connected
03/12/2021 13:43:17: Got character ZDOID from Luu : 1245460760:1
And on disconnect we get
03/12/2021 13:43:57: Closing socket 76561197987220805
So we could track the Steam ID for connecting/disconnecting. But there's no way to track the character name as none of the lines in the log connect the Steam ID with the character name or ID. With our human minds we of course know that when the two log lines are close to each other they are probably related. But assuming that there could be multiple players connecting at the same time this fuzzy logic would easily break. So there's just no reliable way of figuring out which players are currently connected from the logs. Only how many and what their Steam IDs are 🤔
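The Steam ID tracking described above could be sketched like this. It only relies on the "Got connection SteamID" and "Closing socket" lines quoted in this comment; the function name connected_steam_ids is hypothetical, and the log format may change between server versions.

```python
import re
from typing import Iterable, Set

# Patterns taken from the log lines quoted above.
CONNECT = re.compile(r"Got connection SteamID (\d+)")
DISCONNECT = re.compile(r"Closing socket (\d+)")


def connected_steam_ids(log_lines: Iterable[str]) -> Set[str]:
    """Replay log lines and return the Steam IDs that are still connected."""
    connected: Set[str] = set()
    for line in log_lines:
        match = CONNECT.search(line)
        if match:
            connected.add(match.group(1))
            continue
        match = DISCONNECT.search(line)
        if match:
            # discard() ignores IDs we never saw connecting.
            connected.discard(match.group(1))
    return connected
```

The size of the returned set gives the number of connected players; as noted above, character names cannot be attached to these IDs reliably.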
With https://github.com/lloesche/valheim-server-docker/commit/f7e207da8c4618d1ddd464322380fad9c35d9a34 we now have event hooks for log filters. So you can now implement arbitrary status information whenever somebody connects or disconnects or for anything else that happens in the server log.
Heya, does the httpd status server return a non-200 code in the case of an error? Or do you have to inspect the payload to determine if there was an error?
I'm asking because I'm writing a terraform script to get this up and running in Azure Container Instances. To get this going I need to be able to make an http request to the container IP and get a 200 back if the service is healthy (can explain in more detail if you're interested).
The httpd status server is busybox's httpd. It only serves static content. As such, status.json is just a text file to it. It has no application awareness and will always return 200, or 404 if the status.json hasn't been generated yet.
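For platforms that can only probe over HTTP, one possible workaround is a tiny separate endpoint that translates the status file into a status code. The following is a sketch under assumptions, not something the image provides: the port 8080 and the names http_code_for, HealthHandler and serve are hypothetical, and the status file path is the default mentioned earlier in this thread.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Default status file location mentioned earlier in this thread.
STATUS_FILE = "/var/run/valheim-server.status"


def http_code_for(status_text: str) -> int:
    """Map the status file content to an HTTP code: 200 when running, 503 otherwise."""
    return 200 if status_text.strip() == "running" else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        try:
            status = Path(STATUS_FILE).read_text()
        except OSError:
            status = "unknown"  # Missing file: report as unhealthy.
        body = status.strip().encode()
        self.send_response(http_code_for(status))
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the health endpoint until interrupted (port 8080 is an arbitrary choice)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling serve() inside the container, or in a sidecar that shares the status file, would give an HTTP prober a 200 only while the status file says running.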
Less of a request and more of general knowledge gathering (if available).
I am writing my own orchestration for this image on kubernetes following some design standards that I feel are important/different from other implementations.
As such, I was curious if there are any exposed options for monitoring the health of the server?
These may be actions that establish - "Has the game server fully started?", "Is the game server still running?"
Any help would be appreciated! Thanks