trampgeek / jobeinabox

The dockerfile and doc for building the Docker image JobeInABox
MIT License

Adjust HEALTHCHECK so docker can see the container has started faster #19

Closed timhunt closed 1 month ago

timhunt commented 2 months ago

Without this, Docker would report the container as (health: starting) for 5 minutes before it went to (healthy), which was not ideal.

The docs for --start-interval say "This option requires Docker Engine version 25.0 or later." (https://docs.docker.com/reference/dockerfile/#healthcheck). I don't know if that is an issue? Docker 25 was released in January this year: https://docs.docker.com/engine/release-notes/25.0/#2500

trampgeek commented 1 month ago

Hi Tim

Thanks for suggesting this. I've never used the healthcheck functionality - it was added by a collaborator and I've never seriously thought about it. We don't use Docker for our production Jobe servers.

I'm not happy to have changes that would make the Dockerfile incompatible with versions prior to Docker 25, as it's very recent and I suspect most users will have old versions.

I've come up with a compromise - I changed the healthcheck time interval to 1m. The healthcheck takes around 370ms on my laptop, so even on a 1-core Jobe server the healthcheck overheads should still be under around 0.5%. But I've added your proposed change to the Dockerfile as a comment.

Is a 1 minute interval before JobeInABox reports that it's healthy tolerable? I don't have a compelling reason for using 1m rather than 30s, which would still be less than a (1/n) % overhead on an n-core server. I guess I just like nice round numbers.

I also looked at what the healthcheck was doing: running minimaltest.py. I'm not sure this is doing what you and other users want/expect. It's a Python program that attempts to run a simple C compile-and-run task and reports on the result. But even if the run fails, minimaltest.py doesn't generally exit with a non-zero return code (unless there's an uncaught exception), so most failures won't actually fail the healthcheck.

I've changed minimaltest.py in the Jobe repo so that it now just prints "Test passed" if the C compile-and-run succeeds (without checking the actual stdout) but prints diagnostic info and exits with a return code of 1 if the run fails. I assume this is the sort of behaviour you expect? Please clarify, as I don't know what the implications are of a failed healthcheck.
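The behaviour described above (print "Test passed" on success, print diagnostics and exit with 1 on failure) can be sketched roughly as follows. This is not the actual minimaltest.py. The URL, the helper names (`submit_c_run`, `report`, `main`) and the use of outcome code 15 for a successful run are assumptions made for illustration, based on Jobe's REST API conventions.

```python
import json
import sys
import urllib.request

# Assumptions for illustration only; none of these come from minimaltest.py itself.
JOBE_RUNS_URL = "http://localhost/jobe/index.php/restapi/runs"
OUTCOME_OK = 15  # outcome code assumed to mean "run completed successfully"

C_SOURCE = '#include <stdio.h>\nint main(void) { puts("Hello"); return 0; }\n'


def submit_c_run(url=JOBE_RUNS_URL, timeout_s=10):
    """POST a trivial C job to Jobe and return the decoded JSON result."""
    payload = {"run_spec": {
        "language_id": "c",
        "sourcefilename": "test.c",
        "sourcecode": C_SOURCE,
    }}
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8",
                 "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def report(result):
    """Print a verdict and return the exit code the healthcheck should use."""
    if result.get("outcome") == OUTCOME_OK:
        print("Test passed")
        return 0
    # Any other outcome (compile error, runtime error, overload, ...) is a failure.
    print("Test failed:", json.dumps(result, indent=2))
    return 1


def main():
    """Run the check, treating connection errors and timeouts as failures too."""
    try:
        return report(submit_c_run())
    except Exception as exc:
        print("Test failed:", exc)
        return 1

# To use this as a healthcheck command, end the script with: sys.exit(main())
```

Docker treats a healthcheck command that exits with 0 as healthy and one that exits with 1 as unhealthy, so returning an explicit code from every path is what lets a failed run actually fail the check.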

The other thing I'd like to query is the 2s timeout. Certainly under reasonable load the compile-and-run will complete in 2s but under heavy load it mightn't, even though Jobe is still running all jobs OK. Could this be a problem?

Richard

timhunt commented 1 month ago

Thanks Richard. I am hardly

I suppose one question: is C compile-and-run the best healthcheck here? Would, say, a Python test run faster and be just as good?

I think that the point of this sort of healthcheck is when you are running the Docker containers with something like Kubernetes to do auto-scaling and resilience - the controller needs a way to know when a particular container is ready for use - and to monitor that it stays healthy.
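As a rough illustration of how a build script or controller might wait on Docker's reported health status, here is a minimal polling sketch. The function names, the 120-second default and the 2-second poll period are hypothetical. The only external call is `docker inspect` with a `--format` template that reads the container's health status.

```python
import subprocess
import time


def docker_health(container):
    """Return Docker's health status for a container, e.g. 'starting' or 'healthy'."""
    completed = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


def wait_until_healthy(container, timeout_s=120, poll_s=2,
                       get_status=docker_health,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the container reports 'healthy'; return False on timeout.

    get_status, sleep and clock are injectable so the loop can be exercised
    without a running Docker daemon.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status(container) == "healthy":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a loop like this, the container cannot be seen as healthy before its first healthcheck has run and passed, which is why a long `--interval` delays anything waiting on it.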

I hit this in a different situation (which I posted about in the CodeRunner forums), trying to modify https://github.com/moodlehq/moodle-ci-runner to add CodeRunner support (by adding a new module like this one: https://github.com/moodlehq/moodle-ci-runner/blob/main/runner/main/modules/docker-solr/docker-solr.sh). The way that works relies on the healthcheck for any container passing within 2 minutes (but faster is better). Anyway, I got past it by re-building the Docker container in my build script, using sed to edit the Dockerfile first (sed -i "s|--interval=5m|--interval=5s|" /var/jobeinabox/Dockerfile). In that context 5s made sense. I guess I now need to edit that line of code to be more flexible, and not just look for 5m (I guess sed -E -i "s|--interval=[0-9]+[ms]|--interval=5s|" /var/jobeinabox/Dockerfile; the -E is needed because sed's default basic regex syntax treats + as a literal character).
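The more flexible substitution can be checked quickly in Python, whose `re` module treats `+` as "one or more" without any extra flag. This is a sketch only: the function name is hypothetical and the sample HEALTHCHECK line in the usage note is made up, not copied from the JobeInABox Dockerfile.

```python
import re

# Matches --interval=<digits><m or s>, e.g. --interval=5m, --interval=1m, --interval=30s.
INTERVAL_FLAG = re.compile(r"--interval=[0-9]+[ms]")


def set_healthcheck_interval(dockerfile_text, new_interval="5s"):
    """Replace every --interval=<N>m / --interval=<N>s with the given interval."""
    return INTERVAL_FLAG.sub(f"--interval={new_interval}", dockerfile_text)
```

For example, `set_healthcheck_interval("HEALTHCHECK --interval=1m --timeout=2s CMD true")` rewrites only the interval flag and leaves `--timeout=2s` untouched, because the pattern is anchored on the literal `--interval=` prefix.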

trampgeek commented 1 month ago

Thanks for the clarification, Tim.

Switching between C and Python makes no difference, assuming you're doing a full Jobe run. For those languages, a trivial run takes under 100 ms, so the time for the full Jobe run is dominated by overheads (running the parent Python process, opening a socket, Apache firing up the CodeIgniter framework, which runs lots of PHP to handle the REST request, etc.).