rystaf / mlmym

a familiar desktop experience for lemmy
GNU Affero General Public License v3.0

Health check #88

Closed by dvanderveer 9 months ago

dvanderveer commented 9 months ago

Problem

Lemmy.world runs multiple alternate UIs, including mlmym. We recently had an outage for alt UIs due to a misconfiguration on our server, which went uncaught longer than we'd like. While we do have monitoring and alerting for alt UIs, a Docker health check would provide more immediate feedback to admins during site maintenance.

Implementation Notes

A basic health check using curl or wget doesn't seem appropriate for a couple of reasons:

Instead, this PR uses Go's html.Parse (from the golang.org/x/net/html package) to parse the index page.
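As a rough illustration of inspecting the index page rather than only fetching it, here is a minimal sketch. It deliberately swaps html.Parse for a plain substring search so that it needs only the Go standard library, and it is not the code from this PR. The marker string is taken from the error text quoted under Testing Done; the default URL (container port 8080, as in the docker run commands) and the names pageLooksHealthy and checkPage are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// errorMarker is the text the stub UI is reported to show when the
// Lemmy site cannot be reached (assumed to appear verbatim in the HTML).
const errorMarker = "unable to retrieve site"

// pageLooksHealthy reports whether the page body is free of the error marker.
// The comparison is case-insensitive.
func pageLooksHealthy(body string) bool {
	return !strings.Contains(strings.ToLower(body), errorMarker)
}

// checkPage fetches url and returns an error if the request fails
// or the body contains the error marker.
func checkPage(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Read at most 1 MiB so a misbehaving server cannot stall the check.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return err
	}
	if !pageLooksHealthy(string(body)) {
		return errors.New("index page contains the error marker")
	}
	return nil
}

func main() {
	// Hypothetical default; pass a different URL as the first argument.
	url := "http://localhost:8080/"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	if err := checkPage(url); err != nil {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1) // Docker treats exit code 1 as unhealthy
	}
}
```

A substring search is cruder than walking a parsed HTML tree, since it would also match the marker text if it appeared inside ordinary page content.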

Solution

Add a basic health checker to the container image. The health checker returns exit code 1 in any of the following scenarios:

go.mod and go.sum were updated by go mod tidy.

Testing Done

Successfully built the docker container image, then launched containers for a nonexistent Lemmy site and a known-good Lemmy site. Confirmed that the health checks reported correctly for each container:

$ docker buildx build -t dvmlmym .
[+] Building 7.1s (23/23) FINISHED                                                                                   docker:default
 => [internal] load .dockerignore                                                                                              0.0s
 => => transferring context: 2B                                                                                                0.0s
 => [internal] load build definition from Dockerfile                                                                           0.0s
 => => transferring dockerfile: 795B                                                                                           0.0s
 => [internal] load metadata for docker.io/library/debian:bullseye-slim                                                        0.2s
 => [internal] load metadata for docker.io/library/golang:1.20-bullseye                                                        0.2s
 => [builder 1/9] FROM docker.io/library/golang:1.20-bullseye@sha256:c4aae7dbd205196eef1c2ac4dd2a8d576b72556fa6687a8b67c3f4a2  0.0s
 => [stage-1 1/8] FROM docker.io/library/debian:bullseye-slim@sha256:c618be84fc82aa8ba203abbb07218410b0f5b3c7cb6b4e7248fda778  0.0s
 => [internal] load build context                                                                                              0.2s
 => => transferring context: 8.61kB                                                                                            0.2s
 => CACHED [builder 2/9] RUN git config --global --add safe.directory /app                                                     0.0s
 => CACHED [builder 3/9] WORKDIR /app                                                                                          0.0s
 => CACHED [builder 4/9] COPY go.* ./                                                                                          0.0s
 => CACHED [builder 5/9] RUN go mod download                                                                                   0.0s
 => [builder 6/9] COPY . ./                                                                                                    0.0s
 => [builder 7/9] RUN git describe --tag > VERSION                                                                             0.2s
 => [builder 8/9] RUN go build -v -o mlmym                                                                                     5.7s
 => [builder 9/9] RUN go build -v -o health-check ./healthcheck                                                                0.6s
 => CACHED [stage-1 2/8] WORKDIR /app                                                                                          0.0s
 => CACHED [stage-1 3/8] RUN set -x && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y     ca-certificate  0.0s
 => CACHED [stage-1 4/8] COPY --from=builder /app/mlmym /app/mlmym                                                             0.0s
 => CACHED [stage-1 5/8] COPY --from=builder /app/templates /app/templates                                                     0.0s
 => CACHED [stage-1 6/8] COPY --from=builder /app/public /app/public                                                           0.0s
 => CACHED [stage-1 7/8] COPY --from=builder /app/VERSION /app/VERSION                                                         0.0s
 => CACHED [stage-1 8/8] COPY --from=builder /app/health-check /app/health-check                                               0.0s
 => exporting to image                                                                                                         0.0s
 => => exporting layers                                                                                                        0.0s
 => => writing image sha256:111256d44c958aaec2093d37d460fafab2d072da19c6d76934f971d0ae5c735a                                   0.0s
 => => naming to docker.io/library/dvmlmym
$ docker run -d -e LEMMY_DOMAIN=doesnt.exist -p 8080:8080 --name mlmym_bad dvmlmym
096be3dde8f5f095d00ed8812819405d1c4fdf7b3925a61a4629898271c3b74b
$ docker run -d -e LEMMY_DOMAIN=lemmy.world -p 8081:8080 --name mlmym_good dvmlmym
f816087d370b3c7cf2722e88502b19128ec03832e53ba4134dbb19b05c05b4a6
$ docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED          STATUS                      PORTS                                       NAMES
f816087d370b   dvmlmym   "./mlmym --addr 0.0.…"   7 seconds ago    Up 7 seconds (healthy)      0.0.0.0:8081->8080/tcp, :::8081->8080/tcp   mlmym_good
096be3dde8f5   dvmlmym   "./mlmym --addr 0.0.…"   23 seconds ago   Up 23 seconds (unhealthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   mlmym_bad

Also confirmed that both containers loaded as expected in a browser, with the "good" container showing lemmy.world and the "bad" container showing the stub UI with "unable to retrieve site" in red text.

rystaf commented 9 months ago

I would support a pull request that added curl to the docker image. It should be sufficient to rely on the status code of the response. If you're not seeing a 500 status with the "unable to retrieve site" error then that would be a bug I'd like to see fixed. If you have any evidence of that happening, please share.

As for the healthcheck, I would recommend using your docker-compose.yml to configure it. This way it's an opt-in extra call to the API and also easier to configure the target URL for those running in multi-instance mode.
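For comparison, the status-code check suggested above can be sketched in Go. This is only an illustration of the logic that curl --fail applies (any response status of 400 or above counts as a failure), not something proposed in this thread. The HEALTHCHECK_URL environment variable, the default URL, and the names statusHealthy and checkStatus are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// statusHealthy mirrors curl --fail: a status of 400 or above is a failure.
func statusHealthy(code int) bool {
	return code < 400
}

// checkStatus fetches url and returns an error if the request fails
// or the final response status is 400 or above.
func checkStatus(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if !statusHealthy(resp.StatusCode) {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// HEALTHCHECK_URL is a hypothetical variable that makes the target
	// configurable, for example when running in multi-instance mode.
	url := os.Getenv("HEALTHCHECK_URL")
	if url == "" {
		url = "http://localhost:8080/" // assumed default: container port 8080
	}
	if err := checkStatus(url); err != nil {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1) // Docker treats exit code 1 as unhealthy
	}
}
```

With the 500 response described above for an unreachable Lemmy site, this check exits with code 1, which is the same outcome curl --fail would give from a compose healthcheck entry.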

dvanderveer commented 9 months ago

I've confirmed that the site does indeed return error 500 in the unreachable Lemmy site scenario. I must have mixed up mlmym test results with one of the other alt UI containers I was working on. Sorry for the bunk PR! I'll close this one and open a new PR that adds curl to the container for healthcheck purposes.