matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

Synapse keeps being killed due to healthcheck failures #13628

Closed IeuanK closed 2 years ago

IeuanK commented 2 years ago

Description

I am running a docker stack with the official Synapse image in it.

Absolutely randomly, Synaps just shuts down. See log; it literally just receives a SIGTERM and decides to shut itself down. Everything up to where the copied log starts is just normal Synapse logging

It's driving me absolutely nuts, why would it possibly do this? Is this a new 'feature' in 1.65?

Steps to reproduce

Homeserver

My own homeserver

Synapse Version

v1.65.0

Installation Method

Docker (matrixdotorg/synapse)

Platform

Docker on debian

Relevant log output

matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:30,627 - synapse.storage.databases.main.event_push_actions - 969 - INFO - rotate_notifs-62 - Rotating notifications
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:30,628 - synapse.storage.databases.main.event_push_actions - 1130 - INFO - rotate_notifs-62 - Rotating notifications up to: 1326650
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:30,631 - synapse.storage.databases.main.event_push_actions - 1218 - INFO - rotate_notifs-62 - Rotating notifications, handling 0 rows
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:30,661 - synapse.storage.databases.main.event_push_actions - 1293 - INFO - rotate_notifs-62 - Rotating notifications, deleted 0 push actions
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:30,951 - synapse.util.caches.lrucache - 212 - INFO - LruCache._expire_old_entries-62 - Dropped 0 items from caches
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,819 - twisted - 274 - INFO - sentinel - Received SIGTERM, shutting down.
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,820 - synapse.storage.databases.main.lock - 92 - INFO - LockStore._on_shutdown-0 - Dropping held locks due to shutdown
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,821 - synapse.storage.databases.main.lock - 101 - INFO - LockStore._on_shutdown-0 - Dropped locks due to shutdown
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,821 - synapse.handlers.presence - 766 - INFO - presence.on_shutdown-0 - Performing _on_shutdown. Persisting 7 unpersisted changes
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,822 - synapse.app._base - 492 - INFO - sentinel - Shutting down...
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,912 - synapse.handlers.presence - 779 - INFO - presence.on_shutdown-0 - Finished _on_shutdown
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,913 - synapse.http.site - 362 - INFO - GET-553 - Connection from client lost before response was sent
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,914 - synapse.http.site - 362 - INFO - GET-556 - Connection from client lost before response was sent
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,914 - twisted - 274 - INFO - sentinel - (TCP Port 8008 Closed)
matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,919 - twisted - 274 - INFO - sentinel - Main loop terminated.

Anything else that would be useful to know?

No response

DMRobertson commented 2 years ago

Is this a new 'feature' in 1.65?

Do you see his behaviour when running 1.64?

Synaps just shuts down. See log; it literally just receives a SIGTERM and decides to shut itself down.

This is not a bug. Synapse gracefully shuts itself down when requested by a SIGTERM. (We could choose to ignore the signal, but that would make Synapse a bad citizen.)

You need to work out what's sending the SIGTERM and why it's doing so. I would guess that there is some kind of resource contention (RAM?) and a monitoring process terminates Synapse to reclaim those resources.

I would recommend setting up metrics monitoring with prometheus/grafana. That might offer clues as to why Synapse is asked to terminate.

I am running a docker stack with the official Synapse image in it.

What does "docker stack" mean---docker-compose? Are there any relevant logs from docker itself rather than the container which might be pertinent?

IeuanK commented 2 years ago

@DMRobertson

Do you see his behaviour when running 1.64?

I upgraded from v1.50.2 according to the docker images I have pulled, the issue did not appear there.

You need to work out what's sending the SIGTERM and why it's doing so. I would guess that there is some kind of resource contention (RAM?) and a monitoring process terminates Synapse to reclaim those resources.

The system has 48 cores and 173GB memory, I have never seen more than ~300% CPU usage at any given time, and it usually has over 100GB memory free. There is nothing actively pruning busy containers, only Nginx Proxy Manager, hif1/heisenbridge and mautrix/whatsapp have any way to connect to Synapse. No other containers, even ones that have much more resource usage, are getting shut down.
Whatever it is has to be either inside the matrixdotorg/synapse image, or one of the two bridges, but those show no logging besides the ordinary events either.

What does "docker stack" mean---docker-compose? Are there any relevant logs from docker itself rather than the container which might be pertinent?

I use docker swarm stacks, so yes docker compose. Below is my docker-compose file for this stack. Note the nginxpm network that is used solely for the reverse proxy to the Synapse server name. Docker itself keeps metrics that all look perfectly fine, Synapse uses so few resources I can't imagine that it's bottlenecked or being cut off in any way.

version: '3.7'
services:
    synapse:
        image: matrixdotorg/synapse:v1.65.0
        deploy:
            replicas: 1
        environment:
            - SYNAPSE_SERVER_NAME=<snip>
            - SYNAPSE_REPORT_STATS=yes
        volumes:
            - /vmdata/services/matrix_synapse/data:/data
            - /vmdata/services/matrix_synapse:/synapse
            - /vmdata/services/matrix_synapse/data/py/shared_secret_authenticator.py:/usr/local/lib/python3.9/site-packages/shared_secret_authenticator.py
        networks:
            - default
            - nginxpm
    whatsapp:
        image: dock.mau.dev/mautrix/whatsapp:latest
        deploy:
            replicas: 1
        volumes:
            - /vmdata/services/matrix_synapse:/synapse
            - /vmdata/services/matrix_synapse/mautrix-whatsapp:/data
        networks:
            - default
    heisenbridge:
        image: hif1/heisenbridge:1.14
        deploy:
            replicas: 1
        volumes:
            - /vmdata/services/matrix_synapse:/synapse
            - /vmdata/services/matrix_synapse/heisenbridge:/data
        entrypoint: /data/heisenbridge-init.sh
        networks:
            - default

networks:
    nginxpm:
        external: true
    common:
        external: true
DMRobertson commented 2 years ago

Do you see his behaviour when running 1.64?

I upgraded from v1.50.2 according to the docker images I have pulled, the issue did not appear there.

That's quite a wide version range; it'd be great if you could narrow that down by trying to reproduce the issue with intermediate releases.

I note that the image has a periodic HEALTHCHECK. I wonder if that is failing to respond all of a sudden? I can't find a solid reference for what docker will do if a HEALTHCHECK fails, but it sounds reasonable that docker (or some layer above it) would to stop and restart the container. Can you see anything in Synapse's logs which indicate requests to /heatlh?

https://github.com/matrix-org/synapse/blob/738c11729afcb978b7d9551af3c4cf7990163d07/docker/Dockerfile#L169-L170

Additionally, the Dockerfile docs say

To help debug failing probes, any output text (UTF-8 encoded) that the command writes on stdout or stderr will be stored in the health status and can be queried with docker inspect. Such output should be kept short (only the first 4096 bytes are stored currently).

When the health status of a container changes, a health_status event is generated with the new status.

I suggest looking for evidence of failing healthchecks. In particular docker events --since TIMESTAMP looks like it'll be useful, if you know roughly when synapse received one of these SIGTERMs.

DMRobertson commented 2 years ago

Can you see anything in Synapse's logs which indicate requests to /heatlh?

Actually, you probably won't find anything here unless you have debug logging enabled:

https://github.com/matrix-org/synapse/blob/b93259082c7d8d3fe8376a646e130213d90069dc/synapse/rest/health.py#L23-L24

So it's probably best to concentrate on docker inspect or docker events here.

IeuanK commented 2 years ago

The Synapse log does mention 'health': matrix_synapse.1.cwo8bra2rt2u@beast | 2022-08-25 10:07:49,350 - synapse.util.httpresourcetree - 47 - INFO - sentinel - Attaching <synapse.rest.health.HealthResource object at 0x7f350a5c9f10> to path b'/health'

I ran docker events | grep synapse to monitor for any events from this service, which does show a continuous cycle of the events below. Seems like this is just the health check, which is working fine. I added some extra line breaks and indents for readability:

2022-08-25T13:38:25.224044162+02:00 
    container exec_create: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (
        com.docker.stack.namespace=matrix,
        com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0,
        com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0,
        com.docker.swarm.service.name=matrix_synapse,
        com.docker.swarm.task=,
        com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v,
        com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        execID=35811821f96fda310e55e8eff3d30547f4b3ea7eef7bcd79de1275a91e6cc7cf,
        gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07,
        image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889,
        name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md,
        org.opencontainers.image.licenses=Apache-2.0,
        org.opencontainers.image.source=https://github.com/matrix-org/synapse.git,
        org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse
    )

2022-08-25T13:38:25.224381317+02:00 
    container exec_start: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (
        com.docker.stack.namespace=matrix,
        com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0,
        com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0,
        com.docker.swarm.service.name=matrix_synapse,
        com.docker.swarm.task=,
        com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v,
        com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        execID=35811821f96fda310e55e8eff3d30547f4b3ea7eef7bcd79de1275a91e6cc7cf,
        gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07,
        image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889,
        name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md,
        org.opencontainers.image.licenses=Apache-2.0,
        org.opencontainers.image.source=https://github.com/matrix-org/synapse.git,
        org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse
    )

2022-08-25T13:38:25.378195984+02:00 
    container exec_die f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (
        com.docker.stack.namespace=matrix,
        com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0,
        com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0,
        com.docker.swarm.service.name=matrix_synapse,
        com.docker.swarm.task=,
        com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v,
        com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        execID=35811821f96fda310e55e8eff3d30547f4b3ea7eef7bcd79de1275a91e6cc7cf,
        exitCode=0,
        gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07,
        image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889,
        name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v,
        org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md,
        org.opencontainers.image.licenses=Apache-2.0,
        org.opencontainers.image.source=https://github.com/matrix-org/synapse.git,
        org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse
    )
IeuanK commented 2 years ago

Important to note in the above that the container itself does not reboot, the create, start and die are just for the health check process. The task ID (container ID) stays the same.

IeuanK commented 2 years ago

If I exec into the synapse container and manually run curl http://localhost:8008/health it just returns OK, so that is also working from the inside

DMRobertson commented 2 years ago

You've demonstrated that the healthcheck is currently succeeding. What we want to know is if the healthcheck failed just before Synapse received a SIGTERM.

From your logs:

matrix_synapse.1.mybcsb5py8xx@beast    | 2022-08-25 05:31:39,819 - twisted - 274 - INFO - sentinel - Received SIGTERM, shutting down.

I suggest using docker events --since '2022-08-25T05:31:00'.

IeuanK commented 2 years ago

I sadly don't seem to be getting any events for the task that was shut down, only the current one (cwo8bra2rt2u)

cwo8bra2rt2u   matrix_synapse.1        matrixdotorg/synapse:v1.65.0           beast     Running         Running 2 hours ago
9k6elyuqfm12    \_ matrix_synapse.1    matrixdotorg/synapse:v1.65.0           beast     Shutdown        Complete 2 hours ago
css4r2i9s8he    \_ matrix_synapse.1    matrixdotorg/synapse:v1.65.0           beast     Shutdown        Complete 2 hours ago
p4owu5h29kh2    \_ matrix_synapse.1    matrixdotorg/synapse:v1.65.0           beast     Shutdown        Complete 5 hours ago
y26oljfjpyin    \_ matrix_synapse.1    matrixdotorg/synapse:v1.65.0           beast     Shutdown        Complete 5 hours ago

These are the current and shut down tasks, I've tried various timestamps and checked for the four IDs that were shut down, but no dice. I'll start up a manual log of the events to a file, then check that file when it next crashes.

IeuanK commented 2 years ago

I've now confirmed that it does go 'unhealthy' and then kills itself. Also interesting is that (as shown above as well) it creates a new task, that task sticks on 'preparing' for a long time, then dies, then a second task is created that just immediately goes to 'running'.

Now to figure out what could possibly be causing it to go unhealthy, and why it happens every few hours. I guess the next step is turning on debug logging for Synapse? It looks like it hits a 10-second timeout, but I saw no spike in resource usage on the host or in docker stats.

2022-08-25T14:33:05.800870614+02:00 container exec_create: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, execID=d5042bc424362b673a67a6025e8baf16a0bf3e8c4049a8e1f6f075cd5c974b7e, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:33:05.801184313+02:00 container exec_start: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, execID=d5042bc424362b673a67a6025e8baf16a0bf3e8c4049a8e1f6f075cd5c974b7e, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:33:15.977885702+02:00 container health_status: unhealthy f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:33:18.035644396+02:00 container kill f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse, signal=15)
2022-08-25T14:33:18.484391657+02:00 container die f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, exitCode=0, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:33:47.167616606+02:00 container stop f2e8851dbbe5eac67f322f738f45e03ea392a626cc2e3235def5053065832023 (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=cwo8bra2rt2usa85btcjiw28v, com.docker.swarm.task.name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.cwo8bra2rt2usa85btcjiw28v, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:34:33.290960396+02:00 container destroy 5f35d2d2ef4bcc7d8fb1e9d7ec8540782371a5160e66c70e395b60ea3644f88c (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=y26oljfjpyinht0ceboggb4mi, com.docker.swarm.task.name=matrix_synapse.1.y26oljfjpyinht0ceboggb4mi, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.y26oljfjpyinht0ceboggb4mi, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:35:47.513400678+02:00 container create 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:35:48.499041604+02:00 container start 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:03.497398070+02:00 container exec_create: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=6294e44a13d2b49d92da7acd05e1d1ff9913811f8d1f6d61d025841fc425c579, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:03.497622599+02:00 container exec_start: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=6294e44a13d2b49d92da7acd05e1d1ff9913811f8d1f6d61d025841fc425c579, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:29.096652766+02:00 container exec_create: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=136a98975bea8924e37996960086ec8d4ef9118b4ab1e8a40cfa0efcf529f31d, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:29.096977432+02:00 container exec_start: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=136a98975bea8924e37996960086ec8d4ef9118b4ab1e8a40cfa0efcf529f31d, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:29.412375220+02:00 container exec_die 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=136a98975bea8924e37996960086ec8d4ef9118b4ab1e8a40cfa0efcf529f31d, exitCode=0, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:29.426259367+02:00 container health_status: healthy 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:44.427902100+02:00 container exec_create: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=46d4b83173fdb4226fc154652ec1bbf204a58b61c9f07bc4617cae492208bc11, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:44.428162624+02:00 container exec_start: /bin/sh -c curl -fSs http://localhost:8008/health || exit 1 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=46d4b83173fdb4226fc154652ec1bbf204a58b61c9f07bc4617cae492208bc11, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
2022-08-25T14:36:44.530425801+02:00 container exec_die 1e0de4fd9e02684a6689833d96039652a5439ba47b1385914b42307eda2e316f (com.docker.stack.namespace=matrix, com.docker.swarm.node.id=jgp56j490qx9agdu1bjb4vfk0, com.docker.swarm.service.id=0yl3gflz4k58clxpcfd8n73h0, com.docker.swarm.service.name=matrix_synapse, com.docker.swarm.task=, com.docker.swarm.task.id=gcsav6el4nr0vac71zfrxx81y, com.docker.swarm.task.name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, execID=46d4b83173fdb4226fc154652ec1bbf204a58b61c9f07bc4617cae492208bc11, exitCode=0, gitsha1=738c11729afcb978b7d9551af3c4cf7990163d07, image=matrixdotorg/synapse:v1.65.0@sha256:edea7f61a110c2f0eb9aa7821d27f47d20f62c0f59364aea90c5d3381625b889, name=matrix_synapse.1.gcsav6el4nr0vac71zfrxx81y, org.opencontainers.image.documentation=https://github.com/matrix-org/synapse/blob/master/docker/README.md, org.opencontainers.image.licenses=Apache-2.0, org.opencontainers.image.source=https://github.com/matrix-org/synapse.git, org.opencontainers.image.url=https://matrix.org/docs/projects/server/synapse)
DMRobertson commented 2 years ago

I've now confirmed that it does go 'unhealthy' and then kills itself.

To be explicit: Synapse doesn't send the SIGTERM: Synapse receives it, presumably from some part of your docker stack.

I guess the next step is turning on debug logging for Synapse?

That won't hurt, but I think it'll be easier to see what's going on from prometheus metrics because that gives you graphs over time.

Database contention might be a problem. I assume you're using SQLite rather than postgres as your database? (Some servers are bottlenecked by SQLite's write throughput and we strongly recommend using postgres for any "serious" deployments.)

IeuanK commented 2 years ago

I will set up Prometheus to see if that shows anything relevant, been meaning to do so anyway for monitoring my other containers so now's as good a time as any.

Database contention might be a problem. I assume you're using SQLite rather than postgres as your database? (Some servers are bottlenecked by SQLite's write throughput and we strongly recommend using postgres for any "serious" deployments.)

I am in fact using postgres, per the docs' recommendation. The official postgres docker image runs in a separate swarm stack. I seem to have accidentally cut the network out when redacting the docker-compose I posted, but there is a network connection to that database that's working as expected. No significant resource usage for that container either, so can't imagine that'd be causing any issues here.

richvdh commented 2 years ago

To be honest I think your infrastructure is being a bit aggressive by killing Synapse on a single failed health check. Alternatively, maybe you should override the timeout for the healthcheck.

It's not unknown for Synapse to stop responding for 5 seconds, particularly when python garbage-collects.

IeuanK commented 2 years ago

I have to admit that I'm not sure how to do that in this case.

The Dockerfile has HEALTHCHECK --start-period=5s --interval=15s --timeout=5s in it, with no retries configured. According to the Dockerfile reference it defaults to three retries.

I don't have anything set up to change this behavior, other than enabling the zfs storage driver my docker itself is all default.

I'll see if I can override the healthcheck in the docker-compose, maybe I just need to increase the timeout further?

IeuanK commented 2 years ago

I have added the following to the Synapse service:

        healthcheck:
            test: curl -fSs http://localhost:8008/health || exit 1
            interval: 30s
            retries: 5
            start_period: 10s
            timeout: 30s

Will see if this makes a difference

reivilibre commented 2 years ago

Did that sort you out?

IeuanK commented 2 years ago

I think so, all tasks in my Matrix stack have been running since I rebooted the docker daemon for something 5 days ago.

So I guess this issue can be closed, with increasing the health check timeout as the solution. Thanks @DMRobertson and @richvdh for helping me figure this out!