research-software-directory / RSD-as-a-service

This repo contains the new RSD-as-a-service implementation
https://research.software

auth container unresponsive after 60s of no connections #1073

Closed cmeessen closed 8 months ago

cmeessen commented 8 months ago

We are currently experiencing problems with the auth container on our staging server. The container stops accepting connections after about 60s without connection attempts. The problem first occurred while testing v2.1.1, after Docker was updated on the server. The following packages on our staging server trigger the bug when upgraded:

$ apt list --upgradable
Listing... Done
containerd.io/jammy 1.6.26-1 amd64 [upgradable from: 1.6.21-1]
cryptsetup-bin/jammy-updates 2:2.4.3-1ubuntu1.2 amd64 [upgradable from: 2:2.4.3-1ubuntu1.1]
cryptsetup-initramfs/jammy-updates 2:2.4.3-1ubuntu1.2 all [upgradable from: 2:2.4.3-1ubuntu1.1]
cryptsetup/jammy-updates 2:2.4.3-1ubuntu1.2 amd64 [upgradable from: 2:2.4.3-1ubuntu1.1]
docker-buildx-plugin/jammy 0.11.2-1~ubuntu.22.04~jammy amd64 [upgradable from: 0.10.5-1~ubuntu.22.04~jammy]
docker-ce-cli/jammy 5:24.0.7-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.2-1~ubuntu.22.04~jammy]
docker-ce-rootless-extras/jammy 5:24.0.7-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.2-1~ubuntu.22.04~jammy]
docker-ce/jammy 5:24.0.7-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.2-1~ubuntu.22.04~jammy]
docker-compose-plugin/jammy 2.21.0-1~ubuntu.22.04~jammy amd64 [upgradable from: 2.18.1-1~ubuntu.22.04~jammy]
gitlab-runner/jammy 16.6.1 amd64 [upgradable from: 16.0.1]
libcryptsetup12/jammy-updates 2:2.4.3-1ubuntu1.2 amd64 [upgradable from: 2:2.4.3-1ubuntu1.1]
telegraf/unknown 1.29.0-1 amd64 [upgradable from: 1.26.3-1]

Some more observations

Since the problem occurs with v2.1.1, it appears to be related to the combination of the Docker version and the upgraded Java version inside the auth container.

We have not yet found a proper solution, but adding a healthcheck to the auth container keeps it alive:

    restart: unless-stopped
    healthcheck:
      # Workaround to keep auth alive
      test: "curl --connect-timeout 0.25 --max-time 0.5 --fail --silent --no-keepalive -H 'Connection: close' --output /dev/null http://localhost:7000 || bash -c 'kill -s 15 $(pidof java) && (sleep 10; kill -s 9 $(pidof java))'"
      interval: 10s
      timeout: 2s
      retries: 1

It checks every 10s whether the container accepts connections and, if it does not, kills the java process. Docker then restarts the container according to the restart policy.
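To observe the failure without the healthcheck masking it, a small probe along these lines might help. It is only a sketch: the class name `IdleProbe`, the host `localhost`, and the idle intervals are assumptions for illustration, and port 7000 is taken from the healthcheck command, so the probe would have to run somewhere that port is reachable (for example inside the container). Note that a successful TCP connect only shows that the handshake completed, not that the service answers HTTP requests.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of RSD: probes a TCP port after increasing idle periods.
public class IdleProbe {

    // Returns true if a TCP connection to host:port is established within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections as well as connect timeouts.
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = "localhost"; // assumption: probe runs where the auth port is reachable
        int port = 7000;           // port used by the healthcheck command
        long[] idleSeconds = {10, 30, 60, 90}; // assumed intervals around the reported 60s

        for (long idle : idleSeconds) {
            // Stay silent for the idle period, then try a single connection.
            Thread.sleep(idle * 1000);
            boolean ok = canConnect(host, port, 500);
            System.out.printf("after %ds idle: %s%n", idle, ok ? "accepted" : "refused or timed out");
        }
    }
}
```

The healthcheck would need to be disabled while the probe runs, since its own request every 10s counts as a connection attempt.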

ewan-escience commented 8 months ago

I cannot reproduce this, which makes it hard for me to debug, although I have also seen it on someone else's VM.

Some suggestions to try to find out which component causes the bug:

If someone who experiences this bug can give me root access to a (toy) VM, I may be able to look into it further.

One idea is to change the Auth service to open a ServerSocket manually to see if the same bug happens when sending TCP requests to the container.
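A minimal sketch of that experiment could look like the following. It is not the actual Auth service code: the class name `PlainSocketProbe`, the reply text, and the default port 7000 (borrowed from the healthcheck earlier in this thread) are assumptions for illustration. The idea is to run it in a container built the same way as auth, leave it idle for more than 60s, and then check whether a plain TCP connection still gets an answer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the Auth service: a bare TCP listener without any HTTP framework.
public class PlainSocketProbe {

    // Opens a listening socket on the given port (0 lets the OS pick a free port).
    static ServerSocket open(int port) throws IOException {
        ServerSocket server = new ServerSocket();
        server.setReuseAddress(true);
        server.bind(new InetSocketAddress(port));
        return server;
    }

    // Blocks until one client connects, sends a fixed reply, then closes the connection.
    static void serveOnce(ServerSocket server) throws IOException {
        try (Socket client = server.accept();
             OutputStream out = client.getOutputStream()) {
            out.write("alive\n".getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the port used by the healthcheck command.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 7000;
        try (ServerSocket server = open(port)) {
            System.out.println("listening on port " + server.getLocalPort());
            while (true) {
                serveOnce(server);
            }
        }
    }
}
```

If this bare listener also stops answering after an idle period, the cause is more likely in the JVM, Docker, or the network layer than in the HTTP server used by the Auth service.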