simple-framework / simple_grid_puppet_module

Central Configuration Module, implemented in Puppet, for SIMPLE-Grid (a solution for setting up Lightweight Sites for the Worldwide LHC Computing Grid)
Apache License 2.0

Disable Docker auto-updates and add restart policy #203

Closed maany closed 4 years ago

maany commented 4 years ago

Note: Addresses #171

The Issue

At present, the docker run command does not include a restart policy parameter. As a result, if the containers go down for any reason, they have to be started manually and the init.sh scripts have to be re-executed manually.

Potential reasons why containers stop:

  1. resources not appropriately allocated to the host machine
  2. updates to docker or underlying packages
  3. maintenance at the grid site (reboot of machines, power outage etc.)

Reasons 2 and 3 both involve a restart or reset of the Docker daemon. One way to keep containers running when the Docker daemon restarts is to configure the daemon with the live-restore option, as described here: https://docs.docker.com/config/containers/live-restore/ Live restore has two downsides. The containers cannot stay detached from the daemon for a long time, as that could fill the buffers that hold their log/data output. Also, if the daemon is upgraded across several releases, the containers may fail to come back up.
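For reference, a minimal sketch of how live restore could be enabled, following the linked Docker documentation. The helper name `write_live_restore_config` is hypothetical, and the function overwrites the target file, so an existing daemon.json would have to be merged by hand.

```shell
#!/usr/bin/env bash
# Sketch only: write a daemon.json that enables live-restore.
# write_live_restore_config is a hypothetical helper, not part of this module.
# The target path is a parameter so the function can be tried on a scratch file first.
write_live_restore_config() {
  local target="$1"
  mkdir -p "$(dirname "$target")"
  # Overwrites the target; merge manually if a daemon.json already exists.
  cat > "$target" <<'EOF'
{
  "live-restore": true
}
EOF
}

# Example (as root). With systemd, a reload applies the setting
# without restarting dockerd:
#   write_live_restore_config /etc/docker/daemon.json
#   systemctl reload docker
```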

I have tested how live restore works with swarm mode, and the results were not good. I had an HTCondor cluster up and running. I enabled live restore on a swarm worker (the WN) while pinging its container at its overlay network IP address from the CE container (swarm manager). I then restarted the Docker daemon on the WN, and it failed to come up. /var/log/messages shows:

Mar  5 19:33:48 simple-lc03 dockerd: time="2020-03-05T19:33:48.003271474+01:00" level=fatal msg="Error starting cluster component: --live-restore daemon configuration is incompatible with swarm mode"

The ping, of course, received no response from the container, confirming that the container was down and that live-restore did not keep it alive after the Docker daemon stopped.
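Given the incompatibility reported in the log, a small guard could check whether a node is in swarm mode before live restore is enabled on it. This is only a sketch: the function name `node_in_swarm` is hypothetical, and it treats only the `active` state reported by `docker info` as being in a swarm.

```shell
#!/usr/bin/env bash
# Sketch only: detect swarm mode before touching the live-restore setting.
# node_in_swarm is a hypothetical helper name.
node_in_swarm() {
  local state
  # LocalNodeState is "inactive" when the node is not part of a swarm.
  state="$(docker info --format '{{.Swarm.LocalNodeState}}')"
  [ "$state" = "active" ]
}

# Example:
#   if node_in_swarm; then
#     echo "swarm mode is active: do not enable live-restore on this node" >&2
#   fi
```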

Current Workaround

  1. We disable yum auto-updates for docker and ensure the yum repo for docker is absent during the start of the installation process.
  2. We add a restart-policy flag to the docker run command.
  3. We modify the init.sh to become a systemd managed service so that all the configuration comes back up anytime the containers restart.
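As an illustration of the restart-policy part of the workaround, here is a minimal sketch. The issue does not say which policy is used, so `unless-stopped` is an assumption, and the helper names, the container name and the image are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: start a container with a restart policy and read the policy back.
# Both function names are hypothetical; "unless-stopped" is an assumed policy
# (other valid values are "no", "on-failure" and "always").
run_with_restart_policy() {
  local name="$1" image="$2"
  docker run --detach --restart unless-stopped --name "$name" "$image"
}

# Print the restart policy recorded for an existing container.
restart_policy_of() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Example (container name and image are placeholders):
#   run_with_restart_policy my_ce example/ce-image
#   restart_policy_of my_ce
```

With a policy like this the daemon brings the container back after a reboot or daemon restart, but the configuration applied by init.sh still has to be re-run, which is what the systemd service in the workaround is for.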

Outcome

We reduce the likelihood of containers restarting due to updates of the Docker daemon. The site admin no longer has to do anything manually to bring the containers back up.

Downside

Jobs will be lost if the Docker daemon restarts, as the containers are still managed by the daemon.