qdm12 / deunhealth

Binary program to restart unhealthy Docker containers
MIT License

Bug: Issue restarting containers using network in other stack #11

Open EzekialSA opened 3 years ago

EzekialSA commented 3 years ago

I'm trying to configure everything to be automated for updates and availability using Watchtower and deunhealth. I was testing to see what would happen if gluetun got an update (as you know, it breaks things connected to it when it restarts). I get the following errors when stopping/restarting gluetun:

2021/10/20 12:17:18 INFO container qbittorrent (image ghcr.io/linuxserver/qbittorrent:latest) is unhealthy, restarting it...
2021/10/20 12:17:21 ERROR failed restarting container: Error response from daemon: Cannot restart container qbittorrent: No such container: 5bc959037ff8fceeca8dfae013347f64162fa759189421d224f07a31810f3aaf

I believe that the gluetun container is the one that's referenced by that hash, so it disappears and deunhealth doesn't know how to handle it.
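To illustrate where the stale reference lives, here is a minimal Go sketch (the function name is hypothetical, not from deunhealth's code): Docker resolves `network_mode: "container:gluetun"` to gluetun's container ID at creation time and stores only that ID, so once gluetun is recreated, the stored ID no longer resolves to any container.

```go
package main

import (
	"fmt"
	"strings"
)

// networkModeContainerID extracts the target container reference from a
// HostConfig.NetworkMode value such as "container:<id>". It returns an
// empty string for non-container network modes like "bridge" or "none".
func networkModeContainerID(mode string) string {
	const prefix = "container:"
	if !strings.HasPrefix(mode, prefix) {
		return ""
	}
	return strings.TrimPrefix(mode, prefix)
}

func main() {
	// Docker resolved "container:gluetun" to gluetun's ID at creation time;
	// after gluetun is recreated, this stored ID matches no running container.
	staleMode := "container:5bc959037ff8fceeca8dfae013347f64162fa759189421d224f07a31810f3aaf"
	fmt.Println(networkModeContainerID(staleMode))
}
```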

I don't know if it's worth noting, but I am using Portainer for stack management. Here are the config files for what I'm trying to do:

version: "2.1"
services:
  qbittorrent:
    image: ghcr.io/linuxserver/qbittorrent:latest
    container_name: qbittorrent
    labels:
      - com.centurylinklabs.watchtower.scope=WEEKDAYS
      - deunhealth.restart.on.unhealthy=true
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/London
      - WEBUI_PORT=8095
      - UMASK=002
    healthcheck:
      test: "curl -sf -o /dev/null example.com || exit 1"
      interval: 1m
      timeout: 10s
      retries: 2
    restart: unless-stopped
    network_mode: "container:gluetun"
---
version: "3"
services:
  gluetun:
    image: qmcgaw/gluetun
    container_name: gluetun
    labels:
      - com.centurylinklabs.watchtower.scope=WEEKDAYS
      - deunhealth.restart.on.unhealthy=true
    cap_add:
      - NET_ADMIN
    ports:
      - 8888:8888/tcp # HTTP proxy
      - 8388:8388/tcp # Shadowsocks
      - 8388:8388/udp # Shadowsocks
      - 6881:6881/tcp
      - 6881:6881/udp
      - 8095:8095/tcp
    volumes:
      - /yes/config/gluetun:/gluetun
    environment:
      - VPNSP=nordvpn
      - REGION=United States
      - UPDATE_PERIOD=24h
    restart: unless-stopped
---
version: "3.7"
services:
  deunhealth:
    image: qmcgaw/deunhealth
    container_name: deunhealth
    labels:
      - com.centurylinklabs.watchtower.scope=WEEKDAYS
      - deunhealth.restart.on.unhealthy=true
    network_mode: "none"
    environment:
      - LOG_LEVEL=info
      - HEALTH_SERVER_ADDRESS=127.0.0.1:9999
      - TZ=America/New_York
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
---
version: "3"
services:
  watchtower:
    image: containrrr/watchtower
    container_name: watchtower
    labels:
      - com.centurylinklabs.watchtower.scope=WEEKDAYS
      - deunhealth.restart.on.unhealthy=true
    environment:
      - WATCHTOWER_INCLUDE_RESTARTING=true
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_REVIVE_STOPPED=true
      - WATCHTOWER_ROLLING_RESTART=true
      - TZ=America/New_York
    command: --schedule "0 0 5 * * 1-5" --scope WEEKDAYS
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /etc/docker/daemon.json:/config.json
    restart: always
kubax commented 3 years ago

I second that... that's exactly my problem.

I have disabled updates for gluetun to stop my containers from being left without network.

If that is fixable, I would be very glad!!!

qdm12 commented 3 years ago

That's really strange. So the container can no longer be found with its container ID?! I'll do some more testing.

Meanwhile, I'm almost done with a cascaded restart feature, which should restart containers labeled for it when a certain container (like gluetun) starts.

qdm12 commented 3 years ago

Ah got it. It's because the container ID it was relying on (gluetun) disappeared. Ugh, that's also going to be problematic for my cascaded restart feature... I think the (connected) container config needs to be patched somehow, before being restarted 🤔

qdm12 commented 3 years ago

Ok, so after some research... There is no way to know what the 'vpn' container was, since we only have its ID and it no longer exists (the name is not accessible). I guess deunhealth could stop the connected container, but it wouldn't be able to start it again, so that's a bit pointless, sadly.

Now on to my cascaded restart feature: the idea is that you would put a label on the 'connected' containers indicating the container name of the 'vpn' container. That way, this is feasible. Writing out how it should work (notes for myself too):

  1. Stream events and monitor every container starting
  2. For every start event (e.g. vpn starting), get all containers labeled with the name of the container starting
  3. For each container found:
    • If it is NOT a connected container, just restart it
    • If it is a connected container:
      1. Inspect it and get its entire configuration
      2. Extract the expired ID from this config
      3. Use the container ID from the container starting and replace the expired ID with it in the config
      4. Stop the container
      5. Start a new container using the patched config

I have bits and pieces of it ready, I just need to wire everything up and try it out, but it should work fine.
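The config-patching step above (step 3 for connected containers) could be sketched roughly like this, assuming the NetworkMode value has the usual `container:<id>` form; the function name is hypothetical and the actual Docker API stop/create calls are not shown:

```go
package main

import "fmt"

// patchNetworkMode implements the ID swap described in step 3: if the
// connected container's NetworkMode still points at the expired container
// ID, rewrite it to point at the ID of the container that just started.
// Stopping the old container and creating a new one from the patched
// config would then go through the Docker API (omitted here).
func patchNetworkMode(mode, expiredID, newID string) string {
	if mode == "container:"+expiredID {
		return "container:" + newID
	}
	return mode // unrelated network mode: leave untouched
}

func main() {
	patched := patchNetworkMode("container:oldid", "oldid", "newid")
	fmt.Println(patched) // container:newid
}
```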

qdm12 commented 3 years ago

So... this previous suggestion, let's call it A, won't work if a VPN container was already shut down/restarted, leaving its connected containers disconnected, before deunhealth started. The only solution I can think of, call it B, is to use labels for both the VPN container and the connected containers, and not rely on container names: for example, have a unique label ID on the 'vpn' container, and reference it from all the connected containers.
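The matching logic for solution B could be sketched like this; the label key below is a hypothetical placeholder, since the actual label names weren't settled in this thread:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical label key: the real name was not decided in this thread.
// Connected containers would carry it with the VPN's unique label ID as value.
const connectedToLabel = "deunhealth.vpn.id"

// connectedContainers returns the names of containers whose label references
// the given VPN label ID, i.e. the containers to restart when that VPN starts.
func connectedContainers(labels map[string]map[string]string, vpnLabelID string) []string {
	var names []string
	for name, containerLabels := range labels {
		if containerLabels[connectedToLabel] == vpnLabelID {
			names = append(names, name)
		}
	}
	sort.Strings(names) // deterministic order for display
	return names
}

func main() {
	labels := map[string]map[string]string{
		"qbittorrent":  {connectedToLabel: "myvpn"},
		"transmission": {connectedToLabel: "myvpn"},
		"watchtower":   {},
	}
	fmt.Println(connectedContainers(labels, "myvpn")) // [qbittorrent transmission]
}
```

Because both sides are identified by label rather than by container ID or name, this lookup still works even if the VPN container was recreated before deunhealth started.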

I also came up with another solution, let's call it C, which is more complex to implement and relies only on container names (no labels), although it has the same problem mentioned above. Here's how it would work (notes to myself as well):

  1. When deunhealth starts, gather all containers that are connected to another container, extract each 'vpn' container ID, and find the corresponding container name for each of these IDs (assuming the VPN container is not gone yet)
  2. Stream events and monitor every start event:
    • If the starting container is container-connected, extract the 'vpn' container ID ➡️ get its name and keep the id<->name mapping in state
    • If the starting container's name is one of the VPN names in our id<->name mapping, find all the now-disconnected containers using the old ID (via our mapping), patch their configurations with the new ID, stop and start them, and update the id<->name mapping
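The id<->name state that solution C needs could be sketched as follows (the type and method names are illustrative, not from the actual implementation):

```go
package main

import "fmt"

// vpnIndex holds solution C's state: a two-way mapping between VPN
// container IDs and names, updated as start events stream in.
type vpnIndex struct {
	idToName map[string]string
	nameToID map[string]string
}

func newVPNIndex() *vpnIndex {
	return &vpnIndex{
		idToName: map[string]string{},
		nameToID: map[string]string{},
	}
}

// record registers a VPN container discovered through a connected
// container's NetworkMode (step 1, or the first bullet of step 2).
func (v *vpnIndex) record(id, name string) {
	v.idToName[id] = name
	v.nameToID[name] = id
}

// rotate handles a start event for a known VPN name (second bullet of
// step 2): it returns the expired ID whose connected containers must be
// patched, and replaces it with the new ID in the mapping.
func (v *vpnIndex) rotate(name, newID string) (expiredID string, known bool) {
	expiredID, known = v.nameToID[name]
	if !known {
		return "", false
	}
	delete(v.idToName, expiredID)
	v.record(newID, name)
	return expiredID, true
}

func main() {
	idx := newVPNIndex()
	idx.record("oldid", "gluetun")
	expired, _ := idx.rotate("gluetun", "newid")
	fmt.Println(expired) // oldid
}
```

The catch, as noted, is that this state only exists after deunhealth has observed the VPN container at least once, so it can't recover containers disconnected before deunhealth started.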

Solutions comparison

| Solution | Works on previously disconnected containers at start | Works without label for VPN container | Works without labels for VPN-connected containers | Does not need state |
|----------|:---:|:---:|:---:|:---:|
| A | | ✔️ | | ✔️ |
| B | ✔️ | | | ✔️ |
| C | | ✔️ | ✔️ | |

Now, which solution do you prefer? 😄

I'm leaning towards B to have something that works, although it requires more user fiddling.

EzekialSA commented 3 years ago

Personally, I lean towards B as well. It involves more up-front config with labels, but it makes what is connected explicit, forcing the user to declare that link.

Solution A's automatic monitoring and logging of container information isn't a terrific solution to me.

Solution C, with its tracking of container context, seems like too much effort, and it could cause issues if someone has multiple stacks with overlapping names across a cluster... bad practice, but it could cause a headache for someone down the line.

kubax commented 3 years ago

I pick B. I was elected to lead, not to read! (SCNR)

Labels would be perfectly fine for me.

It also sounds like a little less work on your side with the labels implementation.

oester commented 2 years ago

Another vote for option B.

lennvilardi commented 2 years ago

+1 for option B. Do you know when it will be released?

nlynzaad commented 2 years ago

+1 for option B

qdm12 commented 2 years ago

I'm working on it right now! Hopefully we will have something today :wink:

EDIT (2021-12-06): still working on it, it's a bit more convoluted than I expected code-spaghetti wise, but it's getting there!

qdm12 commented 2 years ago

Note: if the 'network container' (aka the VPN) goes down and doesn't restart, there is no way to properly restart the connected containers, since the label won't be available anywhere, unfortunately. I will make the program log a warning if this happens.

kubax commented 2 years ago

I'm not sure if I got this right.

You are not able to restart the "child" containers if the VPN container killed itself and did not restart, right?

But if the container is updated and restarts without errors, that is still possible to fix with the intended patch?

lennvilardi commented 2 years ago

In my case, I just need to recreate the containers attached to the network container when it is recreated by Watchtower. The network container is always up and running, but the other containers are orphaned and cannot be restarted.

lennvilardi commented 2 years ago

Any ETA?

ahmaddxb commented 2 years ago

Has this been implemented yet?

sunbeam60 commented 2 years ago

A little late to the party here, but definitely also prefer option B and I'm very excited about this feature.

(yes, my gluetun container got updated by watchtower last night and now the whole stack is down 😄 )

qdm12 commented 2 years ago

Hello all, good news: I'm working on this again. Sorry for the immense delay in getting back to it. I have some 'new uncommitted' code (from like 6 months ago, lol) that looks promising. I'm hoping for a solution B implementation soon! :+1:

Manfred73 commented 2 years ago

Should this already be working in a current version combined with deunhealth? I'm still using an older image of gluetun (v3.28.2) so it doesn't get automatically updated by Watchtower. When it does get updated, connectivity to apps using gluetun is lost (https://github.com/qdm12/deunhealth/discussions/34). Or should I keep updating gluetun manually for now?

MajorLOL commented 1 year ago

Any update? :)

STRAYKR commented 1 year ago

I guess Quentin hasn't had time to implement this yet, or else it's a more difficult task than initially thought? It doesn't work for me yet.

The deunhealth log states 0 containers monitored, despite my tagging several containers with deunhealth.restart.on.unhealthy=true:

2023/08/04 10:44:19 INFO Monitoring 0 containers to restart when becoming unhealthy

I turn my mini-PC media server off every evening, so I've been able to use a shell script that runs `docker compose down && docker compose up -d` 2 minutes after the server first boots (Quentin recommends running something similar as a workaround). This fixes my stack... at least for some hours. Sometimes something breaks, and if that happens I just power it off and on again! Looking forward to a more robust solution :-)

NaturallyAsh commented 1 year ago

@STRAYKR Is your deun container in the same yml as gluetun? That was my issue. Logs showed "Monitoring 0 containers" when I added the label to gluetun but deun was in its own yml. When I moved deun to the same yml compose as gluetun and qbittorrent, deun registered the labels and started monitoring the containers. I'm thinking, for my case, that the issue might've been that deun couldn't reach gluetun because it wasn't on the same network.

nolimitech commented 10 months ago

Hello guys. It still doesn't work.

2023/12/30 19:07:39 INFO container qbittorrent (image lscr.io/linuxserver/qbittorrent:latest) is unhealthy, restarting it...
2023/12/30 19:07:43 ERROR failed restarting container: Error response from daemon: Cannot restart container qbittorrent: No such container: 66cfe13371d1b10781c4a0649f96c8a82044f3852a2bbd77524c6f92b1902e35

2023/12/30 19:18:51 INFO container transmission (image lscr.io/linuxserver/transmission:latest) is unhealthy, restarting it...
2023/12/30 19:18:55 ERROR failed restarting container: Error response from daemon: Cannot restart container transmission: No such container: 72a8f02b433e0b443812be3a44171ece10b9cc6191b7d9bcba8fc6cdb012d125

STRAYKR commented 10 months ago

> @STRAYKR Is your deun container in the same yml as gluetun? That was my issue. Logs showed "Monitoring 0 containers" when I added the label to gluetun but deun was in its own yml. When I moved deun to the same yml compose as gluetun and qbittorrent, deun registered the labels and started monitoring the containers. I'm thinking, for my case, that the issue might've been that deun couldn't reach gluetun because it wasn't on the same network.

Hi @NaturallyAsh, sorry for the delayed response. Yes, all config for deun and gluetun is in the same yml docker compose file; I only have the one compose file.

web3dopamine commented 9 months ago

Hi guys, any update on this?

jaredbrogan commented 3 months ago


Just chiming in to keep this issue at least somewhat active. 😄