jittering / traefik-kop

A dynamic docker->redis->traefik discovery agent
MIT License
179 stars 13 forks source link

Not working with containers that have healthchecks #9

Closed Ramblurr closed 2 years ago

Ramblurr commented 2 years ago

traefik-kop works very well in my self-hosted setup. Many thanks for this tool.

However recently I started adding healthchecks to my containers and traefik-kop doesn't seem to play well with that.

With debug enabled I see the log events

time="2022-04-25T14:56:29Z" level=debug msg="Provider event received {Status:start ID:000c8368443c4b636486602894299be6ac6db31cc19e3df91f76521c6a8ef235 From:docker.io/photostructure/server:beta Type:container Action:start Actor:{ID:000c8368443c4b636486602894299be6ac6db31cc19e3df91f76521c6a8ef235 Attributes:map[containerExitCode:0 image:docker.io/photostructure/server:beta <LABELS OMITTED>]} Scope:local Time:1650898589 TimeNano:1650898589195815495}" providerName=docker

time="2022-04-25T14:56:29Z" level=debug msg="Filtering unhealthy or starting container" providerName=docker container=photostructure-000c8368443c4b636486602894299be6ac6db31cc19e3df91f76521c6a8ef235

Basically it seems traefik-kop is detecting the container start, but at the beginning the container is in the starting state, and after some seconds gets promoted to the healthy state. But traefik-kop is never picking up the state transition from starting -> healthy.

If I restart traefik-kop after the container is healthy, traefik-kop detects it just fine.

FWIW I am running this with podman, not docker (by mounting the podman socket into the container as you would with docker). This has worked fine thus far.

chetan commented 2 years ago

@Ramblurr thanks for the report, I'll give it a look. Do you have a sample healthcheck config that you're using? Perhaps something simple that I can reproduce?

chetan commented 2 years ago

@Ramblurr I just tested with a simple nginx healthcheck and it appears to be working for me. Please post more details of a specific config that's not working for you, if you can.

Here's the output when I bounced nginx while adding the healthcheck:

time="2022-05-14T09:43:25-04:00" level=debug msg="Provider event received {Status:die ID:a9648228a9f576f6674a70eebfb01dda2f8f2aea933820f95b7b7eb6a3c5e292 From:nginx:alpine Type:container Action:die Actor:{ID:a9648228a9f576f6674a70eebfb01dda2f8f2aea933820f95b7b7eb6a3c5e292 Attributes:map[com.docker.compose.config-hash:d76290a0a499399dfea874d74860481f1d4b90b44fcb1f8862008c6eaca5d33b com.docker.compose.container-number:1 com.docker.compose.depends_on: com.docker.compose.image:sha256:51696c87e77e4ff7a53af9be837f35d4eacdb47b4ca83ba5fd5e4b5101d98502 com.docker.compose.oneoff:False com.docker.compose.project:testing com.docker.compose.project.config_files:/home/chetan/src/jitter/traefik-kop/testing/docker-compose.yml com.docker.compose.project.working_dir:/home/chetan/src/jitter/traefik-kop/testing com.docker.compose.service:nginx com.docker.compose.version:2.5.0 desktop.docker.io/wsl-distro:Ubuntu exitCode:0 image:nginx:alpine maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> name:testing-nginx-1 traefik.enable:true traefik.http.routers.nginx.rule:Host(`nginx.local`) traefik.http.routers.nginx.tls:true traefik.http.routers.nginx.tls.certresolver:default traefik.http.services.nginx.loadbalancer.server.port:8088 traefik.http.services.nginx.loadbalancer.server.scheme:http]} Scope:local Time:1652535805 TimeNano:1652535805221622200}" providerName=docker
time="2022-05-14T09:43:25-04:00" level=debug msg="Filtering disabled container" container=redis-testing-531009b8196683a8923f15fbcee525b42dd8d2c821730bc3a0a562e5166b0611 providerName=docker
time="2022-05-14T09:43:25-04:00" level=debug msg="Configuration received from provider docker: {\"http\":{},\"tcp\":{},\"udp\":{}}" providerName=docker
time="2022-05-14T09:43:25-04:00" level=info msg="refreshing configuration"
time="2022-05-14T09:43:25-04:00" level=debug msg="removing keys from http_routers: nginx@docker"
time="2022-05-14T09:43:25-04:00" level=debug msg="removing keys matching traefik/http/routers/nginx/*"
time="2022-05-14T09:43:25-04:00" level=debug msg="removing keys from http_services: nginx@docker"
time="2022-05-14T09:43:25-04:00" level=debug msg="removing keys matching traefik/http/services/nginx/*"
time="2022-05-14T09:43:27-04:00" level=debug msg="Provider event received {Status:start ID:a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8 From:nginx:alpine Type:container Action:start Actor:{ID:a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8 Attributes:map[com.docker.compose.config-hash:5c5953e735074124d627a9c9d4609bc59991869468a84caa1af8a9c7a3203312 com.docker.compose.container-number:1 com.docker.compose.depends_on: com.docker.compose.image:sha256:51696c87e77e4ff7a53af9be837f35d4eacdb47b4ca83ba5fd5e4b5101d98502 com.docker.compose.oneoff:False com.docker.compose.project:testing com.docker.compose.project.config_files:/home/chetan/src/jitter/traefik-kop/testing/docker-compose.yml com.docker.compose.project.working_dir:/home/chetan/src/jitter/traefik-kop/testing com.docker.compose.service:nginx com.docker.compose.version:2.5.0 desktop.docker.io/wsl-distro:Ubuntu image:nginx:alpine maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> name:testing-nginx-1 traefik.enable:true traefik.http.routers.nginx.rule:Host(`nginx.local`) traefik.http.routers.nginx.tls:true traefik.http.routers.nginx.tls.certresolver:default traefik.http.services.nginx.loadbalancer.server.port:8088 traefik.http.services.nginx.loadbalancer.server.scheme:http]} Scope:local Time:1652535807 TimeNano:1652535807463668400}" providerName=docker
time="2022-05-14T09:43:27-04:00" level=debug msg="Filtering unhealthy or starting container" providerName=docker container=nginx-testing-a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8
time="2022-05-14T09:43:27-04:00" level=debug msg="Filtering disabled container" providerName=docker container=redis-testing-531009b8196683a8923f15fbcee525b42dd8d2c821730bc3a0a562e5166b0611
time="2022-05-14T09:43:27-04:00" level=debug msg="Configuration received from provider docker: {\"http\":{},\"tcp\":{},\"udp\":{}}" providerName=docker
time="2022-05-14T09:43:27-04:00" level=info msg="Skipping same configuration" providerName=docker
time="2022-05-14T09:43:29-04:00" level=debug msg="Provider event received {Status:health_status: healthy ID:a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8 From:nginx:alpine Type:container Action:health_status: healthy Actor:{ID:a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8 Attributes:map[com.docker.compose.config-hash:5c5953e735074124d627a9c9d4609bc59991869468a84caa1af8a9c7a3203312 com.docker.compose.container-number:1 com.docker.compose.depends_on: com.docker.compose.image:sha256:51696c87e77e4ff7a53af9be837f35d4eacdb47b4ca83ba5fd5e4b5101d98502 com.docker.compose.oneoff:False com.docker.compose.project:testing com.docker.compose.project.config_files:/home/chetan/src/jitter/traefik-kop/testing/docker-compose.yml com.docker.compose.project.working_dir:/home/chetan/src/jitter/traefik-kop/testing com.docker.compose.service:nginx com.docker.compose.version:2.5.0 desktop.docker.io/wsl-distro:Ubuntu image:nginx:alpine maintainer:NGINX Docker Maintainers <docker-maint@nginx.com> name:testing-nginx-1 traefik.enable:true traefik.http.routers.nginx.rule:Host(`nginx.local`) traefik.http.routers.nginx.tls:true traefik.http.routers.nginx.tls.certresolver:default traefik.http.services.nginx.loadbalancer.server.port:8088 traefik.http.services.nginx.loadbalancer.server.scheme:http]} Scope:local Time:1652535809 TimeNano:1652535809813675000}" providerName=docker
time="2022-05-14T09:43:29-04:00" level=debug msg="Filtering disabled container" providerName=docker container=redis-testing-531009b8196683a8923f15fbcee525b42dd8d2c821730bc3a0a562e5166b0611
time="2022-05-14T09:43:29-04:00" level=debug msg="Configuration received from provider docker: {\"http\":{\"routers\":{\"nginx\":{\"service\":\"nginx\",\"rule\":\"Host(`nginx.local`)\",\"tls\":{\"certResolver\":\"default\"}}},\"services\":{\"nginx\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://172.18.0.2:8088\"}],\"passHostHeader\":true}}}},\"tcp\":{},\"udp\":{}}" providerName=docker
time="2022-05-14T09:43:30-04:00" level=debug msg="No entryPoint defined for this router, using the default one(s) instead: []" routerName=nginx
time="2022-05-14T09:43:30-04:00" level=info msg="refreshing configuration"
time="2022-05-14T09:43:30-04:00" level=debug msg="found http service: nginx@docker"
time="2022-05-14T09:43:30-04:00" level=debug msg="found router nginx for service nginx"
time="2022-05-14T09:43:30-04:00" level=debug msg="found container '/testing-nginx-1' (a05e3e82affc943e25cecba7ff12200ccaab2699abd48a0612a1813f19f6f7e8) for service 'nginx'"
time="2022-05-14T09:43:30-04:00" level=debug msg="using explicitly set port 8088 for nginx"
time="2022-05-14T09:43:30-04:00" level=debug msg="writing traefik/http/routers/nginx/rule = Host(`nginx.local`)"
time="2022-05-14T09:43:30-04:00" level=debug msg="writing traefik/http/services/nginx/loadBalancer/servers/0/url = http://172.28.183.97:8088"
time="2022-05-14T09:43:30-04:00" level=debug msg="writing traefik/http/services/nginx/loadBalancer/passHostHeader = true"
time="2022-05-14T09:43:30-04:00" level=debug msg="writing traefik/http/routers/nginx/tls/certResolver = default"
time="2022-05-14T09:43:30-04:00" level=debug msg="writing traefik/http/routers/nginx/service = nginx"
mrtnbr commented 2 years ago

I am seeing this same issue with my pihole container:

traefik-kop  | time="2022-06-18T22:47:57Z" level=info msg="creating new redis store at 192.168.0.89:6379 for hostname traefik-kop"
traefik-kop  | time="2022-06-18T22:47:57Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
traefik-kop  | time="2022-06-18T22:47:57Z" level=info msg="Starting provider *docker.Provider {\"watch\":true,\"endpoint\":\"unix:///var/run/docker.sock\",\"swarmModeRefreshSeconds\":\"15s\"}"
traefik-kop  | time="2022-06-18T22:47:57Z" level=debug msg="Provider connection established with docker 20.10.17 (API 1.41)" providerName=docker
traefik-kop  | time="2022-06-18T22:47:57Z" level=debug msg="Filtering unhealthy or starting container" providerName=docker container=pihole3-02-system-867e02fe43830bb4f87bc0f091de639e2ee842df041974bca811ecdfe2c60a61

If I wait a bit and restart the traefik-kop container, it starts working.

chetan commented 2 years ago

@mrtnbr Any chance you can provide your complete docker config for both pihole and kop?

mrtnbr commented 2 years ago

@mrtnbr Any chance you can provide your complete docker config for both pihole and kop?

This is the docker-compose I use. Hope this helps.

version: "2.4"

networks:
  docker_vlan:
    external: true
  default:
    external: true
    name: docker_vlan

services:
  pihole3:
    image: pihole/pihole:latest
    container_name: pihole3
    hostname: pihole3
    restart: always
    ports:
      - 80:80
    networks:
      docker_vlan:
        ipv4_address: ${PIHOLE3_IP}
    mac_address: ${PIHOLE3_MAC}
    environment:
      - TZ=***redacted***
      - ServerIP=${PIHOLE3_IP}
      - CORS_HOSTS=***redacted***
      - VIRTUAL_HOST=***redacted***
      - WEBPASSWORD=***redacted***
      - DNS1=1.1.1.1
      - DNS2=1.0.0.1
      - DNS_FQDN_REQUIRED=false #for WG DNS
      - DNS_BOGUS_PRIV=true
      - REV_SERVER=true
      - REV_SERVER_DOMAIN=***redacted***
      - REV_SERVER_TARGET=192.168.0.1
      - REV_SERVER_CIDR=192.168.0.0/24
    volumes:
      - ${VOL_DIR}/pihole3/etc-pihole/:/etc/pihole
      - ${VOL_DIR}/pihole3/etc-dnsmasq.d/:/etc/dnsmasq.d
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.pihole3.rule=Host(`pihole3.${DOMAINNAME}`)"
      - "traefik.http.routers.pihole3.tls=true"
      - "traefik.http.routers.pihole3.tls.certresolver=le"
      - "traefik.http.routers.pihole3.middlewares=pihole-add-prefix@docker,secureheaders@docker,ipwhitelist@docker"
      - "traefik.http.services.pihole3.loadbalancer.server.port=80"
      - "traefik.docker.network=docker_vlan" #for traefik-kop

  traefik-kop:
    image: ghcr.io/jittering/traefik-kop:latest
    container_name: traefik-kop
    hostname: traefik-kop
    restart: always
    networks:
      docker_vlan:
        ipv4_address: ${TRAEFIK_KOP_IP}
    mac_address: ${TRAEFIK_KOP_MAC}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - TZ=***redacted***
      - HOSTNAME=***redacted***
      - REDIS_ADDR=192.168.0.89:6379
      - VERBOSE=true
chetan commented 2 years ago

@mrtnbr thanks, this helps a lot, but unfortunately I am not able to totally reproduce your issue. It seems the pihole container has a healthcheck specified in the Dockerfile and takes a bit longer to come up healthy than my earlier tests with nginx.

I'm using this basic pihole config (shouldn't matter much compared to yours):

  pihole:
    image: "pihole/pihole:latest"
    restart: unless-stopped
    ports:
      - 8089:80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.pihole.rule=Host(`pihole.local`)"
      - "traefik.http.routers.pihole.tls=true"
      - "traefik.http.routers.pihole.tls.certresolver=default"
      - "traefik.http.services.pihole.loadbalancer.server.scheme=http"
      - "traefik.http.services.pihole.loadbalancer.server.port=8089"

And this is the kop output (kop already running in a console, dc up pihole in terminal 2):

time="2022-06-21T20:31:41-04:00" level=debug msg="Provider event received {Status:start ... From:pihole/pihole:latest Type:container Action:start ... }" providerName=docker
time="2022-06-21T20:31:41-04:00" level=debug msg="Filtering unhealthy or starting container" providerName=docker container=pihole-testing-8f29a392f221a3c97e2936df5c8908068f31442e7d0220f3ec831e8683fa57c4
time="2022-06-21T20:31:41-04:00" level=debug msg="Configuration received from provider docker: {\"http\":{},\"tcp\":{},\"udp\":{}}" providerName=docker
time="2022-06-21T20:31:41-04:00" level=info msg="Skipping same configuration" providerName=docker

# then 30 seconds later:

time="2022-06-21T20:32:11-04:00" level=debug msg="Provider event received {Status:health_status: healthy ... From:pihole/pihole:latest Type:container Action:health_status: healthy Actor: ... }" providerName=docker
time="2022-06-21T20:32:11-04:00" level=debug msg="Configuration received from provider docker: {\"http\":{\"routers\":{\"pihole\":{\"service\":\"pihole\",\"rule\":\"Host(`pihole.local`)\",\"tls\":{\"certResolver\":\"default\"}}},\"services\":{\"pihole\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://172.20.0.2:8089\"}],\"passHostHeader\":true}}}},\"tcp\":{},\"udp\":{}}" providerName=docker
time="2022-06-21T20:32:11-04:00" level=debug msg="No entryPoint defined for this router, using the default one(s) instead: []" routerName=pihole

time="2022-06-21T20:32:11-04:00" level=info msg="refreshing configuration"
time="2022-06-21T20:32:11-04:00" level=debug msg="found http service: pihole@docker"
time="2022-06-21T20:32:11-04:00" level=debug msg="found router 'pihole@docker' for service pihole"
time="2022-06-21T20:32:11-04:00" level=debug msg="found container '/testing-pihole-1' (8f29a392f221a3c97e2936df5c8908068f31442e7d0220f3ec831e8683fa57c4) for service 'pihole'"
time="2022-06-21T20:32:11-04:00" level=debug msg="no network label set for pihole@docker"
time="2022-06-21T20:32:11-04:00" level=debug msg="found router 'pihole@docker' for service pihole"
time="2022-06-21T20:32:11-04:00" level=debug msg="found container '/testing-pihole-1' (8f29a392f221a3c97e2936df5c8908068f31442e7d0220f3ec831e8683fa57c4) for service 'pihole'"
time="2022-06-21T20:32:11-04:00" level=debug msg="overriding service port from container host-port: using 8089 (was 8089) for pihole@docker"
time="2022-06-21T20:32:11-04:00" level=debug msg="writing traefik/http/services/pihole/loadBalancer/passHostHeader = true"
time="2022-06-21T20:32:11-04:00" level=debug msg="writing traefik/http/routers/pihole/tls/certResolver = default"
time="2022-06-21T20:32:11-04:00" level=debug msg="writing traefik/http/routers/pihole/service = pihole"
time="2022-06-21T20:32:11-04:00" level=debug msg="writing traefik/http/routers/pihole/rule = Host(`pihole.local`)"
time="2022-06-21T20:32:11-04:00" level=debug msg="writing traefik/http/services/pihole/loadBalancer/servers/0/url = http://192.168.80.99:8089"

Can you confirm if you are running docker engine or podman? Perhaps podman isn't* handling healthchecks in the same way as docker engine?

chetan commented 2 years ago

Just tested with podman and am able to reproduce this issue. The container is flagged as unhealthy/starting up and the followup event never fires. Not sure if this is a bug in podman or traefik..

chetan commented 2 years ago

Relevant issue https://github.com/traefik/traefik/issues/8833

mrtnbr commented 2 years ago

I use docker engine.

username@hostname:~$ docker --version
Docker version 20.10.17, build 100c701
username@hostname:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:    11
Codename:   bullseye
chetan commented 2 years ago

@Ramblurr @mrtnbr just pushed a fix for this issue using a fallback polling mechanism. See the release note here.

mrtnbr commented 2 years ago

@Ramblurr @mrtnbr just pushed a fix for this issue using a fallback polling mechanism. See the release note here.

Thanks, this seems to do what it needs to do :)