lucaslorentz / caddy-docker-proxy

Caddy as a reverse proxy for Docker
MIT License

Feature request: Apply labels after a service is healthy #595

Open sowinski opened 3 months ago

sowinski commented 3 months ago

Hi,

I am currently trying to implement rolling deployments with Docker (using Docker Swarm).

When I start a new container, traffic is immediately forwarded to the newest container, even while it is still booting up. I would like traffic to continue going to the old container until the new one is "healthy".

It would be nice if we could control this behavior with a label.
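As a sketch of the idea, an opt-in label might look something like this (the wait_for_healthy name is purely hypothetical and does not exist in caddy-docker-proxy today; the rest is the usual label syntax):

```yaml
services:
  web:
    image: myapp:latest
    deploy:
      labels:
        caddy: example.com
        caddy.reverse_proxy: "{{upstreams 8000}}"
        # hypothetical option, does not exist today: only add this task
        # as an upstream once its Docker healthcheck reports "healthy"
        caddy_docker_proxy.wait_for_healthy: "true"
```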

smaccona commented 2 months ago

@sowinski we have this issue too - once we issue a docker service update, new containers start being created in our Swarm, and some users see HTTP 502 or 503 errors because Caddy starts forwarding traffic to those new containers before they are ready. It would be great if Caddy Docker Proxy took healthchecks into account when deciding whether an upstream is ready to receive traffic, and didn't apply the labels to the running Caddy configuration until the healthcheck passes.

Even a more rudimentary approach like a manually-configured delay would be acceptable, if not ideal: you could run into a situation where all of the new containers become ready before the delay expires, in which case Docker would already have removed all of the old containers while Caddy was still waiting out the delay, and the service would be briefly unavailable. The delay approach might actually be harder to implement anyway, since a task's health status is (probably?) already exposed via the Docker API.
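As a partial mitigation in the meantime, Caddy's reverse_proxy directive supports active upstream health checks, and caddy-docker-proxy can set them through nested labels. This doesn't stop the new container from being added as an upstream (there is still a window before the first active check runs), but it lets Caddy mark a not-yet-ready upstream as unhealthy quickly. A sketch, assuming the app answers on /health and with illustrative timings:

```yaml
    deploy:
      labels:
        caddy: health.example.com
        caddy.reverse_proxy: "{{upstreams 8000}}"
        # active health check against each upstream (path is an assumption)
        caddy.reverse_proxy.health_uri: /health
        caddy.reverse_proxy.health_interval: 2s
        # passive check: after a failed request, avoid the upstream briefly
        caddy.reverse_proxy.fail_duration: 10s
```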

@lucaslorentz any thoughts on this?

smaccona commented 1 month ago

How to reproduce

I decided to put together a simple deployment which illustrates this in a Docker Swarm environment. Prerequisites: a Docker Swarm setup (the number of nodes doesn't matter). In my environment, Caddy is running on every node and Caddy's network is qr-caddy (this doesn't really matter; it is just provided so the files below make sense).

First, our YAML file for the deployment. I used a stock Docker HTTP "Hello World" image based on busybox (see https://github.com/crccheck/docker-hello-world) which has a built-in health check (it just does nc -z localhost $PORT). In the examples below, I substituted health.example.com for the actual domain I tested with, and I also modified the command to include a sleep 20 on container start, which makes it easier to observe an update in progress.

version: '3.3'

services:
  http_pause:
    image: crccheck/hello-world
    command: sh -c 'sleep 20 ; echo "httpd started" && trap "exit 0;" TERM INT; httpd -v -p 8000 -h /www -f & wait'
    deploy:
      labels:
        caddy: health.example.com
        caddy.reverse_proxy: "{{upstreams 8000}}"
    networks:
      - qr-caddy

networks:
  qr-caddy:
    external: true
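(For images without a built-in HEALTHCHECK, an equivalent check can be declared per-service in the stack file. This sketch mirrors the nc -z probe the hello-world image ships with; the timings are assumptions, not values from the image:)

```yaml
    healthcheck:
      test: ["CMD", "nc", "-z", "localhost", "8000"]
      interval: 5s
      timeout: 3s
      retries: 3
```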

I deployed this using docker stack deploy -c caddy_health.yaml caddy_health, and after it was deployed I scaled it to two replicas using docker service scale caddy_health_http_pause=2. I watched the Caddy logs to see the TLS cert was obtained, and then confirmed I was able to reach the service:

$ curl -I https://health.example.com
HTTP/2 200 
accept-ranges: bytes
alt-svc: h3=":443"; ma=2592000
content-type: text/html
date: Wed, 15 May 2024 21:47:31 GMT
etag: "618f3cac-1b7"
last-modified: Sat, 13 Nov 2021 04:18:52 GMT
server: Caddy
content-length: 439

$ curl https://health.example.com
<pre>
Hello World

                                       ##         .
                                 ## ## ##        ==
                              ## ## ## ## ##    ===
                           /""""""""""""""""\___/ ===
                      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
                           \______ o          _,/
                            \      \       _,'
                             `'--.._\..--''
</pre>

Now for the test. On my workstation, I ran 10,000 queries (in 10 parallel batches) against the service, and while that was running, forced the service to update/redeploy. I captured only the HTTP status code from curl, and then piped that to uniq -c to count how many instances of each status code I obtained. Here's the workstation command line:

$ seq 1 10000 | xargs -n 1 -P 10 -I {} curl -s -o /dev/null -w "%{http_code}\n" https://health.example.com | uniq -c
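One note on that pipeline: uniq -c only collapses adjacent duplicates, so interleaved status codes produce multiple groups that have to be tallied by hand; sorting first yields a single count per code:

```shell
# uniq -c only merges adjacent identical lines; sort first so each
# status code appears exactly once, then sort the counts descending:
printf '200\n502\n200\n200\n502\n' | sort | uniq -c | sort -rn
# prints counts like "3 200" then "2 502"
```

So appending sort before the uniq -c in the reproduction command would give the tallied counts directly.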

And while that's running, I do the following on the server:

# docker service update --force caddy_health_http_pause
caddy_health_http_pause
overall progress: 2 out of 2 tasks 
1/2: running   [==================================================>] 
2/2: running   [==================================================>] 
verify: Service converged 

The update took about a minute to converge, because Docker waits for each replica to become healthy before proceeding (even though my fake startup delay is 20 seconds, it takes longer than that because of the default interval for health checks). Here are the results from the uniq -c step (I did have to tally the counts up by hand, because without a preceding sort, uniq -c split the output into multiple groups):

9919 200
   1 000
  80 502

The 000 is a failed request with no HTTP status (presumably a connection or TLS failure that occurs when the upstream service is killed while a request is in progress) - I'm not worried about this, because we can reconfigure the service to finish responding to in-flight requests before shutting down. The one that concerns me is the 80 502s, which mean Caddy could not get a valid response from the upstream (presumably because it was still starting up). Frankly, I expected this number to be higher given the 20-second fake delay, so maybe something is missing from this equation, but a ~1% failure rate (it was higher in some of my other runs) is significant.
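The reconfiguration mentioned for the 000 case can be sketched in the stack file: the container command above already traps TERM, so it mainly needs enough grace time to drain in-flight requests before Docker sends KILL (the 30s value is an assumption, not a measured requirement):

```yaml
  http_pause:
    # give the container time to finish in-flight requests after TERM
    stop_grace_period: 30s
```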

I will continue to iterate on this example setup, but at the moment it does look like this is a valid breaking use case. I can't imagine too many people have a "set it and forget it" setup where they publish a service once and never have to update it - in our case, we often push updates daily, and often multiple times per day, so we will encounter this at least daily.