This is NOT zero downtime

SponsorAds commented 1 month ago

Just a note, especially for new people coming from Google: this implementation does not provide zero downtime at all. It might "work" for very simple, often request-based applications (e.g. PHP) but will fail in most other situations. But so will `docker compose up --build`.

The reason is quite simple: you start two containers (blue and green) and then stop the currently active one, and while it is being stopped you have a load-balancing issue. Traefik keeps routing requests to the container while Docker is still in the process of stopping it, resulting in errors (e.g. the Node.js process inside the container has already exited, but Traefik still sees the container).

This issue is in fact not solvable with docker/compose but only with swarm/k8s. Swarm gives you the ability to change labels dynamically, e.g. setting the priority/weight of the container before stopping it.

maxcountryman commented 1 month ago

Did you read the blog post? The "issue" you point out is addressed and solved by this implementation and this is in fact zero downtime.

SponsorAds commented 1 month ago

How is it solved? Did you actually test it?

Let me repeat slowly:

1. You start two containers.
2. You stop one container.
3. The stopping container is still known to the proxy.
4. The proxy keeps routing 50% of the traffic to it while the container is stopping.

How did this solve that issue?
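
To make the sequence above concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions: strict round-robin between two backends, a fixed request index at which the old container's process dies, and a later index at which the proxy finally drops it. The function name and its parameters are made up for illustration and have nothing to do with Traefik's actual internals.

```python
def simulate_switchover(total_requests, stop_at, deregistered_at):
    """Count failed requests in a toy blue/green switchover.

    Hypothetical model, not Traefik's real behaviour:
    - requests are spread round-robin over "blue" and "green"
    - blue's process is dead from request index `stop_at` onwards
    - the proxy only stops routing to blue at `deregistered_at`
    """
    backends = ["blue", "green"]
    failed = 0
    for i in range(total_requests):
        # Once blue is deregistered, only green is routable.
        routable = ["green"] if i >= deregistered_at else backends
        target = routable[i % len(routable)]
        # A request sent to blue after its process died fails.
        if target == "blue" and i >= stop_at:
            failed += 1
    return failed


# 40 requests arrive while blue is dead but still routable; half of them fail.
print(simulate_switchover(100, stop_at=20, deregistered_at=60))  # 20
```

If the proxy dropped the container at the same moment the process died (`stop_at == deregistered_at`), the model reports zero failures. The gap between those two points is the whole problem.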

Funnily enough, you write the following in the comment: "Set Traefik priority label to 0 on the old service and stop the old environment if it was previously running"

That would in fact solve it a little better, though still not 100%. But you do NOT set the priority anywhere; you simply stop the container, resulting in failed requests. You also can't, as Compose has no dynamic labels and Traefik has no API for it. Also: setting the priority to 0 makes Traefik fall back to the default priority. There is no priority 0. So even with that solution (as with Swarm's dynamic labels) it would still not be zero downtime under higher load.

Tip: set up a small compose file with a service that takes longer to stop and start. That can easily be done by e.g. exposing a high port range, as the (de-)allocation takes some time. See how the Node process inside your container is already killed (failed requests), while the container easily takes 1-2 minutes to actually stop and disappear from Traefik. As I said: your "solution" works for very simple projects which can start/stop in 1s. But even then it is objectively not zero downtime.
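
One way to run that test is to hammer the proxied URL with requests while the deployment script runs and count what fails. The following is a minimal sketch using only the Python standard library. The `probe` helper and its defaults are hypothetical, and the URL you pass in depends entirely on your own setup.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, count=100, interval=0.05, timeout=1.0):
    """Send `count` sequential GET requests and return (ok, failed).

    A request counts as failed on any connection error, timeout,
    or HTTP error status (urlopen raises HTTPError for 4xx/5xx).
    """
    ok = failed = 0
    for _ in range(count):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if 200 <= resp.status < 400:
                    ok += 1
                else:
                    failed += 1
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            failed += 1
        time.sleep(interval)
    return ok, failed
```

Start something like `probe("http://localhost/", count=600, interval=0.1)` in one terminal (the URL is a placeholder), trigger the blue/green switch in another, and look at the second number of the result. Anything above zero means requests were dropped during the switch.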

maxcountryman commented 1 month ago

Read the article.
