Docker stack deploy (redeploy) not updating running containers: image versions / envs on replicas after failing update for the first time #34299

Open DBLaci opened 7 years ago

DBLaci commented 7 years ago

Description

I update my Swarm stacks by running: docker stack deploy --prune --compose-file xyz.yml --with-registry-auth

In most cases everything works fine: if I change an env var or the image version, the replicas are updated according to the update policy.

But sometimes (it's rare, less than 1 in 20 times) 1 of the 2 replicas is not updated, although the healthcheck is fine and both replicas are running.

Steps to reproduce the issue:

  1. Deploy a stack from a docker-compose file
  2. Change image tag to newer one in the compose file
  3. Redeploy stack with: docker stack deploy --prune --compose-file xyz.yml --with-registry-auth

Describe the results you received:

Sometimes 1 of the 2 replicas is not updated to the new image.
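
A quick way to confirm the mismatch is to compare the image each task is actually running against the image the current service spec asks for. A minimal sketch; the service name mystack_web below is a placeholder:

    # list the service's tasks with untruncated image references
    docker service ps --no-trunc mystack_web

    # the image the current service spec asks for
    docker service inspect \
      --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' mystack_web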

Describe the results you expected:

All replicas are updated to the new image version.

Additional information you deem important (e.g. issue happens only occasionally):

The image is hosted in a GitLab private registry, but I don't think it is an auth error. (In fact the outdated replica is not even shut down, so no update is even attempted.)

The compose file has nothing special in it, but I can't paste the whole thing as it is private. It has 1 service defined in it this time, but the problem sometimes happens with other compose files with more services. The deploy config:

    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 30s

If I kill the old instance on the specific node with docker kill, the right (updated) version is started.

25 minutes after the stack deploy, the outdated replica is still not updated.
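
A gentler workaround than docker kill may be to force the service to reschedule all of its tasks even though the spec is unchanged. A sketch, again with a placeholder service name:

    # force-redeploy every task of the service; this still honours
    # update_config, so with parallelism: 1 the replicas are
    # replaced one at a time
    docker service update --force mystack_web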

Output of docker version:

This version is running on all nodes of the swarm. I also experienced this problem with 17.06.0, but it is rare; this is the 3rd time I am reporting it.

Client:
 Version:      17.07.0-ce-rc1
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8c4be39
 Built:        Wed Jul 26 03:46:39 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.07.0-ce-rc1
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   8c4be39
 Built:        Wed Jul 26 03:45:32 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 47
 Running: 18
 Paused: 0
 Stopped: 29
Images: 265
Server Version: 17.07.0-ce-rc1
Storage Driver: btrfs
 Build Version: Btrfs v4.4
 Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: XXXXXXXXXXXXXXX
 Is Manager: true
 ClusterID: XXXXXXXXXXXXXXX
 Managers: 3
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false  
 Node Address: 10.4.3.140
 Manager Addresses:
  xxxxxxxx:2377
  xxxxxxxx:2377
  xxxxxxxx:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.8.0-53-generic   
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859GiB
Name: dliver-docker-0
ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Docker Root Dir: /var/lib/docker   
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

It runs on CloudStack VPSes. Not every node uses btrfs.

DBLaci commented 7 years ago

More info:

  1. I changed environment variables in the docker-compose file and deployed it with the docker stack deploy command. The service in the stack runs with 2 replicas. (The image tag was not changed.)

  2. The docker stack deploy command itself succeeded, but I had made a mistake in the networks section, so the updated container never reached the Running state because of the error. The second replica was therefore left intact. (This is normal and the preferred behaviour, given the update policy: parallelism: 1.)

  3. I fixed the docker-compose file and deployed it again. The first replica (which had not been able to start properly) got into the Running state! BUT the second replica was not restarted, so it is Running but with the old environment variables.

  4. I killed the second replica with docker kill and it started with the new environment variables, as intended.

I guess the problem is that on the first deploy the service spec is updated but the second replica is not restarted, because of the update policy (parallelism: 1) and the failing start; then on the second deploy the outdated container is never restarted, no matter what was changed (env, image tag).
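
If that is right, the service may be left with its rolling update in a paused state: update_config's failure_action defaults to pause, so a task that fails to start pauses the update. A way to check, and a possible recovery (sketch; the service name is a placeholder):

    # the orchestrator's view of the last rolling update;
    # a State of "paused" means it stopped after a failure
    docker service inspect --format '{{json .UpdateStatus}}' mystack_web

    # forcing an update reschedules the remaining outdated tasks
    docker service update --force mystack_web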

samuel-miller commented 7 years ago

I have a very similar issue when using deploy: mode: global.

Not all of my nodes get updated to the image version specified in the compose file.
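
For the global-mode case, a quick way to see per node which image each task is running (service name is a placeholder):

    # one line per task: node, image, and current state
    docker service ps \
      --format '{{.Node}}\t{{.Image}}\t{{.CurrentState}}' mystack_web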

daanjipping commented 6 years ago

Also having this issue on 18.03.0-ce, using 1 manager and 3 nodes. One container stuck on a single node doesn't update to the new version when redeploying the stack.

Client:
 Version:   18.03.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    0520e24
 Built: Wed Mar 21 23:06:22 2018
 OS/Arch:   darwin/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:  18.03.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   0520e24
  Built:    Wed Mar 21 23:08:36 2018
  OS/Arch:  linux/amd64
  Experimental: false

bitgandtter commented 6 years ago

Same issue here: randomly, some containers do not update on a deploy.

dshields4 commented 6 years ago

Might this require restart_policy: condition: any? I was experiencing the same until I tried making that change.
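
For what it's worth, the restart condition can also be changed on a running service without redeploying the whole stack; a sketch with a placeholder service name:

    # valid conditions are "none", "on-failure" and "any"
    docker service update --restart-condition any mystack_web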

sorenlorentzen commented 6 years ago

We're facing the same issue. Our swarm has 3 manager nodes and 6 worker nodes, all running Server Version: 18.06.0-ce. Our compose files do have restart_policy: condition: any.

asokani commented 5 years ago

I experience the same problem on 18.09.3.

Sometimes some containers are not updated on docker stack deploy.

gabrielruiu commented 4 years ago

I'll throw my woes into the pot as well: it's an issue in our case too. It even happens when a service has just one replica, so it's basically as if the update doesn't happen at all.

This is our setup:

# docker version
Client:
 Version:           v18.09.0
 API version:       1.39
 Go version:        go1.11.4
 Git commit:        v18.09.0
 Built:             unknown-buildtime
 OS/Arch:           linux/arm
 Experimental:      false

Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.4
  Git commit:       
  Built:            Mon Nov 18 17:22:36 UTC 2019
  OS/Arch:          linux/arm
  Experimental:     false