moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/

Docker Swarm with multiple replicas not stopping cleanly on Server 2019 #38981

Open matt-taranto opened 5 years ago

matt-taranto commented 5 years ago

**Description:**
Hi there, we've just started trying to migrate from Docker on Windows Server 2016 to Server 2019, and have hit some issues trying to stand up and tear down stacks cleanly. I'm sure we're just doing something wrong, as this all just worked on 2016, so any guidance would be much appreciated. We've tried running this on hosts on VMware and on AWS, and can reliably reproduce it with only one node in the swarm.

**Steps to reproduce the issue** (the full OS setup we followed):

  1. Install Windows Server 2019 Datacenter (Desktop Experience)
  2. Install VMware Tools, reboot
  3. Apply all outstanding OS patches, rebooting between each batch
  4. Remove the Windows Defender feature, reboot
  5. Install Docker (from PowerShell):

         Install-PackageProvider -Name NuGet -RequiredVersion "2.8.5.208" -Force
         Install-Module DockerMsftProvider -RequiredVersion "1.0.0.6" -Force
         Install-Package Docker -ProviderName DockerMsftProvider -RequiredVersion "18.09.4" -Force

  6. Reboot
  7. Create a swarm: `docker swarm init --advertise-addr="192.168.1.1"`
  8. Deploy a stack: `docker stack deploy --compose-file ".\repro.yml" stack`

     A minimal reproducible compose file is pasted below. We normally run about 20 services, each with an infinite PowerShell entrypoint script that updates XML configuration, starts a service, and then waits in a loop indefinitely. I've pinned the image digest to make sure the same base container can be used to repro.
    
     version: '3.7'
     services:
       servercore:
         # image: mcr.microsoft.com/windows/servercore:ltsc2019
         image: mcr.microsoft.com/windows/servercore@sha256:a687a87c19b5be715817a237a6a0772d0c7caf74631f99fd35a9633c3568b5c4
         deploy:
           mode: replicated
           replicas: 4
           endpoint_mode: dnsrr
         entrypoint: powershell while($$true){start-sleep -seconds 30}
         networks:
           - network

     networks:
       network:
         external: false

  9. Wait until all replicas are online
  10. Destroy the stack: `docker stack rm stack`
  11. Check running containers after a little while (these will stay stuck indefinitely, even after waiting half an hour; this time only one got stuck, but I've seen n-1 replicas get stuck). A polling sketch follows the output below:

    PS C:\Users\Administrator\Desktop\Docker> docker ps -a
    CONTAINER ID   IMAGE                                  COMMAND                  CREATED              STATUS                               PORTS   NAMES
    754d3f77cdc6   mcr.microsoft.com/windows/servercore   "powershell while($t…"   About a minute ago   Exited (3221225786) 29 seconds ago           stack_servercore.3.c6mxlecnidh0ut9tmyjqgtzyt
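To make the check at step 11 easier to repeat, here's a rough PowerShell sketch (the `name=stack_servercore` filter is just the prefix the stack above produces; adjust it for other stack names):

```powershell
# Poll until no task containers from the stack remain, printing any stragglers.
while ($true) {
    $leftover = docker ps -a --filter "name=stack_servercore" --format "{{.Names}}: {{.Status}}"
    if (-not $leftover) { Write-Host "all replicas removed"; break }
    Write-Host "still present:"
    $leftover
    Start-Sleep -Seconds 10
}
```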


**Describe the results you received:**
 - generally, only the first replica will stop correctly
 - remaining replicas will get stuck in the exited state

**Describe the results you expected:**
 - all replicas stopped and removed

**Additional information you deem important (e.g. issue happens only occasionally):**
Even with manual cleanup between each attempt, e.g. `docker rm $(docker ps -a -q)`, after a number of attempts the subnet IP pool seems to be exhausted, and any attempt to stand up a subsequent stack fails with another error: "Pool overlaps with other one on this address space".
I can get logs from reproducing this issue as well; it just takes an hour or so of stopping and starting to trigger (a sketch of the loop we use is below).
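For anyone who wants to trigger it without babysitting, the stop/start cycling can be automated with something like the sketch below (the iteration count and sleep timings are guesses on my part; `repro.yml` is the compose file from step 8):

```powershell
# Repeatedly deploy and tear down the stack, reporting any leftover containers.
for ($i = 1; $i -le 100; $i++) {
    docker stack deploy --compose-file ".\repro.yml" stack
    Start-Sleep -Seconds 120    # wait for all replicas to come online
    docker stack rm stack
    Start-Sleep -Seconds 60     # give the teardown time to settle
    $leftover = docker ps -a -q
    if ($leftover) {
        Write-Host "iteration ${i}: leftover containers:"
        docker ps -a
        docker rm -f $leftover  # manual cleanup between attempts
    }
}
```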

**Output of `docker version`:**

    Client: Docker Engine - Enterprise
     Version:           18.09.4
     API version:       1.39
     Go version:        go1.10.8
     Git commit:        c3516c43ef
     Built:             03/27/2019 18:22:15
     OS/Arch:           windows/amd64
     Experimental:      false

    Server: Docker Engine - Enterprise
     Engine:
      Version:          18.09.4
      API version:      1.39 (minimum version 1.24)
      Go version:       go1.10.8
      Git commit:       c3516c43ef
      Built:            03/27/2019 18:20:29
      OS/Arch:          windows/amd64
      Experimental:     false


**Output of `docker info`:**

    Containers: 1
     Running: 0
     Paused: 0
     Stopped: 1
    Images: 1
    Server Version: 18.09.4
    Storage Driver: windowsfilter
     Windows:
    Logging Driver: json-file
    Plugins:
     Volume: local
     Network: ics l2bridge l2tunnel nat null overlay transparent
     Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
    Swarm: active
     NodeID: ghb4x7uuse2l8ps1kfpknuxwm
     Is Manager: true
     ClusterID: dlzrmcaus247hjudwu37h340n
     Managers: 1
     Nodes: 1
     Default Address Pool: 10.0.0.0/8
     SubnetSize: 24
     Orchestration:
      Task History Retention Limit: 5
     Raft:
      Snapshot Interval: 10000
      Number of Old Snapshots to Retain: 0
      Heartbeat Tick: 1
      Election Tick: 10
     Dispatcher:
      Heartbeat Period: 5 seconds
     CA Configuration:
      Expiry Duration: 3 months
      Force Rotate: 0
     Autolock Managers: false
     Root Rotation In Progress: false
     Node Address: 10.60.25.181
     Manager Addresses:
      10.60.25.181:2377
    Default Isolation: process
    Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
    Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.379)
    OSType: windows
    Architecture: x86_64
    CPUs: 4
    Total Memory: 15.54GiB
    Name: EC2AMAZ-P4KR443
    ID: FOBI:ZUHN:H7RY:LGIE:S2YH:IG2I:TVC4:IVKJ:LJES:LOQK:RQ2N:EYGN
    Docker Root Dir: C:\ProgramData\docker
    Debug Mode (client): false
    Debug Mode (server): true
     File Descriptors: -1
     Goroutines: 168
     System Time: 2019-03-31T23:43:55.6110218Z
     EventsListeners: 0
    Registry: https://index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: false



**Additional environment details (AWS, VirtualBox, physical, etc.):**
Reproducible on the latest patched version of Windows Server 2019, on VMware and AWS.
matt-taranto commented 5 years ago

log.txt

matt-taranto commented 5 years ago

Any feedback/suggestions around this one? Happy to try anything, get more logs etc to help run it down.

david-gonzalez-pnw commented 5 years ago

I can confirm @matt-taranto's issue and his logs. The same symptom occurs when deploying multiple services on a Windows worker node.

  1. `docker stack rm` doesn't remove all containers from the node
  2. Trying to redeploy the stack causes exceptions due to network conflicts (because of the orphaned network from item 1 above)
  3. Any docker command run against the orphaned docker resources on the Windows worker times out
  4. The only way to recover the Windows worker to a usable state is by restarting Docker, and at times the only way to recover is by restarting the Windows worker (a rough sketch of this is below)
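For reference, a minimal sketch of that recovery path, assuming the engine runs under the default Windows service name `docker`:

```powershell
# Restart the Docker engine service first.
Restart-Service docker
# If docker commands still hang against the orphaned resources,
# the only remaining option we found was rebooting the worker:
Restart-Computer -Force
```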

PS: Sorry I don't have a specific yml or exception message above; we spent a couple of weeks on this issue trying different things, and it wasn't until I found this thread that I switched the worker node to Windows Server ltsc2016 and everything started working. We then had to make up for lost time, so I didn't gather any of the failure artifacts. The yml is easy to recreate, as Matt has indicated above.

McKeownDesign commented 4 years ago

Same issue for me: RHEL 7 manager, Windows Server 2019 worker.

Exact same behaviour as described above.