yep, we just had the same problem. Our services within the swarm couldn't communicate with each other either. :( :(
I already had issues with the 2020-04 cumulative update.
Currently my swarm status is: containers using the overlay network can contact services running in our network like mq and db, but if you go inside a container and try to use tracert or ping, they can't connect to anything!? Also, if I expose a port from a container to the host, it does not work.
The ingress network is somehow messed up. Deleting all HNS networks and letting Docker create them again does not help (a rough sketch of that attempt is below); the same situation continues.
Nat network works.
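For reference, deleting and recreating the HNS networks looked roughly like this (a sketch, assuming the hns.psm1 helper module from the microsoft/SDN repo, which is not built into Windows):

Stop-Service docker
Import-Module .\hns.psm1               # from github.com/microsoft/SDN, Kubernetes/windows
Get-HnsNetwork | Remove-HnsNetwork     # removes all HNS networks, including ingress and nat
Start-Service docker                   # Docker recreates its default networks on restart

The ingress network comes back, but in the same broken state.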
Same problem for us. Containers can't communicate with each other.
@djarvis How did you uninstall that package?
I tried wusa /uninstall /kb:4551853
but it said "...required by your computer and cannot be uninstalled"
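In case it helps anyone, DISM can sometimes remove a package that wusa refuses, though it fails the same way if the package is marked permanent (a sketch; the exact package name has to be looked up first):

dism /online /get-packages | findstr KB4551853
dism /online /remove-package /packagename:<package name from the previous output>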
@alexftr we have the same issue with a couple of our environments in Azure. We had the issue happen to one of our dev environments a week back. But at that time we were able to uninstall it. Today it happened in our QA environments, but now we cannot uninstall it. We get the following error:
Did anyone manage to fix the Swarm network issue?
@alexftr Same story as @avinhan, we were able to freely uninstall it a few weeks ago but now we are not. Looking for a fix to this or a way to rip out this update that seems to now be cemented in.
Having the same issue. Uninstalled it, and now I can access other services by name from another service.
@jabteles How did you uninstall the update?
Uninstalling with the above wusa command did the trick on my workers.
FWIW: it is not only swarm related. I had issues with my kubernetes / flannel / docker workers.
Start > type "update" > click "Windows Update Settings" > click "View update history" > click "Uninstall updates" > filter "KB4551853" in the search field > select the update and click Uninstall
ping @daschott @dperny @taylorb-microsoft ptal
Ok. Yes. I was able to do that on one machine a few weeks ago, but now, and on other machines, it seems the update cannot be uninstalled so easily.
What is the error? Maybe a reboot is necessary first?
There is no "uninstall" link in the UI, and this command:
wusa /uninstall /kb:4551853
comes up with some "...required by your computer and cannot be uninstalled" message.
Apparently if you create a Windows Server 2019 VM in Azure today the KB is already installed and is unable to be removed. On VMs created months ago I was able to uninstall this update and prevent it from being installed.
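One way around that is to pin new VMs to an image version published before the May update (a sketch using the Azure CLI image URN syntax; the version string is the April 2020 build mentioned later in this thread, and its availability for this SKU is an assumption):

az vm create --resource-group myRG --name mgr --admin-username vmadmin --image MicrosoftWindowsServer:WindowsServer:2019-Datacenter:17763.1158.2004131759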
I understand; I can't help with that. I'm using on-premises VMs and there it was possible.
Thanks for reporting this issue, I will raise this issue with the team for further investigation. If someone has a repro and could be kind enough to run https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 and share the service/container names, that would be helpful in accelerating resolution.
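For anyone with a repro, fetching and running the collector is roughly (a sketch):

Invoke-WebRequest -UseBasicParsing https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 -OutFile collectlogs.ps1
.\collectlogs.ps1    # dumps HNS and network state and prints the output directory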
@djarvis @avinhan Yes, I couldn't uninstall that update. I had to recreate the VMs in Azure from a snapshot taken a month ago. This was the only option for me.
Hi @daschott any updates on a resolution / workaround?
@avinhan workaround is to uninstall the update. Worked for me
Edited: I'm sorry, I didn't see that you couldn't uninstall the update.
@daschott are you still looking for debug logs? I have a fully automated repro
@tfenster yes please. Unfortunately this is not reproducing for me and others that have tried on a Windows Server 2019 machine with the latest hotfix. Perhaps someone can share a sample .yaml file as well, in case there is something different in the configuration?
@daschott I tried to narrow it down to a simpler scenario, and there it doesn't break anymore. I'll need to expand it until I get to my more complex scenario to find out where it breaks. If you want, you could clone https://github.com/tfenster/bc-swarm and then follow https://raw.githubusercontent.com/tfenster/BC-Swarm/master/BC/howto.md, which definitely breaks. I'll try to simplify that, but it will probably take me 24-48h to get there.
@daschott I can reproduce it this morning by spinning up 2 Azure VMs (--image Win2019Datacenter).
Create the swarm (docker swarm init, then join), as sketched below.
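A minimal sketch of that, with placeholders instead of real IPs and tokens:

# on the first VM:
docker swarm init --advertise-addr <private IP of this VM>
# on the second VM, using the join command that init prints:
docker swarm join --token <token> <manager IP>:2377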
Then spin up a dummy stack, assuming the YAML below is saved in a file test.yml:
docker stack deploy --compose-file test.yml test
Then try to ping (or curl) one service from the other; both commands fail/time out:
docker exec <containerIdofWeb> ping api
docker exec <containerIdofWeb> curl http://api
version: "3.7"
services:
web:
image: mcr.microsoft.com/dotnet/core/samples:aspnetapp-nanoserver-1809
ports:
- "80:80"
deploy:
mode: replicated
restart_policy:
condition: on-failure
api:
image: mcr.microsoft.com/dotnet/core/samples:aspnetapp-nanoserver-1809
ports:
- "8080:80"
deploy:
mode: replicated
restart_policy:
condition: on-failure
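To separate service discovery from the data path, it may also be worth resolving the service name inside the container first (a sketch, assuming nslookup is available in the image):

docker exec <containerIdofWeb> nslookup api
# if the name resolves to a VIP but ping/curl still time out, DNS-based
# discovery works and the overlay data path itself is at fault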
@daschott It was a bit more complex than I thought, but here is the simplest scenario I could come up with where it breaks. It needs two networks and a reboot, but then it reliably breaks for me with newer Azure VM images while it still works with those from April 2020. Here are my repro steps:
# Variables for common values
$resourceGroup = "repro-swarm-bug"
$location = "westeurope"
$mgrName = "mgr"
$workerName = "worker"
$cred = New-Object System.Management.Automation.PSCredential ("vmadmin", (ConvertTo-SecureString -String "Passw0rd*123" -AsPlainText -Force))

New-AzResourceGroup -Name $resourceGroup -Location $location

$subnetConfig = New-AzVirtualNetworkSubnetConfig -Name mySubnet -AddressPrefix 10.0.3.0/24
$vnet = New-AzVirtualNetwork -ResourceGroupName $resourceGroup -Location $location -Name MYvNET -AddressPrefix 10.0.3.0/24 -Subnet $subnetConfig

$pipMgr = New-AzPublicIpAddress -ResourceGroupName $resourceGroup -Location $location -Name "mgr$(Get-Random)" -AllocationMethod Static -IdleTimeoutInMinutes 4 -DomainNameLabel "mgr$(Get-Random)"
$pipWorker = New-AzPublicIpAddress -ResourceGroupName $resourceGroup -Location $location -Name "worker$(Get-Random)" -AllocationMethod Static -IdleTimeoutInMinutes 4 -DomainNameLabel "worker$(Get-Random)"

$nsgRuleRDP = New-AzNetworkSecurityRuleConfig -Name myNetworkSecurityGroupRuleRDP -Protocol Tcp -Direction Inbound -Priority 1000 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 3389 -Access Allow
$nsgRuleHTTPS = New-AzNetworkSecurityRuleConfig -Name myNetworkSecurityGroupRuleHttps -Protocol Tcp -Direction Inbound -Priority 1010 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 443 -Access Allow
$nsgMgr = New-AzNetworkSecurityGroup -ResourceGroupName $resourceGroup -Location $location -Name myNetworkSecurityGroupMgr -SecurityRules $nsgRuleRDP,$nsgRuleHTTPS
$nsgWorker = New-AzNetworkSecurityGroup -ResourceGroupName $resourceGroup -Location $location -Name myNetworkSecurityGroupWorker -SecurityRules $nsgRuleRDP,$nsgRuleHTTPS

$nicMgr = New-AzNetworkInterface -Name myNicMgr -ResourceGroupName $resourceGroup -Location $location -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $pipMgr.Id -NetworkSecurityGroupId $nsgMgr.Id
$nicWorker = New-AzNetworkInterface -Name myNicWorker -ResourceGroupName $resourceGroup -Location $location -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $pipWorker.Id -NetworkSecurityGroupId $nsgWorker.Id

$vmConfigMgr = New-AzVMConfig -VMName $mgrName -VMSize Standard_D4d_v4 |
    Set-AzVMOperatingSystem -Windows -ComputerName $mgrName -Credential $cred |
    Set-AzVMSourceImage -PublisherName MicrosoftWindowsServer -Offer WindowsServer -Skus 2019-datacenter-core-with-containers -Version latest |
    Add-AzVMNetworkInterface -Id $nicMgr.Id
$vmConfigWorker = New-AzVMConfig -VMName $workerName -VMSize Standard_D4d_v4 |
    Set-AzVMOperatingSystem -Windows -ComputerName $workerName -Credential $cred |
    Set-AzVMSourceImage -PublisherName MicrosoftWindowsServer -Offer WindowsServer -Skus 2019-datacenter-core-with-containers -Version latest |
    Add-AzVMNetworkInterface -Id $nicWorker.Id

New-AzVM -ResourceGroupName $resourceGroup -Location $location -VM $vmConfigMgr
New-AzVM -ResourceGroupName $resourceGroup -Location $location -VM $vmConfigWorker
4. connect to the mgr VM and run the following commands to validate that the relevant KB is installed, set up the firewall, and init the swarm
powershell
get-hotfix
New-NetFirewallRule -DisplayName "Allow Swarm TCP" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 2377, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow Swarm UDP" -Direction Inbound -Action Allow -Protocol UDP -LocalPort 4789, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow HTTPS" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 443 | Out-Null
$content = Invoke-WebRequest -H @{ Metadata = "true" } http://169.254.169.254/metadata/instance?api-version=2017-04-02 -UseBasicParsing
$json = ConvertFrom-Json $content
$ipaddress = $json.network.interface.ipv4.ipAddress.privateIpAddress
Write-Host "Found IP address $ipaddress"
Invoke-Expression "docker swarm init --advertise-addr $ipaddress --default-addr-pool 10.10.0.0/16"
5. copy the join command
6. go to the worker VM and run the following commands to validate that the relevant KB is installed, set up the firewall, and connect to the swarm; your join token will of course be different
powershell
get-hotfix
New-NetFirewallRule -DisplayName "Allow Swarm TCP" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 2377, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow Swarm UDP" -Direction Inbound -Action Allow -Protocol UDP -LocalPort 4789, 7946 | Out-Null
docker swarm join --token SWMTKN-1-4y29h72magt5i6g4nwh3099dy6empntfu7ztm3ynljbrbq6q9z-dluq3j2vwvzlsmxw8j4be9jje 10.0.3.4:2377
7. go back to the mgr and validate that the worker has joined
docker node ls
8. create an overlay network
docker network create --driver=overlay overlay
9. validate that a directly connected IIS container can be reached: run the following command and then access port 443 with http (!) on the mgr VM from your laptop. This should give you the standard IIS page with the blue rectangles and proves that no firewall or anything else is blocking access to port 443
docker run -p 443:80 -d --rm --name iis mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
10. remove the directly connected container
docker rm -f iis
11. run the following to work around a traefik bug, create an overlay network and create a file docker-compose.yml which describes a swarm setup with traefik, portainer and portainer agents. Replace $externaldns with the external DNS name of the mgr and $email with your email address
New-Item -Path c:\le -ItemType Directory | Out-Null
New-Item -Path c:\le\acme.json | Out-Null
docker network create --driver=overlay traefik-public

@"
version: '3.7'

services:
  traefik:
    image: traefik:2.2-windowsservercore-1809
    command:
      - --api.dashboard=true
      - --providers.docker.swarmMode=true
      - --providers.docker.network=traefik-public
      - --providers.docker.exposedbydefault=false
      - --providers.docker.endpoint=npipe:////./pipe/docker_engine
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.myresolver.acme.tlschallenge=true
      - --certificatesresolvers.myresolver.acme.email=$email
      - --certificatesresolvers.myresolver.acme.storage=c:/le/acme.json
      - --serversTransport.insecureSkipVerify=true
    volumes:
      - source: 'C:/le'
        target: 'C:/le'
        type: bind
      - source: '\\.\pipe\docker_engine'
        target: '\\.\pipe\docker_engine'
        type: npipe
    ports:
      - 443:443
    deploy:
      placement:
        constraints:
          - node.role == manager
      labels:
        - traefik.enable=false
        - traefik.http.routers.api.entrypoints=websecure
        - traefik.http.routers.api.tls.certresolver=myresolver
        - traefik.http.routers.api.rule=Host(``$externaldns``) && (PathPrefix(``/api``) || PathPrefix(``/dashboard``))
        - traefik.http.routers.api.service=api@internal
        - traefik.http.services.api.loadBalancer.server.port=8080
    networks:
      - traefik-public

  agent:
    image: portainer/agent:windows1809-amd64
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
    volumes:
      # volume entries were garbled in the original post; the docker engine
      # named pipe mount below is the usual setup for the Windows agent
      - source: '\\.\pipe\docker_engine'
        target: '\\.\pipe\docker_engine'
        type: npipe
    networks:
      - agent-network
    deploy:
      mode: global
      placement:
        constraints:
          # constraint garbled in the original post; assumed value
          - node.platform.os == windows

  portainer:
    image: portainer/portainer:windows1809-amd64
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    volumes:
      # volume mapping garbled in the original post; assumed value
      - portainer-data:C:\data
    networks:
      - agent-network
      - traefik-public
    deploy:
      labels:
        # only the routing rule fragments survived in the original post;
        # the remaining labels are assumed to mirror the traefik service
        - traefik.enable=true
        - traefik.http.routers.portainer.entrypoints=websecure
        - traefik.http.routers.portainer.tls.certresolver=myresolver
        - traefik.http.routers.portainer.rule=Host(``$externaldns``) && PathPrefix(``/portainer/``)

networks:
  agent-network:
    attachable: true
  traefik-public:
    external: true

volumes:
  portainer-data:
"@ | Out-File docker-compose.yml
12. deploy that as a stack
docker stack deploy -c .\docker-compose.yml mystack
13. validate that you can reach portainer using https://< external dns name >/portainer/
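This can also be checked from a shell instead of a browser (a sketch; -k skips certificate validation in case the Let's Encrypt certificate has not been issued yet):

curl.exe -k https://<external dns name>/portainer/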
14. reboot the machine
restart-computer -force
15. try to reach portainer again, which for me fails, so the reboot somehow breaks this
16. remove the stack and try the directly connected container again, which should still work
docker stack rm mystack
docker run -p 443:80 -d --rm --name iis mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
17. remove the directly connected container and try the stack deployment again
docker rm -f iis
docker stack deploy -c .\docker-compose.yml mystack
18. try to reach portainer, which for me now works again, so the directly connected container somehow seems to fix it
19. reboot once more to break it and then try to remove the stack and deploy it again
docker stack rm mystack
docker stack deploy -c .\docker-compose.yml mystack
For me this still fails, so removing and re-deploying the stack doesn't fix it
In order to validate that this is caused by the fix (or at least one of them), do the exact same, but instead of using `-Version latest` in the two Set-AzVMSourceImage calls of the first script (step 3), use `-Version "17763.1158.2004131759"`. In this case, step 15 doesn't break
Let me know if you need anything else or the steps don't work for you
Do you mean by hotfix the 2020-06 cumulative update?
@daschott I posted a failing scenario here but deleted it since I need to revise it a bit. I'll have a failing scenario shortly...
@tfenster @daschott This is the scenario that I just tested. It reproduces the problem with KB4551853 installed. It does seem to have to do with container-to-container communication in a swarm network:
On Windows Server 2019 (any flavor) with KB4551853 installed, install docker:
What we will do is run two simple containers. Simpleweb when hit will return a simple HTML page. Simpleweb2 when hit will internally fetch http://simpleweb and report back the simple web page. We will do this first outside of swarm to verify this all works:
docker-compose.yaml:
version: '3.7'
services:
  simpleweb:
    image: "djarvis8/simpleweb-win:latest"
  simpleweb2:
    image: "djarvis8/simpleweb-win:latest"
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: ingress
    environment:
      Url: http://simpleweb
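To run the non-swarm test, a sketch (assuming the file above is saved as docker-compose.yaml and Docker Compose is installed):

docker-compose up -d

Then from a remote web browser or curl, pull down the page on port 80 of the host; the response should be: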
Fetching http://simpleweb
Result: OK
Content: You have reached this simple web page
Now we will launch the container within a swarm and watch it fail:
* docker swarm init --advertise-addr={{IP ADDRESS}}
* docker stack deploy -c simple.yaml testapp
simple.yaml:
version: '3.8'
services:
  simpleweb:
    image: "djarvis8/simpleweb-win:latest"
    networks:
      - testnet
  simpleweb2:
    image: "djarvis8/simpleweb-win:latest"
    networks:
      - testnet
    # ports and environment were garbled in the original post; assumed to
    # match the docker-compose.yaml above
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: ingress
    environment:
      Url: http://simpleweb
networks:
  testnet:
    driver: overlay
    attachable: true
* Again from a remote web browser or curl, try to pull down the web page. You should see failure:
Fetching http://simpleweb Exception: One or more errors occurred. (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
Is there a workaround without removing the update? Our internal company security policy prevents me from uninstalling KB4551853.
@daschott Any feedback on this? Can you repro the issue now?
@tfenster To give an update, the product group is still actively working together with support teams to analyze repro traces collected from an environment.
@xargon180 What is the scenario you are targeting and looking for a workaround? Depending on the scenario, you may be able to use hostPort publishing mode. But I would need to hear more to confirm.
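For reference, host-mode publishing in a compose file looks like this (a sketch; it binds the published port directly on each node and bypasses the ingress routing mesh, which is why it can sidestep the mesh for node-local access):

ports:
  - target: 80
    published: 80
    protocol: tcp
    mode: host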
@daschott Thanks for the tip. Unfortunately my services could still not communicate with each other in host mode. As a temporary measure I rolled out my stack with docker-compose on a single server. I hope that there will be a real solution soon :pray:.
@xargon180 Same situation here, we're going out with docker-compose as we see no other option right now.
@daschott was the repro description above of any value?
I'm afraid this is also happening with KB4561608.
I had the same issue again, services not reachable from other services.
Noticed this KB was installed. Uninstalled, started working. They are both Cumulative Updates, from May and June.
Confirmed, nothing could get in. Removed this cumulative update and boom, started working. I will restore to a fresh Windows install adding cumulative updates starting from April going backwards until I find which cumulative update broke it.
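For anyone repeating that bisection, listing the installed updates before and after each step helps keep track (a sketch):

Get-HotFix | Sort-Object InstalledOn | Format-Table HotFixID, Description, InstalledOn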
Thanks everyone for the repro descriptions. Team is still investigating this issue. Please expect an update early next week.
There was a regression found in the port pool optimizations that shipped with the cumulative updates.
Unfortunately, the fix for this regression is not available yet on any public Windows releases. Microsoft is working on generating a new patch for Windows Server 2019. ETA for creating this patch is ~1 week.
@djarvis Please reach out to Microsoft customer support to receive this patch as soon as it's ready for sharing.
awesome! looking forward to being able to get our servers all updated with latest patches again :)
@daschott Awesome, sounds good!
The patch has been generated and is ready for sharing with impacted users that request it. Please reach out to your Microsoft customer service contact to ask for a patch containing "Separate SNAT pools for each source VIP".
Once there is enough confidence and confirmation from users that the patch indeed resolves this regression, I will share which cumulative update would include this and when it will be released.
Any news? Has anyone tested the secret patch yet?
@xargon180 We're currently testing it out.
Can someone please advise when we should expect this to be in the public domain? It basically renders production servers and services offline. I ended up rebuilding my Windows hosts with an old ISO and blocking Windows updates to work around this.
yeah, this is becoming a bigger problem given that our servers haven't been patchable since April! @daschott do we have a patch number or reference? Our operations team haven't been able to get hold of this patch from our support partners yet.
@daschott The patch works for us: our containers are able to communicate again.
@djarvis hi, please tell me how to get this patch?
For me, installing the cumulative update KB4558998, restarting the nodes and rotating the roles solved the problem.
@immon4ik For me, patch KB4558998 makes no difference. Swarm still does not work. @daschott Is there an official statement about which patch the bug fix is included in?
@xargon180 No publicly available patch has the fix included yet. Only private patch that can be requested through Microsoft Support.
The tentative ETA for the cumulative update (public patch) is 15th September 2020.
Any idea whether this patch has been released in the September cumulative update KB4570333? https://support.microsoft.com/en-us/help/4570333/windows-10-update-kb4570333 does not say anything about Docker.
I was hopeful too, but I just spun up 2 VMs in Azure, applied the September patches, and still can't connect multiple services together :( Pretty disappointing this is taking so long, given that this is a fundamental regression for Docker Swarm on a Windows Server 2019 environment, something the MS docs say is still supported.
Windows Server 2019 running Docker/Swarm, ingress network was working fine until this was installed:
2020-05 Cumulative Update for Windows Server 2019 (1809) for x64-based Systems (KB4551853)
This broke something with the ingress network such that no traffic could enter through any exposed/published ports, either from the local machine or remotely.
Uninstalling this update made it all work again.
Deploying containers on the nat network worked fine. Access from the containers to the external network remained working in a docker swarm, but no traffic could come in.