moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Windows Server 2019 KB4551853 Update breaks Docker Swarm Ingress network #40998

Closed: djarvis closed this issue 4 years ago

djarvis commented 4 years ago

Windows Server 2019 running Docker/Swarm, ingress network was working fine until this was installed:

2020-05 Cumulative Update for Windows Server 2019 (1809) for x64-based Systems (KB4551853)

This broke something with the ingress network such that no traffic could enter through any exposed/published ports, either from the local machine or remotely.

Uninstalling this update made it all work again.

Deploying containers on the nat network worked fine. Access from the containers to the external network remained working in a docker swarm, but no traffic could come in.
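For anyone triaging the same symptom, a quick way to confirm whether this update is present is to query it directly. This is a sketch; as later comments in this thread note, `wusa` may refuse the uninstall on newer images:

```powershell
# Check whether the 2020-05 cumulative update is installed
Get-HotFix -Id KB4551853

# Attempt to remove it (requires elevation; a reboot follows)
wusa /uninstall /kb:4551853
```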

simmohall commented 4 years ago

Yep, we just had the same problem. Our services within the swarm couldn't communicate with each other either. :( :(

veepee78 commented 4 years ago

I had issues already with 2020-04 cumulative update.

Currently my swarm status is: containers using the overlay network can contact services running in our network, like MQ and DB, but if you go inside a container and try to use tracert or ping, they can't connect to anything!? Also, if I expose a port from a container to the host, it does not work.

The ingress network is somehow messed up. It does not help to delete all HNS networks and let Docker create them again; the same situation continues.

Nat network works.
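For reference, the HNS reset described above can be scripted. This is a sketch that assumes the hns.psm1 helper module from the microsoft/SDN repository is available; as reported here, it did not resolve this particular issue, but it is a common first step when HNS state is suspect:

```powershell
# Sketch of "delete all hns networks and let docker create them again".
# Assumes hns.psm1 from https://github.com/microsoft/SDN (Kubernetes/windows).
Import-Module .\hns.psm1

Stop-Service docker
# Remove every HNS network except the default 'nat' network
Get-HnsNetwork | Where-Object { $_.Name -ne 'nat' } | Remove-HnsNetwork
Start-Service docker   # Docker recreates ingress/overlay networks on start
```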

alexftr commented 4 years ago

Same problem for us. Containers can't communicate with each other. @djarvis How did you uninstall that package? I tried `wusa /uninstall /kb:4551853` but it said "...required by your computer and cannot be uninstalled".

avinhan commented 4 years ago

@alexftr we have the same issue with a couple of our environments in Azure. We had the issue happen to one of our dev environments a week back. But at that time we were able to uninstall it. Today it happened in our QA environments, but now we cannot uninstall it. We get the following error:

[screenshot of the uninstall error message]

Did anyone manage to fix the Swarm network issue?

djarvis commented 4 years ago

@alexftr Same story as @avinhan, we were able to freely uninstall it a few weeks ago but now we are not. Looking for a fix to this or a way to rip out this update that seems to now be cemented in.

jabteles commented 4 years ago

Having the same issue. Uninstalled, and now I can access other services by name from other services again.

djarvis commented 4 years ago

@jabteles How did you uninstall the update?

EagleIJoe commented 4 years ago

Uninstalling with the above wusa command did the trick on my workers.

FWIW: it is not only Swarm-related. I had issues with my Kubernetes / flannel / Docker workers.

jabteles commented 4 years ago

> @jabteles How did you uninstall the update?

Start > type "update" > click "Windows Update Settings" > click "View update history" > click "Uninstall updates" > filter "KB4551853" in the search field > select the update and click Uninstall

thaJeztah commented 4 years ago

ping @daschott @dperny @taylorb-microsoft ptal

djarvis commented 4 years ago

> Start > type "update" > click "Windows Update Settings" > click "View update history" > click "Uninstall updates" > filter "KB4551853" in the search field > select the update and click Uninstall

Ok. Yes. I was able to do that on one machine a few weeks ago, but now and on other machines it seems the update cannot be uninstalled so easily.

jabteles commented 4 years ago

> Ok. Yes. I was able to do that on one machine a few weeks ago, but now and on other machines it seems the update cannot be uninstalled so easily.

What is the error? Maybe a reboot is necessary before?

djarvis commented 4 years ago

> What is the error? Maybe a reboot is necessary before?

There is no "uninstall" link in the UI, and this command:

wusa /uninstall /kb:4551853

comes up with some "...required by your computer and cannot be uninstalled" message.

Apparently if you create a Windows Server 2019 VM in Azure today the KB is already installed and is unable to be removed. On VMs created months ago I was able to uninstall this update and prevent it from being installed.
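For machines where the update can still be removed, one way to keep Windows Update from reinstalling it is to hide the KB. This is a sketch that assumes the community PSWindowsUpdate module (from the PowerShell Gallery) is acceptable in your environment:

```powershell
# Hide KB4551853 so Windows Update does not reinstall it after removal.
# PSWindowsUpdate is a third-party module; install it first.
Install-Module PSWindowsUpdate -Force
Hide-WindowsUpdate -KBArticleID KB4551853 -Confirm:$false
```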

jabteles commented 4 years ago

> There is no "uninstall" link in the UI, and this command:
>
> wusa /uninstall /kb:4551853
>
> comes up with some "...required by your computer and cannot be uninstalled" message.
>
> Apparently if you create a Windows Server 2019 VM in Azure today the KB is already installed and is unable to be removed. On VMs created months ago I was able to uninstall this update and prevent it from being installed.

I understand; I can't help with that. I'm using on-premises VMs and there it was possible.

daschott commented 4 years ago

Thanks for reporting this issue, I will raise this issue with the team for further investigation. If someone has a repro and could be kind enough to run https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 and share the service/container names, that would be helpful in accelerating resolution.
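For convenience, the log collector linked above can be fetched and run like this (the raw.githubusercontent.com URL below is the raw form of the linked script):

```powershell
# Download and run Microsoft's HNS/network debug log collector
Invoke-WebRequest -UseBasicParsing `
  -Uri "https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1" `
  -OutFile collectlogs.ps1
.\collectlogs.ps1   # collects HNS, endpoint and network state into a log folder
```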

alexftr commented 4 years ago

@djarvis @avinhan Yes, I couldn't uninstall that update. I had to recreate the VMs in Azure from a snapshot from a month ago. This was the only option for me.

avinhan commented 4 years ago

Hi @daschott any updates on a resolution / workaround?

jabteles commented 4 years ago

@avinhan the workaround is to uninstall the update. Worked for me.

Edit: I'm sorry, I didn't see that you couldn't uninstall the update.

tfenster commented 4 years ago

@daschott are you still looking for debug logs? I have a fully automated repro

daschott commented 4 years ago

@tfenster yes please. Unfortunately this is not reproducing for me and others who have tried on a Windows Server 2019 machine with the latest hotfix. Perhaps someone can share a sample .yaml file as well, in case there is something different in the configuration?

tfenster commented 4 years ago

@daschott I tried to narrow it down to a simpler scenario, and there it doesn't break anymore. I'll need to expand it towards my more complex scenario to find out where it breaks. If you want, you can clone https://github.com/tfenster/bc-swarm and then follow https://raw.githubusercontent.com/tfenster/BC-Swarm/master/BC/howto.md, which definitely breaks. I'll try to simplify that, but it will probably take me 24-48h.

simmohall commented 4 years ago

@daschott I can reproduce this morning by spinning up 2 Azure VMs (`--image Win2019Datacenter`). Create the swarm (`docker swarm init`, then `join`), then deploy a dummy stack using the compose file below, saved as test.yml:

docker stack deploy --compose-file test.yml test

Then try to ping or curl one service from the other; both commands fail/time out:

docker exec <containerIdofWeb> ping api
docker exec <containerIdofWeb> curl http://api

test.yml:
version: "3.7"
services:
  web:
    image: mcr.microsoft.com/dotnet/core/samples:aspnetapp-nanoserver-1809
    ports:
      - "80:80"
    deploy:
      mode: replicated
      restart_policy:
        condition: on-failure
  api:
    image: mcr.microsoft.com/dotnet/core/samples:aspnetapp-nanoserver-1809
    ports:
      - "8080:80"
    deploy:
      mode: replicated
      restart_policy:
        condition: on-failure

tfenster commented 4 years ago

@daschott It was a bit more complex than I thought, but here is the simplest scenario I could come up with where it breaks. It needs two networks and a reboot, but then it reliably breaks for me with newer Azure VM images while it still works with those from April 2020. Here are my repro steps:

  1. go to shell.azure.com
  2. switch to PowerShell
  3. run the following script to create a manager and a worker vm, based on https://docs.microsoft.com/en-us/azure/virtual-machines/scripts/virtual-machines-windows-powershell-sample-create-vm
    
    # Variables for common values
    $resourceGroup = "repro-swarm-bug"
    $location = "westeurope"
    $mgrName = "mgr"
    $workerName = "worker"

    # Create user object
    $cred = New-Object System.Management.Automation.PSCredential ("vmadmin", (ConvertTo-SecureString -String "Passw0rd*123" -AsPlainText -Force))

    # Create a resource group
    New-AzResourceGroup -Name $resourceGroup -Location $location

    # Create a subnet configuration
    $subnetConfig = New-AzVirtualNetworkSubnetConfig -Name mySubnet -AddressPrefix 10.0.3.0/24

    # Create a virtual network
    $vnet = New-AzVirtualNetwork -ResourceGroupName $resourceGroup -Location $location -Name MYvNET -AddressPrefix 10.0.3.0/24 -Subnet $subnetConfig

    # Create public IP addresses and specify DNS names
    $pipMgr = New-AzPublicIpAddress -ResourceGroupName $resourceGroup -Location $location -Name "mgr$(Get-Random)" -AllocationMethod Static -IdleTimeoutInMinutes 4 -DomainNameLabel "mgr$(Get-Random)"
    $pipWorker = New-AzPublicIpAddress -ResourceGroupName $resourceGroup -Location $location -Name "worker$(Get-Random)" -AllocationMethod Static -IdleTimeoutInMinutes 4 -DomainNameLabel "worker$(Get-Random)"

    # Create inbound network security group rules for ports 3389 and 443
    $nsgRuleRDP = New-AzNetworkSecurityRuleConfig -Name myNetworkSecurityGroupRuleRDP -Protocol Tcp -Direction Inbound -Priority 1000 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 3389 -Access Allow
    $nsgRuleHTTPS = New-AzNetworkSecurityRuleConfig -Name myNetworkSecurityGroupRuleHttps -Protocol Tcp -Direction Inbound -Priority 1010 -SourceAddressPrefix * -SourcePortRange * -DestinationAddressPrefix * -DestinationPortRange 443 -Access Allow

    # Create network security groups
    $nsgMgr = New-AzNetworkSecurityGroup -ResourceGroupName $resourceGroup -Location $location -Name myNetworkSecurityGroupMgr -SecurityRules $nsgRuleRDP,$nsgRuleHTTPS
    $nsgWorker = New-AzNetworkSecurityGroup -ResourceGroupName $resourceGroup -Location $location -Name myNetworkSecurityGroupWorker -SecurityRules $nsgRuleRDP,$nsgRuleHTTPS

    # Create virtual network cards and associate with public IP addresses and NSGs
    $nicMgr = New-AzNetworkInterface -Name myNicMgr -ResourceGroupName $resourceGroup -Location $location -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $pipMgr.Id -NetworkSecurityGroupId $nsgMgr.Id
    $nicWorker = New-AzNetworkInterface -Name myNicWorker -ResourceGroupName $resourceGroup -Location $location -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $pipWorker.Id -NetworkSecurityGroupId $nsgWorker.Id

    # Create the virtual machine configurations
    $vmConfigMgr = New-AzVMConfig -VMName $mgrName -VMSize Standard_D4d_v4 | Set-AzVMOperatingSystem -Windows -ComputerName $mgrName -Credential $cred | Set-AzVMSourceImage -PublisherName MicrosoftWindowsServer -Offer WindowsServer -Skus 2019-datacenter-core-with-containers -Version latest | Add-AzVMNetworkInterface -Id $nicMgr.Id
    $vmConfigWorker = New-AzVMConfig -VMName $workerName -VMSize Standard_D4d_v4 | Set-AzVMOperatingSystem -Windows -ComputerName $workerName -Credential $cred | Set-AzVMSourceImage -PublisherName MicrosoftWindowsServer -Offer WindowsServer -Skus 2019-datacenter-core-with-containers -Version latest | Add-AzVMNetworkInterface -Id $nicWorker.Id

    # Create the virtual machines
    New-AzVM -ResourceGroupName $resourceGroup -Location $location -VM $vmConfigMgr
    New-AzVM -ResourceGroupName $resourceGroup -Location $location -VM $vmConfigWorker

4. connect to the mgr vm and run the following commands to validate that the relevant KB is installed, set up the firewall and init the swarm

get-hotfix
New-NetFirewallRule -DisplayName "Allow Swarm TCP" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 2377, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow Swarm UDP" -Direction Inbound -Action Allow -Protocol UDP -LocalPort 4789, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow HTTPS" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 443 | Out-Null

$content = Invoke-WebRequest -H @{ Metadata = "true" } http://169.254.169.254/metadata/instance?api-version=2017-04-02 -UseBasicParsing
$json = ConvertFrom-Json $content
$ipaddress = $json.network.interface.ipv4.ipAddress.privateIpAddress
Write-Host "Found IP address $ipaddress"
Invoke-Expression "docker swarm init --advertise-addr $ipaddress --default-addr-pool 10.10.0.0/16"

5. copy the join command
6. go to the worker vm and run the following commands to validate that the relevant KB is installed, set up the firewall and connect to the swarm; of course your join token will be different

get-hotfix
New-NetFirewallRule -DisplayName "Allow Swarm TCP" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 2377, 7946 | Out-Null
New-NetFirewallRule -DisplayName "Allow Swarm UDP" -Direction Inbound -Action Allow -Protocol UDP -LocalPort 4789, 7946 | Out-Null

docker swarm join --token SWMTKN-1-4y29h72magt5i6g4nwh3099dy6empntfu7ztm3ynljbrbq6q9z-dluq3j2vwvzlsmxw8j4be9jje 10.0.3.4:2377

7. go back to the mgr and validate that the worker has joined

docker node ls

8. create an overlay network

docker network create --driver=overlay overlay

9. validate that a directly connected IIS container can be reached: run the following command, then access port 443 with http (!) on the mgr VM from your laptop. This should give you the standard IIS page with the blue rectangles and proves that no firewall or anything else is blocking access to port 443

docker run -p 443:80 -d --rm --name iis mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019

10. remove the directly connected container

docker rm -f iis

11. run the following to work around a traefik bug, create an overlay network and create a file docker-compose.yml which describes a swarm setup with traefik, portainer and portainer agents. Replace $externaldns with the external DNS name of the mgr and $email with your email address

New-Item -Path c:\le -ItemType Directory | Out-Null
New-Item -Path c:\le\acme.json | Out-Null
docker network create --driver=overlay traefik-public

@"
version: '3.7'

services:
  traefik:
    image: traefik:2.2-windowsservercore-1809
    command:
      - --log.level=DEBUG
      - --api.dashboard=true
      - --providers.docker.swarmMode=true
      - --providers.docker.network=traefik-public
      - --providers.docker.exposedbydefault=false
      - --providers.docker.endpoint=npipe:////./pipe/docker_engine
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.myresolver.acme.tlschallenge=true
      - --certificatesresolvers.myresolver.acme.email=$email
      - --certificatesresolvers.myresolver.acme.storage=c:/le/acme.json
      - --serversTransport.insecureSkipVerify=true
    volumes:
      - source: 'C:/le'
        target: 'C:/le'
        type: bind
      - source: '\\.\pipe\docker_engine'
        target: '\\.\pipe\docker_engine'
        type: npipe
    ports:
      - 443:443
    deploy:
      placement:
        constraints:
          - node.role == manager
      labels:
        - traefik.enable=false
        - traefik.http.routers.api.entrypoints=websecure
        - traefik.http.routers.api.tls.certresolver=myresolver
        - traefik.http.routers.api.rule=Host(``$externaldns``) && (PathPrefix(``/api``) || PathPrefix(``/dashboard``))
        - traefik.http.routers.api.service=api@internal
        - traefik.http.services.api.loadBalancer.server.port=8080
    networks:
      - traefik-public

  agent:
    image: portainer/agent:windows1809-amd64
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
    volumes:

networks:
  agent-network:
    attachable: true
  traefik-public:
    external: true

volumes:
  portainer-data:
"@ | Out-File docker-compose.yml

12. deploy that as stack

docker stack deploy -c .\docker-compose.yml mystack

13. validate that you can reach portainer using https://< external dns name >/portainer/
14. reboot the machine

restart-computer -force

15. try to reach portainer again, which for me fails, so the reboot somehow breaks this
16. remove the stack and try the directly connected container again, which should still work

docker stack rm mystack
docker run -p 443:80 -d --rm --name iis mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019

17. remove the directly connected container and try the stack deployment again

docker rm -f iis
docker stack deploy -c .\docker-compose.yml mystack

18. try to reach portainer, which for me now works again, so the directly connected container somehow seems to fix it
19. reboot once more to break it and then try to remove the stack and deploy it again

docker stack rm mystack
docker stack deploy -c .\docker-compose.yml mystack



This still fails for me, so removing and redeploying the stack doesn't seem to help.

In order to validate that this is caused by the fix (or at least one of them), do the exact same steps, but instead of using `-Version latest` in the two `Set-AzVMSourceImage` calls of the first script (step 3), use `-Version "17763.1158.2004131759"`. In this case, step 15 doesn't break.

Let me know if you need anything else or if the steps don't work for you.

veepee78 commented 4 years ago

> @tfenster yes please. Unfortunately this is not reproducing for me and others that have tried on a Windows Server 2019 machine with latest hotfix. Perhaps can someone share a sample .yaml file as well in case there is something different in the configuration?

Do you mean by hotfix the 2020-06 cumulative update?

djarvis commented 4 years ago

@daschott I posted a failing scenario here but deleted it since I need to revise it a bit. I'll have a failing scenario shortly...

djarvis commented 4 years ago

@tfenster @daschott This is the scenario that I just tested. It reproduces the problem with KB4551853 installed. It does seem to have to do with container-to-container communication in a swarm network:

On Windows Server 2019 (any flavor) with KB4551853 installed, install Docker.

We will run two simple containers. simpleweb, when hit, returns a simple HTML page. simpleweb2, when hit, internally fetches http://simpleweb and reports back the simple web page. We do this first outside of swarm to verify that this all works:

docker-compose.yaml:

version: '3.7'
services:      

  simpleweb:
    image: "djarvis8/simpleweb-win:latest"

  simpleweb2:
    image: "djarvis8/simpleweb-win:latest"
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: ingress
    environment:
      Url: http://simpleweb

Fetching http://simpleweb
Result: OK
Content: You have reached this simple web page


Now we will launch the containers within a swarm and watch them fail:

* docker swarm init --advertise-addr={{IP ADDRESS}}
* docker stack deploy -c simple.yaml testapp

simple.yaml:

version: '3.8'
services:

  simpleweb:
    image: "djarvis8/simpleweb-win:latest"
    networks:
      - testnet

networks:
  testnet:
    driver: overlay
    attachable: true


* Again, from a remote web browser or curl, try to pull down the web page. You should see a failure:

Fetching http://simpleweb
Exception: One or more errors occurred. (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)

xargon180 commented 4 years ago

Is there a workaround without removing the update? Our internal company security policy prevents me from uninstalling KB4551853.

tfenster commented 4 years ago

@daschott Any feedback on this? Can you repro the issue now?

daschott commented 4 years ago

@tfenster To give an update, the product group is still actively working together with support teams to analyze repro traces collected from an environment.

@xargon180 What is the scenario you are targeting and looking for a workaround for? Depending on the scenario, you may be able to use hostPort publishing mode, but I would need to hear more to confirm.
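For readers wondering what the suggested workaround looks like, this is a hedged sketch of host-mode port publishing in a compose file (service name and image borrowed from simmohall's repro above). Host mode binds the published port on each node that runs a task and bypasses the ingress routing mesh, so you lose the mesh's load balancing:

```yaml
version: "3.7"
services:
  web:
    image: mcr.microsoft.com/dotnet/core/samples:aspnetapp-nanoserver-1809
    ports:
      - target: 80      # container port
        published: 80   # port bound directly on the node
        protocol: tcp
        mode: host      # bypass the (broken) ingress mesh
```

As the following comments note, this only affects inbound publishing; it does not help service-to-service traffic on the overlay network.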

xargon180 commented 4 years ago

@daschott Thanks for the tip. Unfortunately my services still could not communicate with each other in host mode. Temporarily I have rolled out my stack with docker-compose to a single server. I hope there will be a real solution soon :pray:.

djarvis commented 4 years ago

@xargon180 Same situation here, we're shipping with docker-compose as we see no other option right now.

@daschott was the repro description above of any value?

jabteles commented 4 years ago

I'm afraid this is also happening with KB4561608.

I had the same issue again, services not reachable from other services.

Noticed this KB was installed. Uninstalled, started working. They are both Cumulative Updates, from May and June.

mpnewcomb commented 4 years ago

> I'm afraid this is also happening with KB4561608.
>
> I had the same issue again, services not reachable from other services.
>
> Noticed this KB was installed. Uninstalled, started working. They are both Cumulative Updates, from May and June.

Confirmed, nothing could get in. Removed this cumulative update and boom, it started working. I will restore to a fresh Windows install and add cumulative updates starting from April, going backwards, until I find which cumulative update broke it.

mpnewcomb commented 4 years ago

KB4550969 (April 21, 2020, OS Build 17763.1192) broke swarm ingress. I uninstalled this update and it went back to working. KB4549949 (April 14, 2020, OS Build 17763.1158) is confirmed working.

daschott commented 4 years ago

Thanks everyone for the repro descriptions. The team is still investigating this issue. Please expect an update early next week.

daschott commented 4 years ago

There was a regression found in the port pool optimizations that shipped with the cumulative updates.

Unfortunately, the fix for this regression is not available yet on any public Windows releases. Microsoft is working on generating a new patch for Windows Server 2019. ETA for creating this patch is ~1 week.

@djarvis Please reach out to Microsoft customer support to receive this patch as soon as it's ready for sharing.

simmohall commented 4 years ago

awesome! looking forward to being able to get our servers all updated with latest patches again :)

djarvis commented 4 years ago

@daschott Awesome, sounds good!

daschott commented 4 years ago

The patch has been generated and is ready for sharing with impacted users that request it. Please reach out to your Microsoft customer service contact to ask for a patch containing "Separate SNAT pools for each source VIP".

Once there is enough confidence and confirmation from users that the patch indeed resolves this regression, I will share which cumulative update would include this and when it will be released.

xargon180 commented 4 years ago

Any news? Has anyone tested the secret patch yet?

djarvis commented 4 years ago

@xargon180 We're currently testing it out.

ncresswell commented 4 years ago

Can someone please advise when we should expect this fix to be publicly available? It basically renders production servers and services offline. I ended up rebuilding my Windows hosts with an old ISO and blocking Windows updates to work around this.

simmohall commented 4 years ago

Yeah, this is becoming a bigger problem given that our servers haven't been able to be patched since April! @daschott do we have a patch number or reference? Our operations team hasn't been able to get hold of this patch from our support partners yet.

djarvis commented 4 years ago

@daschott The patch works for us: our containers are able to communicate again.

immon4ik commented 4 years ago

@djarvis hi, please tell me how to get this patch?

immon4ik commented 4 years ago

For me, installing the cumulative update KB4558998, restarting the nodes and rotating the roles solved the problem.
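A sketch of one reading of "rotating the roles" in a two-node swarm; the node names `mgr` and `worker` are placeholders, and this is an interpretation of the step, not a confirmed fix:

```powershell
# Promote the worker, demote the old manager, then reboot each node in turn
docker node promote worker
docker node demote mgr
Restart-Computer -Force
```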

xargon180 commented 4 years ago

@immon4ik For me, patch KB4558998 makes no difference; Swarm still does not work. @daschott Is there an official statement on which patch will include the bug fix?

daschott commented 4 years ago

@xargon180 No publicly available patch includes the fix yet. Only a private patch that can be requested through Microsoft Support.

The tentative ETA for the cumulative update (public patch) is 15th September 2020.

veepee78 commented 4 years ago

Any idea whether this patch was released in the September cumulative update, KB4570333 (https://support.microsoft.com/en-us/help/4570333/windows-10-update-kb4570333)? It does not say anything about Docker.

simmohall commented 4 years ago

I was hopeful too, but I just spun up 2 VMs in Azure, applied the September patches, and still can't connect multiple services together :( Pretty disappointing that this is taking so long, given that this is a fundamental regression for Docker Swarm on a Windows Server 2019 environment, something the MS docs say is still supported.