Docker Swarm overlay: packet size issue with hybrid hosts (linux / windows)

crou commented 7 years ago

Description

I have a Swarm with 1 Linux host (master) and 1 Windows 2016 host (worker). An overlay network has been added. A service container running on Linux is able to call a service container on the Windows host as long as the packet is not larger than 1434 bytes. As soon as I send a larger packet (a larger HTTP POST request), the Linux service retries the connection to the destination and the communication never completes.

I suspected a problem with the MTU on a host, but it looks like both are properly set to the default of 1500.

Steps to reproduce the issue:

Create a hybrid swarm with 1 Linux and 1 Windows node (labels added for linux and windows).
Create an overlay network: docker network create --subnet 10.100.0.0/24 --gateway 10.100.0.1 -d overlay mynet (I tried with or without --opt com.docker.network.mtu=xxxx, but there is no change).
Start a Windows service container that can accept any HTTP request or any TCP socket: docker service create --network mynet --constraint node.labels.os==windows --name winapp crou/mytest main.exe
Run a Linux service with curl on it: docker service create --network mynet --constraint 'node.labels.os == linux' --name httpd httpd

Describe the results you received: From the linux service, this request is not working, never completes (more 1434bytes): curl -X POST -H "Content-Type: application/json" -H "X-Header1: 111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111" -H "X-Header2: 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222333333333333X" -d '{ "description": "this is a test" }' --verbose --progress-bar http://10.100.0.5:8080/login

The Linux client never receives a response.

Describe the results you expected:

A response is expected from the windows container.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64
 Experimental: false

----------------------------------------------------

Client:
 Version:      17.03.1-ee-3
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 19:31:22 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.03.1-ee-3
 API version:  1.27 (minimum version 1.24)
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 19:31:22 2017
 OS/Arch:      windows/amd64
 Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 30
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: jduh1asno2nmua1turn54bu1k
 Is Manager: true
 ClusterID: bdwg4nv8te6qig8dk42h904lc
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.5
 Manager Addresses:
  10.0.0.5:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-79-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.851GiB
Name: lin
ID: LOKZ:FD77:ZTPH:OFLD:5QB2:PXZ5:FHVJ:U4KW:B5HV:STM7:5WZZ:7LN4
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

------------------------------------------------------
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 17.03.1-ee-3
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
Swarm: active
 NodeID: we6bzffs3jsxy0ufjoppqqjox
 Is Manager: false
 Node Address: 10.0.0.9
 Manager Addresses:
  10.0.0.5:2377
Default Isolation: process
Kernel Version: 10.0 14393 (14393.1198.amd64fre.rs1_release_sec.170427-1353)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 7 GiB
Name: win03
ID: 2W2I:IVB4:W75T:LPY4:VB4N:L5KT:VDBL:SQ4G:UCIV:AVDU:3ZIR:PLBV
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: -1
 Goroutines: 83
 System Time: 2017-06-08T19:29:33.9598085Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Both VMs are running on Azure.

crou commented 7 years ago

Big/larger request that is looping in a TCP Retransmission pattern:

Compared to a Small request, properly working:

friism commented 7 years ago

cc @mavenugo @dineshgovindasamy @kallie-b

crou commented 7 years ago

Hi, still no update? I have the same behavior with hybrid docker swarm running on Azure and on a vmware infrastructure.

Here is a screen capture / demo between 2 Linux hosts and 1 Windows host. Linux to Linux works great, with no packet size issue. But between Windows and Linux, it works only as long as I send packets <= 1450 bytes. For any larger packet, there is no answer.

https://youtu.be/Gs5UncrmR_0

kallie-b commented 7 years ago

@crou -- Apologies for the delay here.

So as I look at your docker service create commands above, I'm seeing an issue. You're missing the flag that sets dnsrr as the endpoint mode for your services. Today, the default service endpoint mode (vip) is not supported on Windows, so you need to make sure to specify the dnsrr endpoint for all services running on a cluster that includes Windows nodes.

So, try this. Create both of your services again, but include the --endpoint-mode dnsrr flag in your command:

docker service create --network mynet --constraint node.labels.os==windows --name winapp --endpoint-mode dnsrr crou/mytest main.exe

docker service create --network mynet --constraint 'node.labels.os == linux' --name httpd --endpoint-mode dnsrr httpd

Let me know if this fixes your issue...

(Also, see our docs for more context on why you must include the dnsrr specification during service creation today)

crou commented 7 years ago

Thanks for the answer @kallie-b, but that doesn't help. I've tried the services (both on the Linux and the Windows box) with the dnsrr mode and it doesn't change anything. Ping works up to 1450 bytes, but any request with a larger packet fails (ICMP or TCP).

kallie-b commented 7 years ago

@crou Okay, thanks. Let me check with my team on this, and I'll get back to you asap!

crou commented 7 years ago

In the meantime, I've also opened a ticket with Microsoft and after a few message exchanges, I've got the following email from a MS Support Escalation Engineer:

"At this point the PG(product group) has said that this is bug which they need to investigate and fix it. We don’t have any time lines yet as to when this would-be root caused and when the fix would be released. It can take a 2 or 3 weeks and even more than that. "[...]

Well, it doesn't look good so far for hybrid Linux and Windows Docker Swarm. @kallie-b, let me know if you have more feedback from your team, because even if you demonstrated that a hybrid scenario is possible, pushing your lab just a bit further will confirm that the network is not working between your containers.

rn commented 7 years ago

@crou thanks for the detailed report. Could you provide a pcap file of the traffic captured on the Linux VM (host) eth0 interface, and ideally also a packet capture from the Windows host? The overlay networking (VXLAN) adds an overhead of 50 bytes, hence we see fragmentation, but it's not clear where things go wrong.
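The 50-byte figure can be reproduced from the standard header sizes. The following is a minimal sketch in Python, assuming an IPv4 underlay with no IP options and no VLAN tag; the constant and function names are illustrative only and are not part of Docker.

```python
# Header sizes in bytes for VXLAN over IPv4 (assumed: no IP options, no VLAN tag).
INNER_ETHERNET = 14  # the container's Ethernet header, carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20

VXLAN_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4


def outer_packet_size(inner_ip_packet: int) -> int:
    """Size of the outer IP packet on the host NIC for a given inner IP packet."""
    return inner_ip_packet + VXLAN_OVERHEAD


def needs_fragmentation(inner_ip_packet: int, host_mtu: int = 1500) -> bool:
    """True if the encapsulated packet no longer fits in the host MTU."""
    return outer_packet_size(inner_ip_packet) > host_mtu


if __name__ == "__main__":
    print(VXLAN_OVERHEAD)             # 50
    print(outer_packet_size(1500))    # 1550
    print(needs_fragmentation(1500))  # True
    print(needs_fragmentation(1450))  # False
```

Under these assumptions, a container interface left at MTU 1500 on a host whose NIC is also at 1500 produces outer packets of up to 1550 bytes, which is where the fragmentation comes from.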

crou commented 7 years ago

Here are the pcap files.
issue-33596.zip

Both scenarios have been executed by running the command from the linux container to the windows container.

First, a WORKING scenario where both the ping and curl requests succeed: ping 10.100.1.4 -s 1342 and curl http://10.100.1.4:8080 (Linux and Windows host pcap files)

and a NOT-WORKING scenario where the request exceed the working threshold: ping 10.100.1.4 -s 1343 curl -X POST -H "param1:A............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Z" http://10.100.1.4:8080 --verbose

The capture seems to confirm that the Windows host receives the proper request but fails to pass it on to the container.

crou commented 7 years ago

@rn Did you find anything interesting in the pcap files?

crou commented 7 years ago

I've received an update from the MS Escalation support engineer regarding the issue. Apparently it's now a bug confirmed on the windows side:

"I had the conversation with pg (Product Group) and they say that they would fix this issue for rs1 . The fix would come possibly in terms of an update. At this point I am not sure how much time it would take but a rough estimation that I have received from PG is 2 months . "[... ]

From what I have experienced, as long as there is no update from Microsoft, no mixed Docker swarm is fully working (multi-host, mixed Linux and Windows containers).

thaJeztah commented 7 years ago

Thanks for the update @crou

fcrisciani commented 7 years ago

@crou did you try to create the overlay network specifying -o com.docker.network.driver.mtu=XXX, where XXX is the MTU that works for both Linux and Windows? Maybe that can be a workaround.

kallie-b commented 7 years ago

@crou

Apologies for the delayed reply. In checking with my team, another way you could try getting around this issue would be to increase the size of the MTU directly on the host network adapters (we recommend doing so by at least 160 Bytes). Of course, to prevent fragmentation with this approach you would also need to increase the MTU on all layer-2 hops (e.g. switch ports) along the impacted data path.

One way to do this could be to enable jumbo frames across your adapters and L2 devices.

crou commented 7 years ago

@fcrisciani I've tried setting the MTU at the overlay driver level with com.docker.network.driver.mtu. With this approach, I can see the MTU is properly applied inside my Linux containers; however, something is apparently not being passed to the Windows host, because the MTU inside the Windows container is still 1500.

The only way I could have the mtu set to 1450 on the windows container was to set it at the daemon level with the mtu option in the daemon.json

But in both cases, it doesn't help with the communication between mixed containers :-(

@kallie-b, on Azure VMs, I've tried changing the MTU at the VM and container level, but I had no luck either. Maybe I missed something; I can double-check my settings again. But in that case, I don't know what impact the layer-2 hops in Azure will have, since that layer is already an overlay from what I know...
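For readers who want to try the daemon-level setting mentioned above, here is a minimal Python sketch that adds an mtu key to the text of an existing daemon.json while keeping the other settings. The helper name with_mtu is hypothetical, and the location of the file is not given in this thread, so the sketch works on the JSON text only. As noted above, this setting did not fix the cross-OS traffic.

```python
import json


def with_mtu(daemon_json_text: str, mtu: int) -> str:
    """Return daemon.json content with the "mtu" option set, keeping other keys."""
    # An empty or whitespace-only file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["mtu"] = mtu
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Example input: a daemon.json that only enables debug mode.
    print(with_mtu('{"debug": true}', 1450))
```

The daemon has to be restarted for a changed daemon.json to take effect.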

JMesser81 commented 7 years ago

@crou, @fcrisciani cc @kallie-b

Thanks for reporting this issue. We discovered a bug in the Windows platform which does not handle fragmented VXLAN packets correctly. A work-around for this will be for you to update Linux hosts to not send fragmented VXLAN packets (if possible).

Apologies for the inconvenience.

fcrisciani commented 7 years ago

@JMesser81 thanks for the update. Can you also investigate whether, and why, com.docker.network.driver.mtu is not properly configured on the Windows side? That would have been a good workaround to avoid fragmentation, but as @crou mentions here (https://github.com/moby/moby/issues/33596#issuecomment-315734234), it looks like Windows is not properly enforcing it.

crou commented 7 years ago

@JMesser81 Do you have an idea how to set linux hosts to not send fragmented VXLAN packets? I've googled a bit without finding the answer...

crou commented 7 years ago

@JMesser81 @kallie-b , do you have an idea when a fix is expected for that nasty network bug?

I've sent a request for more information on the support ticket I have opened with Microsoft, but still no news.

kallie-b commented 7 years ago

@crou -- Thanks for the nudge, checking with the team...

kallie-b commented 7 years ago

@crou -- We aren't able to provide a timeline for this fix right now. But we do need to identify a suitable workaround in the meantime. I'm working with my team to see if one can be identified/fleshed out.

mle-ii commented 6 years ago

@kallie-b Any workaround found yet? If this is the same issue as the one I reported (tagged just above), it blocks log4net logging via UDP, at least for our use case.

We use the log4net RemoteSyslogAppender to send messages to our graylog instance and a lot of messages are greater than 1500 bytes in size. I have found some hacky ways of using TCP instead but it's definitely going to be a pain to set up and going to eat up more network bandwidth for both the remote, host and container and we would rather use UDP. I imagine there are going to be other similar scenarios for which this issue is going to be a blocker.

crou commented 6 years ago

@mle-ii That looks like exactly the kind of trouble this nasty bug causes. A mixed swarm with Windows and Linux is a beautiful idea, but without strong network support in the HNS stack, it will be hard to deploy anything other than simple labs or proofs of concept.

@kallie-b Do you have an idea if the upcoming windows server build 1709 will include a fix for that? There are improvements to support k8s properly in HNS but how about that important one?

mle-ii commented 6 years ago

@crou FWIW it doesn't even require swarm or mixed environment. I haven't tried Docker/containers on a non-windows OS host, but both the windows containers and linux containers have the same behavior. I was able to get 2 linux containers on the same network to not drop packets when talking to each other, but any UDP data going out of the container network was dropped.

Switch to Linux containers on Windows. Run docker run --rm alpine /bin/ash -c "ifconfig; nc -v -l -u -p 11000" in one console window on the Windows host (I'm using Windows 10 Pro). Get the IP address of the container; it's listening on the port specified. Then run docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1453' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_CONTAINER_IP 11000". Notice that the listening client gets the message. (Kill the client afterwards; I haven't figured out how to make it stop after it gets the message.)

Now find some Linux box (not a container running Linux) that isn't the Windows host, and get its IP with ifconfig.
Then run the following on the Linux box: nc -v -l -u -p 11000

Now on your Windows host with docker where you're using Linux containers run the following: docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1453' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_LINUX_BOX_IP 11000" Notice it doesn't get the message.

Now run this command, sending 1 byte less of data: docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1452' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_LINUX_BOX_IP 11000" It gets the message this time. Here, though, the limit appears to be 1480 total bytes and not 1500 bytes; I'm not sure if the alpine/linux version has a different MTU.
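The 1480-byte total matches the payload sizes in the commands above once the UDP and IP headers are added. A minimal Python sketch of that arithmetic, assuming IPv4 with no IP options (the function names are illustrative):

```python
IPV4_HEADER = 20  # bytes, assumed: no IP options
UDP_HEADER = 8    # bytes


def ip_packet_size(udp_payload: int) -> int:
    """Total IPv4 packet size for a UDP datagram carrying udp_payload bytes."""
    return udp_payload + UDP_HEADER + IPV4_HEADER


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


if __name__ == "__main__":
    print(ip_packet_size(1452))   # 1480, the size that gets through
    print(ip_packet_size(1453))   # 1481, the size that is dropped
    print(max_udp_payload(1500))  # 1472, what a 1500-byte MTU would allow
```

So the observed cutoff sits 20 bytes below what a plain 1500-byte MTU would permit for a single UDP datagram.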

(Unrelated I think, but as a note for some reason I couldn't use the bash on my Linux host, though it works fine host to host in multiple bash windows.)

I tried a lot of different things to get the windows containers working but nothing worked for me. Really puzzled how this isn't hit by more people as sending UDP network data out is used by a lot of things that send log data.

mle-ii commented 6 years ago

To follow up it "appears" to be Docker on Windows only. Running similar commands above in the console on the Docker classroom there doesn't seem to be any message drop. Ran it in the console here, though had the listen running in a background job: http://training.play-with-docker.com/helloworld/

Set a background job listening to udp on port 11000: nc -v -l -u -p 11000 &

Get the ipaddress of the node running in the docker classroom: ping -c 1 $(hostname)

Run a container sending a ping to the host from inside the container greater than mtu (which appears to be 1500): docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1500' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_DOCKERCLASSROOM_NODE_IP 11000"

Notice that it gets the udp message that was sent from within the container.
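The same send-and-listen check can be scripted instead of using nc. The following is a rough Python sketch with a hypothetical helper, udp_probe, that sends one UDP datagram of a given size and reports how many bytes arrived. As written it runs sender and listener in one process over loopback, so it only demonstrates the mechanics and will not reproduce the drop; for a real test the listener and the sender would have to run on the two machines being compared.

```python
import socket


def udp_probe(payload_size: int, host: str = "127.0.0.1", timeout: float = 1.0) -> int:
    """Send one UDP datagram of payload_size bytes to a local listener.

    Returns the number of bytes the listener received, or 0 on timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as listener, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        listener.bind((host, 0))  # port 0: let the OS pick a free port
        listener.settimeout(timeout)
        sender.sendto(b"A" * payload_size, listener.getsockname())
        try:
            data, _ = listener.recvfrom(65535)
        except socket.timeout:
            return 0
        return len(data)


if __name__ == "__main__":
    for size in (1452, 1453, 1500):
        print(size, udp_probe(size))
```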

chrisvanderpennen commented 6 years ago

Perhaps the official Microsoft mixed-OS cluster documentation should be updated with a reference to this issue until it's resolved? I have Windows 2016 build 14393.2035 / Docker 17.06.2-ee-6 and Ubuntu 16.04 / Docker 18.01.0-ce in a two-node cluster in Azure, and just spent two days tracking down why nginx and caddy were choking proxying some requests but not others.

Setting:

networks:
  name:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: 1400

in my stack's compose file worked around this.

edit: quoting from above:

The only way I could have the mtu set to 1450 on the windows container was to set it at the daemon level with the mtu option in the daemon.json

seems to still be required: I just checked inside a Windows container and get-netipinterface shows mtu 1500.

moby / moby

Docker Swarm overlay: packet size issue with hybrid hosts (linux / windows) #33596