crou opened this issue 7 years ago
Large request, looping in a TCP retransmission pattern (screenshot):
Compared to a small request, working properly (screenshot):
cc @mavenugo @dineshgovindasamy @kallie-b
Hi, still no update? I have the same behavior with a hybrid Docker swarm running on Azure and on a VMware infrastructure.
Here is a screen capture / demo between 2 Linux nodes and 1 Windows node. Linux to Linux works great, with no packet-size issue. But between Windows and Linux it only works as long as I send packets <= 1450 bytes; any larger packet gets no answer.
@crou -- Apologies for the delay here.
So as I look at your docker service create commands above, I'm seeing an issue. You're missing the flag that sets dnsrr as the endpoint mode for your services. Today, the default service endpoint mode (vip) is not supported on Windows, so you need to make sure to specify the dnsrr endpoint mode for all services running on a cluster that includes Windows nodes.
So, try this. Run both of your services again, but include the --endpoint-mode dnsrr flag in your commands:
docker service create --network mynet --constraint node.labels.os==windows --name winapp --endpoint-mode dnsrr crou/mytest main.exe
docker service create --network mynet --constraint 'node.labels.os == linux' --name httpd --endpoint-mode dnsrr httpd
Let me know if this fixes your issue...
(Also, see our docs for more context on why you must include the dnsrr specification during service creation today.)
Thanks for the answer @kallie-b, but that doesn't help. I've tried the services (on both the Linux and Windows boxes) with the dnsrr mode and it doesn't change anything. Ping works up to 1450 bytes, but any larger packet fails (ICMP or TCP).
@crou Okay, thanks. Let me check with my team on this, and I'll get back to you asap!
In the meantime, I've also opened a ticket with Microsoft and after a few message exchanges, I've got the following email from a MS Support Escalation Engineer:
"At this point the PG(product group) has said that this is bug which they need to investigate and fix it. We don’t have any time lines yet as to when this would-be root caused and when the fix would be released. It can take a 2 or 3 weeks and even more than that. "[...]
Well, it doesn't look good so far for hybrid Linux and Windows Docker swarms. @kallie-b, let me know if you have more feedback from your team, because even though you demonstrated that a hybrid scenario is possible, pushing your lab just a bit further will confirm that the network is not working between your containers.
@crou thanks for the detailed report. Could you provide a pcap file of the traffic captured on the Linux VM (host) eth0 interface, and ideally also a packet capture from the Windows host? The overlay networking (VXLAN) adds an overhead of 50 bytes, hence we see fragmentation, but it's not clear where things go wrong.
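As a back-of-the-envelope sketch of that 50-byte overhead (my own arithmetic, not from the captures): VXLAN wraps each inner Ethernet frame in an outer IPv4 (20) + UDP (8) + VXLAN (8) header plus the inner Ethernet header (14), so on a 1500-byte physical MTU the inner budget shrinks to 1450 bytes:

```shell
# VXLAN overhead: outer IPv4 20 + outer UDP 8 + VXLAN header 8 + inner Ethernet 14 = 50
inner_mtu=$((1500 - 50))          # inner IP packet budget on a 1500-byte physical MTU
max_icmp=$((inner_mtu - 20 - 8))  # minus inner IPv4 (20) and ICMP (8) headers
echo "inner MTU: $inner_mtu, max unfragmented ping -s payload: $max_icmp"
```

(Lower observed thresholds, like the 1342/1343 cutoff below, would suggest the underlying network's effective MTU is itself under 1500, as is common on cloud infrastructure.)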
Here are the pcap files.
issue-33596.zip
Both scenarios have been executed by running the command from the linux container to the windows container.
The first is the WORKING scenario, where the PING and CURL requests succeed: ping 10.100.1.4 -s 1342 and curl http://10.100.1.4:8080 (Linux and Windows host pcap files),
and a NOT-WORKING scenario where the request exceed the working threshold: ping 10.100.1.4 -s 1343 curl -X POST -H "param1:A............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Z" http://10.100.1.4:8080 --verbose
The capture seems to confirm that the Windows host receives the proper request but fails to hand the query off to the container.
@rn Did you find anything interesting in the pcap files?
I've received an update from the MS Escalation support engineer regarding the issue. Apparently it is now a confirmed bug on the Windows side:
"I had the conversation with pg (Product Group) and they say that they would fix this issue for rs1 . The fix would come possibly in terms of an update. At this point I am not sure how much time it would take but a rough estimation that I have received from PG is 2 months . "[... ]
From what I've experimented, as long as there is no update from Microsoft, no mixed Docker swarm is fully working (multi-host, mixed Linux and Windows containers).
Thanks for the update @crou
@crou did you try creating the overlay network specifying -o com.docker.network.driver.mtu=XXX, where XXX is an MTU that works for both Linux and Windows? Maybe it can be a workaround.
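For concreteness, the suggestion above would look like this (a sketch only; mynet and the 1450 value are placeholders, and per the follow-up comments the option did not end up propagating to the Windows side):

```shell
# Create the overlay network with a reduced MTU so VXLAN-encapsulated
# frames stay within the 1500-byte physical MTU (1500 - 50 bytes of overhead).
docker network create \
  -d overlay \
  -o com.docker.network.driver.mtu=1450 \
  mynet
```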
@crou
Apologies for the delayed reply. In checking with my team, another way you could try getting around this issue would be to increase the size of the MTU directly on the host network adapters (we recommend doing so by at least 160 Bytes). Of course, to prevent fragmentation with this approach you would also need to increase the MTU on all layer-2 hops (e.g. switch ports) along the impacted data path.
One way to do this could be to enable jumbo frames across your adapters and L2 devices.
@fcrisciani I've tried setting the MTU at the overlay driver with com.docker.network.driver.mtu. Going this route, I can see the MTU is properly applied inside my Linux containers; however, something is apparently not being passed to the Windows host, because the MTU inside the Windows container is still 1500.
The only way I could get the MTU set to 1450 on the Windows container was to set it at the daemon level, with the mtu option in daemon.json.
But in both cases, it doesn't help with the communication between mixed containers :-(
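For reference, the daemon-level setting mentioned above is the standard mtu key in daemon.json (on Windows this file normally lives at C:\ProgramData\docker\config\daemon.json); as noted, it changed the container's reported MTU but did not fix the mixed-OS traffic:

```json
{
  "mtu": 1450
}
```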
@kallie-b, on Azure VMs, I've tried changing the MTU at the VM and container level but had no luck either. Maybe I missed something; I can double-check my settings again, but in that case I don't know how layer-2 hops in Azure would be involved, since that layer is already an overlay from what I know...
@crou, @fcrisciani cc @kallie-b
Thanks for reporting this issue. We discovered a bug in the Windows platform which does not handle fragmented VXLAN packets correctly. A work-around for this will be for you to update Linux hosts to not send fragmented VXLAN packets (if possible).
Apologies for the inconvenience.
@JMesser81 thanks for the update. Can you also investigate why and whether com.docker.network.driver.mtu is not properly configured on the Windows side? That would have been a good workaround to avoid fragmentation, but as @crou mentions here (https://github.com/moby/moby/issues/33596#issuecomment-315734234) it looks like Windows is not properly enforcing it.
@JMesser81 Do you have an idea how to set linux hosts to not send fragmented VXLAN packets? I've googled a bit without finding the answer...
@JMesser81 @kallie-b , do you have an idea when a fix is expected for that nasty network bug?
I've sent a request for more information on the support ticket I have opened with Microsoft, but still no news.
@crou -- Thanks for the nudge, checking with the team...
@crou -- We aren't able to provide a timeline for this fix right now. But we do need to identify a suitable workaround in the meantime. I'm working with my team to see if one can be identified/fleshed out.
@kallie-b Any workaround found yet? If this is the same issue I reported (tagged just above), this blocks log4net logging via UDP, at least for our use case.
We use the log4net RemoteSyslogAppender to send messages to our graylog instance, and a lot of messages are larger than 1500 bytes. I have found some hacky ways of using TCP instead, but it's definitely going to be a pain to set up and will eat up more network bandwidth for the remote, the host, and the container, and we would rather use UDP. I imagine there will be other scenarios for which this issue is a blocker.
@mle-ii That looks like exactly the kind of trouble this nasty bug causes. A mixed Windows/Linux swarm is a beautiful idea, but without strong network support in the HNS stack it will be hard to deploy anything beyond simple labs/proofs of concept.
@kallie-b Do you have an idea whether the upcoming Windows Server build 1709 will include a fix for this? There are improvements to support k8s properly in HNS, but how about this important one?
@crou FWIW it doesn't even require swarm or mixed environment. I haven't tried Docker/containers on a non-windows OS host, but both the windows containers and linux containers have the same behavior. I was able to get 2 linux containers on the same network to not drop packets when talking to each other, but any UDP data going out of the container network was dropped.
Switch to Linux containers on Windows.
In one console window on the Windows host (I'm using Windows 10 Pro), run:
docker run --rm alpine /bin/ash -c "ifconfig; nc -v -l -u -p 11000"
Get the IP address of the container, it's listening on the port specified.
Run docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1453' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_CONTAINER_IP 11000"
Notice that the listening side gets the message. (Kill it afterwards; I haven't figured out how to make it stop after it gets the message.)
Now find some Linux box (not a container running Linux, and not a Windows host) and get its IP with ifconfig.
Then run the following on the linux box.
nc -v -l -u -p 11000
Now on your Windows host with docker where you're using Linux containers run the following:
docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1453' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_LINUX_BOX_IP 11000"
Notice it doesn't get the message.
Now run this command sending 1 byte smaller in data size.
docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1452' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_LINUX_BOX_IP 11000"
It gets the message this time. Here, though, the limit appears to be 1480 total bytes rather than 1500; not sure if this alpine/linux version has a different MTU.
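The 1452/1453 cutoff is consistent with a 1480-byte limit on the IP packet once the IPv4 and UDP headers are added (my arithmetic, assuming IPv4 without options):

```shell
# nc payload + IPv4 header (20) + UDP header (8) = size of the IP packet on the wire
echo $((1452 + 20 + 8))   # 1480 -> just fits under the apparent 1480-byte limit
echo $((1453 + 20 + 8))   # 1481 -> one byte over; the datagram is dropped
```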
(Unrelated I think, but as a note for some reason I couldn't use the bash on my Linux host, though it works fine host to host in multiple bash windows.)
I tried a lot of different things to get the Windows containers working, but nothing worked for me. I'm really puzzled that more people aren't hitting this, since plenty of tools send log data out over UDP.
To follow up, it "appears" to be Docker on Windows only. Running similar commands in the console on the Docker classroom, there doesn't seem to be any message drop. I ran them in the console there, though with the listener running as a background job: http://training.play-with-docker.com/helloworld/
Set a background job listening to udp on port 11000:
nc -v -l -u -p 11000 &
Get the ipaddress of the node running in the docker classroom:
ping -c 1 $(hostname)
Run a container sending a ping to the host from inside the container greater than mtu (which appears to be 1500):
docker run --rm alpine /bin/ash -c "ifconfig; printf '%*s' '1500' | tr ' ' 'A' | nc -w 1 -v -u REPLACE_WITH_DOCKERCLASSROOM_NODE_IP 11000"
Notice that it gets the udp message that was sent from within the container.
Perhaps the official Microsoft mixed-OS cluster documentation should be updated with a reference to this issue until it's resolved? I have Windows Server 2016 build 14393.2035 / Docker 17.06.2-ee-6 and Ubuntu 16.04 / Docker 18.01.0-ce in a two-node cluster in Azure, and just spent two days tracking down why nginx and caddy were choking while proxying some requests but not others.
Setting:
networks:
  name:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: 1400
in my stack's compose file worked around this.
edit: quoting from above:
"The only way I could have the mtu set to 1450 on the windows container was to set it at the daemon level with the mtu option in the daemon.json"
seems to still be required; I just checked inside a Windows container and get-netipinterface shows MTU 1500.
Description
I have a Swarm with 1 Linux host (master) and 1 Windows Server 2016 host (worker). An overlay network has been added. A service container running on Linux is able to call a service container on the Windows host as long as the packet is not larger than 1434 bytes. As soon as I send a larger packet (a larger HTTP POST request), the Linux service retries the connection to the destination and the communication never completes.
I suspected a problem with the MTU on a host but it looks like both are properly set to default 1500.
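For anyone reproducing this, the MTU each side reports can be checked with standard tooling (generic commands, not taken from the thread):

```shell
# Linux host or Linux container: print the interface MTU
ip link show eth0 | grep -o 'mtu [0-9]*'
# Windows host or container (PowerShell):
#   Get-NetIPInterface | Format-Table InterfaceAlias, NlMtu
```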
Steps to reproduce the issue:
docker network create --subnet 10.100.0.0/24 --gateway 10.100.0.1 -d overlay mynet
(I tried with or without --opt com.docker.network.mtu=xxxx but there is no change)
docker service create --network mynet --constraint node.labels.os==windows --name winapp crou/mytest main.exe
docker service create --network mynet --constraint 'node.labels.os == linux' --name httpd httpd
Describe the results you received: From the Linux service, this request (more than 1434 bytes) does not work and never completes:
curl -X POST -H "Content-Type: application/json" -H "X-Header1: 111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111" -H "X-Header2: 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222333333333333X" -d '{ "description": "this is a test" }' --verbose --progress-bar http://10.100.0.5:8080/login
The Linux client never receives a response.
Describe the results you expected:
A response is expected from the windows container.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Both VMs are running on Azure.