Open fdevibe opened 3 years ago
Digging a bit further into this, I see that the larger packets seem to trigger the `select()` in `vmw_netfilter_event_handler()`, indicating (if I understand the code somewhat correctly) that they are interpreted as events. However, when I simultaneously run `conntrack -E -e ALL`, I get no output. I'm by no means an expert in packet filtering or conntrack, but my interpretation of this would be that the callback(s) aren't set up correctly in `vmw_conn_netfilter.c`.
Further investigation shows that `VNET_BUFSIZE` (set to 1024) is the limiting factor: packets larger than the buffer are cut off. After increasing it, the larger packets also went through as they should.
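The mechanics can be sketched with a plain shell simulation (file names are arbitrary choices; the 1058-byte size matches the kernel traces further down in this thread): a single read through a 1024-byte buffer returns at most 1024 bytes, just like a `recv()` into a `VNET_BUFSIZE`-sized buffer would.

```shell
# Build a 1058-byte "packet" (size taken from the traces in this thread).
head -c 1058 /dev/zero | tr '\0' 'A' > /tmp/packet.bin

# A single read through a 1024-byte buffer returns at most 1024 bytes,
# mirroring recv() into a VNET_BUFSIZE-sized buffer.
bread=$(dd if=/tmp/packet.bin bs=1024 count=1 2>/dev/null | wc -c)
echo "packet=$(wc -c < /tmp/packet.bin) bread=$bread"
```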
Thanks Fredrik. I will take a look and get back to you by Tuesday (15th December).
@svijayvargiy Thanks! I have also noticed that these callbacks are not triggered when sending packets between the hosts (and not between containers on the hosts). I'm not sure what the intentions are here, but could it be that the traffic between the containers shouldn't trigger these callbacks at all? In that case, perhaps the problem is in the iptables rules.
@svijayvargiy did you have a chance to look into this yet?
I have examined this further, and it seems Docker and overlay networks aren't required for this to occur. Looking at the filters created by `/etc/rc.d/init.d/vmw_conn_notifyd`, I see that the rule

iptables -A INPUT -p udp -m conntrack --ctstate NEW -m mark ! --mark 0x1/0x1 -m comment --comment AppDefense_Iptable_rules -j NFQUEUE --queue-num 0 --queue-bypass
is where these packets hit. Since Docker Swarm overlay traffic uses VXLAN and appears to be sent over UDP, they hit this rule, but ICMP sent directly from host to host won't trigger this rule (since they aren't encapsulated within UDP packets). Now, if I do send UDP packets over the 1024 byte threshold from host to host, I am able to reproduce the issue. Steps to reproduce:
First, trace packets in the raw table on the host used as netcat server:
iptables -t raw -A PREROUTING -p udp --dport 31377 -j TRACE
Next, set up a netcat UDP server:
nc -4vulp 31377
and client:
nc -4vu <server> 31377
with `vmw_conn_notifyd` running on the server.
I also modified `vmw_conn_netfilter.c` (optional) by adding some more log output, namely the line

NOTICE("vmw_conn_netfilter: got netfilter event: bread: %d\n", (int) bread);

just after

bread = recv(sess->queue_ctx.qfd, buf, sizeof(buf), 0);
Next, tail the kernel log on the server to see the traced packets:
dmesg --follow
or (required to see the NOTICE messages from the patched `vmw_conn_notify`):

tail -f /var/log/messages
Now, if you send messages from the client to the server, you will see them showing up in the server session, and you should see the traces. Increase the size of the text block you transmit, and when the received packets reach a size of more than 1024 bytes, they should no longer show up on the server.
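To make the size step reproducible (rather than pasting ever-larger text blocks into the netcat session), a payload just above the threshold can be generated and fed to the client; the file name and 1100-byte size here are arbitrary choices for illustration:

```shell
# Generate a 1100-byte payload, comfortably above the 1024-byte threshold.
head -c 1100 /dev/zero | tr '\0' 'A' > /tmp/payload.txt
wc -c /tmp/payload.txt

# Then send it with the client from the steps above, e.g.:
#   nc -4u <server> 31377 < /tmp/payload.txt
```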
With the iptables INPUT chain in the filter table:
Chain INPUT (policy ACCEPT 329K packets, 32M bytes)
pkts bytes target prot opt in out source destination
0 0 ACCEPT udp -- lo * 0.0.0.0/0 0.0.0.0/0 /* AppDefense_Iptable_rules */
0 0 ACCEPT tcp -- lo * 0.0.0.0/0 0.0.0.0/0 tcp flags:0x3F/0x02 mark match ! 0x7e/0xfe /* AppDefense_Iptable_rules */
45529 3958K NFQUEUE udp -- * * 0.0.0.0/0 0.0.0.0/0 udp spt:53 ctstate ESTABLISHED mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
65 41506 NFQUEUE udp -- * * 0.0.0.0/0 0.0.0.0/0 ctstate NEW mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
21312 1279K NFQUEUE tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp flags:0x3F/0x02 mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
7 420 vnetchain tcp -- * * 0.0.0.0/0 0.0.0.0/0 mark match ! 0x1/0x1 tcp flags:0x1F/0x02
I've seen the following traces:
Dec 18 15:15:32 dst-host kernel: TRACE: raw:PREROUTING:policy:5 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058
Dec 18 15:15:32 dst-host kernel: TRACE: mangle:PREROUTING:policy:1 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058
Dec 18 15:15:32 dst-host kernel: TRACE: mangle:INPUT:policy:1 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 UID=55377 GID=10513
Dec 18 15:15:32 dst-host kernel: TRACE: filter:INPUT:rule:4 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 UID=55377 GID=10513
Dec 18 15:15:32 dst-host vmw_conn_notify[17287]: NOTICE: vmw_netfilter_event_handler: vmw_conn_netfilter: got netfilter event: bread: 1024
Here, you can see that the length of the transmitted packets is 1058 bytes, while they are truncated by the `vmw_conn_netfilter.c` buffer to 1024 bytes, and this is as far as they get, as long as the packets are truncated.
A caveat is that I have seen cases where incoming packets don't hit this rule, presumably because they haven't been treated as new. I am not entirely sure in which cases this occurs, but then you would see no rules hitting in the trace, with `filter:INPUT:rule:4` being replaced by `filter:INPUT:rule:7` (or whichever rulenum is one higher than the number of rules in the INPUT chain of the filter table, 7 in the case above). In that case, the code in `vmw_conn_netfilter.c` isn't called. I've restarted netcat and again reproduced the issue successfully, but I cannot really explain why I have seen both cases.
Sorry for the delayed response. The code just reads 1024 bytes for verification and it does not truncate the packets. Since it is easily reproducible at your site, can you just increase buffer size to 4096 (in vmw_conn_netfilter.c) and check whether it is reproducible or not?
Hi Fredrik,
I have identified a fix for this issue. Please let me know how to deliver the fix. Also, please let me know the following
Regards, Shirish
@svijayvargiy, what kind of fix is it? I would presume updating the code would be the most sensible way, then we can compile a new version and test, and we'll get the updated packages on the next release.
$ rpm -qa | grep -i vmw
vmw-glx-2.3.2.0-16239797.x86_64
Guest-Introspection-for-VMware-NSX-1.1.0.0-16239797.centos.x86_64
As stated in the issue description, the redhat-release is CentOS Linux release 7.9.2009 (Core).
For the record, I have tried increasing the buffer size. The result is really slow: some of the more network-intensive jobs take more than 30x the time of the same job on a swarm that doesn't have these tools installed. I cannot yet say for sure that this is the problem, but it leads me to suspect that piping all the network traffic through this function is the wrong thing to do.
The fix is to define `VNET_BUFSIZE` as follows (file name: `vmw_conn_netfilter.c`):
@svijayvargiy I suspect this is only a partial solution to the problem. As stated above, the running time of network-intensive tasks increases to above 30x when the introspection tools are running (I've now tested without). Is the intention really that all new UDP packets should enter `vmw_netfilter_event_handler()`? In the case of overlay networks, this includes all network traffic between nodes, seemingly bogging said traffic down completely. Are UDP packets in general considered "netfilter events"? This seems very strange; what is the actual intention of this function?
To elaborate, the iptables rule
65 41506 NFQUEUE udp -- * * 0.0.0.0/0 0.0.0.0/0 ctstate NEW mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
seems to force basically all new UDP traffic into this function. Oftentimes, UDP packets are smaller than 1024 bytes, but sometimes they are not, and in the case of these overlay networks they quite frequently exceed this threshold. Increasing the buffer size will of course help with that, but it will still force a lot of traffic through here, resulting in a prohibitively slow system. At the very least, I presume making it possible to circumvent this rule, e.g. via a whitelist of hosts whose traffic doesn't go down this route, might be an option, but whether this is a good enough solution would depend on the actual intention of the rule and this function. So, do you have any input on this?
We are interested only in the first packet of a new UDP source and destination pair. Please see if the following change solves the performance issue (along with the VNET_BUFSIZE buffer size fix mentioned in my previous comment).

OLD:
rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -p udp -j NFQUEUE \
    --queue-num 0 --queue-bypass -m conntrack --ctstate NEW -m mark ! --mark $mark/$mark \
    -m comment --comment ${IPTABLE_COMMENT}"

NEW:
rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -p udp -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 -j NFQUEUE \
    --queue-num 0 --queue-bypass -m conntrack --ctstate NEW -m mark ! --mark $mark/$mark \
    -m comment --comment ${IPTABLE_COMMENT}"
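For readability, the NEW rule expands to roughly the following iptables invocation. This is a dry-run sketch, not taken verbatim from the script: the command is echoed rather than applied, and the `$mark` and `${IPTABLE_COMMENT}` values are assumptions taken from the chain listing earlier in the thread.

```shell
# Dry run: print the expanded NEW rule instead of applying it.
# mark and IPTABLE_COMMENT values are assumptions based on the
# chain listing earlier in the thread.
mark=0x1
IPTABLE_COMMENT=AppDefense_Iptable_rules
echo iptables -I INPUT 1 -p udp \
  -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 \
  -m conntrack --ctstate NEW \
  -m mark ! --mark $mark/$mark \
  -m comment --comment $IPTABLE_COMMENT \
  -j NFQUEUE --queue-num 0 --queue-bypass
```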
@svijayvargiy, thanks, that did actually help quite a bit. I'm now down from >30x to about 4x. This is still too slow, but that might be just the cost of running AppDefense? I also tried circumventing this machinery for all Docker traffic using the following patch:
--- /usr/sbin/appdef_funcs 2021-01-08 20:19:53.983634679 +0100
+++ /usr/sbin/appdef_funcs 2021-01-08 20:43:46.065636937 +0100
@@ -103,6 +103,14 @@
rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I OUTPUT 1 -o lo -p udp \
-j ACCEPT -m comment --comment ${IPTABLE_COMMENT}"
+if [ -f /etc/gi-whitelist.conf ]; then
+ while read LINE; do
+ [[ -z "$LINE" || "${LINE::1}" == "#" ]] && continue
+ IFS=" *" read PROTO ADDRESS PORT <<< "$LINE"
+ rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -s $ADDRESS -p $PROTO --dport $PORT \
+ -j ACCEPT -m comment --comment ${IPTABLE_COMMENT}_whitelisted"
+ done < /etc/gi-whitelist.conf
+fi
#Skip ip6tables rule if IPv6 is not enabled
# Please note that enable/disable of IPv6 requires system reboot so
and then ignoring Docker Swarm traffic:
$ cat /etc/gi-whitelist.conf
# Ignore packets from the following hosts to the following ports
# PROTOCOL SOURCE_ADDRESS DESTINATION_PORT
udp 192.168.1.0/24 4789
udp 192.168.1.0/24 7946
tcp 192.168.1.0/24 7946
tcp 192.168.1.0/24 2377
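As a sanity check of the conf format, the parsing loop from the patch can be dry-run outside of appdef_funcs, echoing the generated commands instead of collecting them into `rule_ipt` (the `IPTABLES` and `IPTABLE_COMMENT` values and the temp-file path are assumptions for illustration):

```shell
# Assumed values; appdef_funcs defines its own.
IPTABLES=iptables
IPTABLE_COMMENT=AppDefense_Iptable_rules

# A sample whitelist in the format above.
cat > /tmp/gi-whitelist.conf <<'EOF'
# PROTOCOL SOURCE_ADDRESS DESTINATION_PORT
udp 192.168.1.0/24 4789
tcp 192.168.1.0/24 2377
EOF

# Same loop as in the patch, but echoing instead of applying rules.
while read LINE; do
  [[ -z "$LINE" || "${LINE::1}" == "#" ]] && continue
  IFS=" *" read PROTO ADDRESS PORT <<< "$LINE"
  echo "${IPTABLES} -I INPUT 1 -s $ADDRESS -p $PROTO --dport $PORT" \
       "-j ACCEPT -m comment --comment ${IPTABLE_COMMENT}_whitelisted"
done < /tmp/gi-whitelist.conf
```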
This seems to speed things up slightly more, but for some of the tasks, only disabling AppDefense altogether makes it run significantly faster.
I am not sure exactly what you are looking for in `vmw_netfilter_event_handler()`; if 2 bytes is sufficient, I guess `-m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2` might be the correct solution for this bug. It still seems too slow in practice, but that doesn't seem directly related to this issue. I really think you should have some sort of whitelisting functionality (probably a bit more extensive than the above), as having `/usr/sbin/appdef_funcs` consistently rearrange iptables rules is terrible behaviour.
Thanks Fredrik for trying out the steps. We need only the first packet. With the rule "--connbytes-mode packets --connbytes-dir both --connbytes 1:2", we are capturing only the first two packets (packets, not bytes, since connbytes-mode is packets).
Also, please apply the same change (`-p udp -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 -j NFQUEUE`) to the OUTPUT iptables UDP rules as well.
@svijayvargiy thank you! I tested this yesterday, and now the execution time is down to the same as without AppDefense running. As such, adding `-m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2` to the UDP rules for new, unmarked packets, along with increasing `VNET_BUFSIZE`, seems to solve the issue.
Setting the buffer to 65535 seems a bit pessimistic to me, though, as even jumbo frames are in general limited to 9000 bytes, and if significant traffic enters this function, you might end up wasting quite a bit of memory. However, I don't know whether this would be an issue or not.
@svijayvargiy do you have any update on fixing this? We still have this patched locally, but I assume it will fail again once somebody adds another server or re-installs anything.
Hi Fredrik, I will patch the fix and let you know.
Regards, Shirish
We have an issue with semi-large packets being silently dropped on hosts running on VMware NSX, where the hosts are running `vmw_conn_notifyd`. If the size of network packets rises above a certain threshold (which seems to be somewhere below 900 bytes, far lower than the MTU of any interface involved), packets are simply lost. We have narrowed this down to `vmw_conn_notify`: if this is running, the failures below occur. If you cannot reproduce the error with packet sizes of 859, try to increase this number while keeping it below the MTU setting of the interface. host 1 is the swarm manager.
host 1:
docker network create --attachable --driver overlay --scope swarm foo_net
docker run -it --name test_server --hostname test_server --rm --network foo_net ubuntu:latest bash
host 2:
docker run -it --name test_client --hostname test_client --rm --network foo_net ubuntu:latest bash
Then, inside container:

On both hosts (as root, outside containers):
/etc/rc.d/init.d/vmw_conn_notifyd stop
host 2:
ping -c 1 -s 859 -M do test_server # now works
Versions:
AppDefense is v2.3.2.0
ESXi is v6.7.0, build 16773714 (patch release ESXi670-202010001, 14 October)
NSX is 6.4.6.14819921