vmware / guest-introspection-nsx

Guest Introspection for VMware NSX

Large packets sent on Docker Swarm overlay networks are dropped #25

Open fdevibe opened 3 years ago

fdevibe commented 3 years ago

We have an issue with semi-large packets being silently dropped on hosts running on VMware NSX, where the hosts run vmw_conn_notifyd. If the size of network packets rises above a certain threshold (which seems to be somewhere below 900 bytes, far lower than the MTU of any interface involved), packets are simply lost. We have narrowed this down to vmw_conn_notify: if it is running, the failures below occur. If you cannot reproduce the error with a packet size of 859, try increasing this number while keeping it below the MTU of the interface.

host 1 is the swarm manager.

host 1:

docker network create --attachable --driver overlay --scope swarm foo_net
docker run -it --name test_server --hostname test_server --rm --network foo_net ubuntu:latest bash

host 2:

docker run -it --name test_client --hostname test_client --rm --network foo_net ubuntu:latest bash

Then, inside the container:

apt update && apt install -y iputils-ping
ping -c 1 -s 858 -M do test_server # this works
ping -c 1 -s 859 -M do test_server # this doesn't

On both hosts (as root, outside containers): /etc/rc.d/init.d/vmw_conn_notifyd stop

host 2: ping -c 1 -s 859 -M do test_server # now works
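
To confirm on each host that the daemon is actually stopped before re-running the ping, a plain process check is enough (just a generic check, not part of the product):

pgrep -af vmw_conn_notify   # no output means the daemon is no longer running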

Versions:

AppDefense is v2.3.2.0
ESXi is v6.7.0, build 16773714 (patch release ESXi670-202010001, 14 October)
NSX is 6.4.6.14819921

# docker --version
Docker version 19.03.13, build 4484c46d9d
# uname -srvmpio
Linux 3.10.0-1160.2.2.el7.x86_64 #1 SMP Tue Oct 20 16:53:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)
# vmw_conn_notify -v
vmw_conn_notify version :   1.1.0.0
fdevibe commented 3 years ago

Digging a bit further into this, I see that the larger packets seem to trigger the select() in vmw_netfilter_event_handler(), indicating (if I understand the code correctly) that they are interpreted as events. However, when I simultaneously run conntrack -E -e ALL, I get no output. I'm by no means an expert on packet filtering or conntrack, but my interpretation is that the callback(s) aren't set up correctly in vmw_conn_netfilter.c.

fdevibe commented 3 years ago

Further investigation shows that VNET_BUFSIZE (set to 1024) is the limiting factor: it truncates the packets. After increasing it, the larger packets also went through as they should.

svijayvargiy commented 3 years ago

Thanks Fredrik. I will take a look and get back to you by Tuesday (15th December).

fdevibe commented 3 years ago

@svijayvargiy Thanks! I have also noticed that these callbacks are not triggered when sending packets directly between the hosts (rather than between containers on the hosts). I'm not sure what the intention is here, but could it be that traffic between the containers shouldn't trigger these callbacks at all? In that case, perhaps the problem lies in the iptables rules.

fdevibe commented 3 years ago

@svijayvargiy did you have a chance to look into this yet?

I have examined this further, and it seems Docker and overlay networks aren't required for this to occur. Looking at the filters created by /etc/rc.d/init.d/vmw_conn_notifyd, I see that the rule

iptables -A INPUT -p udp -m conntrack --ctstate NEW -m mark ! --mark 0x1/0x1 -m comment --comment AppDefense_Iptable_rules -j NFQUEUE --queue-num 0 --queue-bypass

is where these packets hit. Since Docker Swarm overlay traffic uses VXLAN and is therefore carried over UDP, it hits this rule, whereas ICMP sent directly from host to host does not (since it isn't encapsulated in UDP packets). If I send UDP packets larger than the 1024-byte threshold directly from host to host, I can reproduce the issue. Steps to reproduce:

First, trace packets in the raw table on the host used as netcat server:

iptables -t raw -A PREROUTING -p udp --dport 31377 -j TRACE

Next, set up a netcat UDP server:

nc -4vulp 31377

and client:

nc -4vu <server> 31377

with vmw_conn_notifyd running on the server.

I also (optionally) modified vmw_conn_netfilter.c to add some more log output, inserting the line

         NOTICE("vmw_conn_netfilter: got netfilter event: bread: %d\n", (int) bread);

just after

         bread = recv(sess->queue_ctx.qfd, buf, sizeof(buf), 0);

Next, tail the kernel log on the server to see the traced packets:

dmesg --follow

or (required to see the NOTICE messages from the patched vmw_conn_notify):

tail -f /var/log/messages

Now, if you send messages from the client to the server, you will see them show up in the server session, and you should see the traces. Increase the size of the text block you transmit; once the received packets exceed 1024 bytes, they no longer show up on the server.
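
If you would rather control the datagram size exactly than paste text, something along these lines should also work (a sketch; the 1058-byte payload just matches the trace below, anything above 1024 bytes will do):

# send one UDP datagram with a 1058-byte payload to the netcat server
head -c 1058 /dev/zero | tr '\0' 'A' | nc -4u -w1 <server> 31377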

With the iptables INPUT chain in the filter table:

Chain INPUT (policy ACCEPT 329K packets, 32M bytes)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     udp  --  lo     *       0.0.0.0/0            0.0.0.0/0            /* AppDefense_Iptable_rules */
    0     0 ACCEPT     tcp  --  lo     *       0.0.0.0/0            0.0.0.0/0            tcp flags:0x3F/0x02 mark match ! 0x7e/0xfe /* AppDefense_Iptable_rules */
45529 3958K NFQUEUE    udp  --  *      *       0.0.0.0/0            0.0.0.0/0            udp spt:53 ctstate ESTABLISHED mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
   65 41506 NFQUEUE    udp  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
21312 1279K NFQUEUE    tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp flags:0x3F/0x02 mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass
    7   420 vnetchain  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x1/0x1 tcp flags:0x1F/0x02

I've seen the following traces:

Dec 18 15:15:32 dst-host kernel: TRACE: raw:PREROUTING:policy:5 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 
Dec 18 15:15:32 dst-host kernel: TRACE: mangle:PREROUTING:policy:1 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 
Dec 18 15:15:32 dst-host kernel: TRACE: mangle:INPUT:policy:1 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 UID=55377 GID=10513 
Dec 18 15:15:32 dst-host kernel: TRACE: filter:INPUT:rule:4 IN=ens160 OUT= MAC=<MAC_ADDRESS> SRC=<SRC_HOST> DST=<DST_HOST> LEN=1078 TOS=0x00 PREC=0x00 TTL=64 ID=31105 DF PROTO=UDP SPT=57564 DPT=31377 LEN=1058 UID=55377 GID=10513 
Dec 18 15:15:32 dst-host vmw_conn_notify[17287]: NOTICE: vmw_netfilter_event_handler: vmw_conn_netfilter: got netfilter event: bread: 1024

Here you can see that the transmitted packets are 1058 bytes long, while the vmw_conn_netfilter.c buffer truncates them to 1024 bytes, and this is as far as they get for as long as they are truncated.

A caveat is that I have seen cases where incoming packets don't hit this rule, presumably because they weren't treated as new. I am not entirely sure in which cases this occurs, but then you would see no rule hits in the trace, with filter:INPUT:rule:4 replaced by filter:INPUT:rule:7 (or whichever rule number is one higher than the number of rules in the filter table's INPUT chain; 7 in the case above). In that case, the code in vmw_conn_netfilter.c isn't called. Restarting netcat let me reproduce the issue again, but I cannot really explain why I have seen both cases.
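
If you end up in that state, one way to get back to the NEW path (a sketch, assuming the conntrack tool used earlier is installed) is to delete the existing entry for the test flow before sending the next datagram:

# remove the tracked entry so the next datagram is classified as NEW again
conntrack -D -p udp --dport 31377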

svijayvargiy commented 3 years ago

Sorry for the delayed response. The code only reads 1024 bytes for verification; it does not truncate the packets. Since it is easily reproducible at your site, could you increase the buffer size to 4096 (in vmw_conn_netfilter.c) and check whether the issue is still reproducible?

svijayvargiy commented 3 years ago

Hi Fredrik,

I have identified a fix for this issue. Please let me know how to deliver the fix. Also, please let me know the following

  1. output of rpm -qa | grep -i vmw
  2. cat /etc/redhat-release

Regards, Shirish

fdevibe commented 3 years ago

@svijayvargiy, what kind of fix is it? I would presume updating the code is the most sensible way; then we can compile a new version and test it, and we'll get the updated packages with the next release.

$ rpm -qa | grep -i vmw
vmw-glx-2.3.2.0-16239797.x86_64
Guest-Introspection-for-VMware-NSX-1.1.0.0-16239797.centos.x86_64

As stated in the issue description, CentOS Linux release 7.9.2009 (Core) is the redhat-release.

For the record, I have tried increasing the buffer size. The result is really slow: some of the more network-intensive jobs take more than 30x the time of the same job on a swarm that doesn't have these tools installed. I cannot yet say for sure that this is the cause, but it leads me to suspect that piping all the network traffic through this function is the wrong thing to do.
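
For a rough number of your own, one way to compare, assuming iperf3 can be installed in the test containers from the earlier reproduction, is something like:

# in the test_server container:
iperf3 -s
# in the test_client container; -u selects UDP, -b 0 removes the rate limit:
iperf3 -c test_server -u -b 0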

svijayvargiy commented 3 years ago

The fix is to define VNET_BUFSIZE as follows (file: vmw_conn_netfilter.c):

#define VNET_BUFSIZE 65536
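
For anyone applying this by hand, a minimal sketch, assuming the macro sits as a plain #define at the start of a line in the vmw_conn_netfilter.c you build from:

# bump the buffer size and confirm the change before rebuilding
sed -i 's/^#define VNET_BUFSIZE .*/#define VNET_BUFSIZE 65536/' vmw_conn_netfilter.c
grep '^#define VNET_BUFSIZE' vmw_conn_netfilter.c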

fdevibe commented 3 years ago

@svijayvargiy I suspect this is only part of the solution. As stated above, the running time of network-intensive tasks increases to more than 30x when the introspection tools are running (I have now tested without them). Is the intention really that all new UDP packets should enter vmw_netfilter_event_handler()? In the case of overlay networks, this includes all network traffic between nodes, seemingly bogging that traffic down completely. Are UDP packets in general considered "netfilter events"? This seems very strange; what is the actual intention of this function?

To elaborate, the iptables rule

   65 41506 NFQUEUE    udp  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW mark match ! 0x1/0x1 /* AppDefense_Iptable_rules */ NFQUEUE num 0 bypass

seems to force essentially all new UDP traffic into this function. UDP packets are often smaller than 1024 bytes, but sometimes they are not, and on these overlay networks they fairly frequently exceed this threshold. Increasing the buffer size will of course help with that, but it still forces a lot of traffic through here, resulting in a prohibitively slow system. At the very least, I presume making it possible to circumvent this rule, e.g. with a whitelist of hosts whose traffic doesn't go down this route, might be an option, but whether that is a good enough solution depends on the actual intention of the rule and this function. So, do you have any input on this?

svijayvargiy commented 3 years ago

We are only interested in the first packet of a new UDP source and destination pair. Please see whether the following steps solve the performance issue (along with the VNET_BUFSIZE fix mentioned in my previous comment):

  1. /etc/init.d/vmw_glxd stop
  2. make a backup of /usr/sbin/appdef_funcs
  3. make the following change (replace OLD with NEW) in /usr/sbin/appdef_funcs; a rendered example of the new rule is sketched after these steps

OLD:

    rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -p udp -j NFQUEUE \ --queue-num 0 --queue-bypass -m conntrack --ctstate NEW -m mark ! --mark $mark/$mark \ -m comment --comment ${IPTABLE_COMMENT}"

NEW:

    rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -p udp -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 -j NFQUEUE \ --queue-num 0 --queue-bypass -m conntrack --ctstate NEW -m mark ! --mark $mark/$mark \ -m comment --comment ${IPTABLE_COMMENT}"

  4. /etc/init.d/vmw_glxd start
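
For reference, a sketch of what the NEW line expands to once the variables are filled in (assuming $mark is 0x1 and ${IPTABLE_COMMENT} is AppDefense_Iptable_rules, as in the rule listing earlier in this thread):

iptables -I INPUT 1 -p udp \
  -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 \
  -j NFQUEUE --queue-num 0 --queue-bypass \
  -m conntrack --ctstate NEW -m mark ! --mark 0x1/0x1 \
  -m comment --comment AppDefense_Iptable_rules

# after restarting vmw_glxd, confirm the connbytes match is in place:
iptables -L INPUT -v -n | grep connbytes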
fdevibe commented 3 years ago

@svijayvargiy, thanks, that did actually help quite a bit. I'm now down from >30x to about 4x. This is still too slow, but that might just be the cost of running AppDefense? I also tried circumventing this machinery for all Docker traffic using the following patch:

--- /usr/sbin/appdef_funcs  2021-01-08 20:19:53.983634679 +0100
+++ /usr/sbin/appdef_funcs  2021-01-08 20:43:46.065636937 +0100
@@ -103,6 +103,14 @@
 rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I OUTPUT 1 -o lo -p udp  \
 -j ACCEPT -m comment --comment ${IPTABLE_COMMENT}"

+if [ -f /etc/gi-whitelist.conf ]; then
+    while read LINE; do
+        [[ -z "$LINE" || "${LINE::1}" == "#" ]] && continue
+        IFS="  *" read PROTO ADDRESS PORT <<< "$LINE"
+        rule_ipt[${#rule_ipt[*]}]="${IPTABLES} -I INPUT 1 -s $ADDRESS -p $PROTO --dport $PORT \
+        -j ACCEPT -m comment --comment ${IPTABLE_COMMENT}_whitelisted"
+    done < /etc/gi-whitelist.conf
+fi

 #Skip ip6tables rule if IPv6 is not enabled
 # Please note that enable/disable of IPv6 requires system reboot so

and then ignoring Docker Swarm traffic:

$ cat /etc/gi-whitelist.conf 
# Ignore packets from the following hosts to the following ports
# PROTOCOL SOURCE_ADDRESS DESTINATION_PORT
udp 192.168.1.0/24 4789
udp 192.168.1.0/24 7946
tcp 192.168.1.0/24 7946
tcp 192.168.1.0/24 2377
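
For clarity, the first line of that file expands to roughly the following rule (a sketch, again assuming ${IPTABLE_COMMENT} is AppDefense_Iptable_rules):

iptables -I INPUT 1 -s 192.168.1.0/24 -p udp --dport 4789 \
  -j ACCEPT -m comment --comment AppDefense_Iptable_rules_whitelisted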

This seems to speed things up slightly more, but for some of the tasks, only disabling AppDefense altogether makes them run significantly faster.

I am not sure exactly what you are looking for in vmw_netfilter_event_handler(). If 2 bytes is sufficient, I guess -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 might be the correct solution for this bug. It still seems too slow in practice, but that doesn't appear directly related to this issue. I really think you should provide some sort of whitelisting functionality (probably a bit more extensive than the above), as having /usr/sbin/appdef_funcs constantly rearranging iptables rules is terrible behaviour.

svijayvargiy commented 3 years ago

Thanks Fredrik for trying out the steps. We only need the first packet. With the rule fragment "--connbytes-mode packets --connbytes-dir both --connbytes 1:2", we capture only the first two packets (packets, not bytes, since connbytes-mode is packets).

svijayvargiy commented 3 years ago

Also, please apply the same change (-p udp -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 -j NFQUEUE) to the OUTPUT iptables UDP rules as well.

fdevibe commented 3 years ago

@svijayvargiy thank you! I tested this yesterday, and the execution time is now down to the same as without AppDefense running. As such, adding -m connbytes --connbytes-mode packets --connbytes-dir both --connbytes 1:2 to the UDP rules for new, unmarked packets, along with increasing VNET_BUFSIZE, seems to solve the issue.

Setting the buffer to 65535 seems a bit pessimistic to me, though, as even jumbo frames are generally limited to 9000 bytes, and if significant traffic enters this function, you might end up wasting quite a bit of memory. However, I don't know whether this would be an issue in practice.

fdevibe commented 3 years ago

@svijayvargiy do you have any update on fixing this? We still have this patched locally, but I assume it will fail again once somebody adds another server or re-installs anything.

svijayvargiy commented 3 years ago

Hi Fredrik, I will patch the fix and let you know.

Regards, Shirish
