submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

MTU issues seen between an OnPrem cluster vs Public cloud #1774

Closed sridhargaddam closed 2 years ago

sridhargaddam commented 2 years ago

While running e2e tests on OCP clusters where one cluster is on OpenStack (OnPrem) and the other is on a public cloud (AWS), the initial connection went through, but there were errors during data transmission.

      Expected
          <string>: listening on 0.0.0.0:1234 ...
          connect to 10.231.0.98:1234 from 242.0.0.4:46183 (242.0.0.4:46183)

      to contain substring
          <string>: b5539269-607a-4893-bd05-3981b1a7d985

As seen above, the client was able to connect to the service, but the data transmission failed. The Submariner e2e tests validate that the connection is successfully established and also verify that the transmitted data includes the unique UUID from the client, hence the e2e tests are failing.

Setup details:

OpenShift version: 4.10.2
Cloud platform: OnPrem (OSP) vs Public Cloud (AWS)
CNI: OpenShift SDN
Submariner Globalnet Enabled
Submariner version: 0.12.0

Additional notes:

This is clearly an MTU-related issue, and further debugging shows that it is caused by differences between iptables and nftables. The Submariner MTU driver in the route agent programs iptables rules in the mangle table to clamp the TCP MSS to the PMTU. Since the underlying platform runs nftables, the automatic translation layer should ideally translate them into the corresponding nftables rules, but this conversion does not appear to happen properly.
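
For context on what the clamp achieves, here is a minimal sketch of the MSS arithmetic, assuming IPv4 with a 20-byte IP header and a 20-byte TCP header (no options). The function name mss_for_mtu and the 1400-byte path MTU are illustrative only; the actual tunnel MTU in this setup is not stated in the issue.

```shell
# Largest TCP MSS that fits in a given IPv4 path MTU.
# Assumption: minimal headers only (20 bytes IPv4 + 20 bytes TCP).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

mss_for_mtu 1500   # prints 1460 (plain Ethernet path)
mss_for_mtu 1400   # prints 1360 (hypothetical reduced tunnel path)
```

If the SYN packets are not clamped, both ends may negotiate the larger MSS, and full-size data segments can then be dropped on the path even though the handshake itself succeeds, which matches the symptom above.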

iptables rules programmed on the Gateway node (mangle table):

Chain SUBMARINER-POSTROUTING (1 references)
target     prot opt source               destination         
TCPMSS     tcp  --  0.0.0.0/0            0.0.0.0/0            match-set SUBMARINER-LOCALCIDRS src match-set SUBMARINER-REMOTECIDRS dst tcp flags:0x06/0x02 TCPMSS clamp to PMTU
TCPMSS     tcp  --  0.0.0.0/0            0.0.0.0/0            match-set SUBMARINER-REMOTECIDRS src match-set SUBMARINER-LOCALCIDRS dst tcp flags:0x06/0x02 TCPMSS clamp to PMTU

Translated nft rules dumped using the nft binary (nft -a list table mangle) on the same node:

chain SUBMARINER-POSTROUTING {
        meta l4proto tcp # match-set SUBMARINER-LOCALCIDRS src # match-set SUBMARINER-REMOTECIDRS dst tcp flags & (syn|rst) == syn counter packets 114 bytes 6840 tcp option maxseg size set rt mtu
        meta l4proto tcp # match-set SUBMARINER-REMOTECIDRS src # match-set SUBMARINER-LOCALCIDRS dst tcp flags & (syn|rst) == syn counter packets 42 bytes 2520 tcp option maxseg size set rt mtu
}

Looking at the nft rules carefully, the "# match-set SUBMARINER-REMOTECIDRS ..." expressions are commented out because the iptables MTU target/expressions are not supported by the automatic translation. The problem is described further here: https://unix.stackexchange.com/questions/672742/why-mss-clamping-in-iptables-nft-seems-to-take-no-effect-in-nftables
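
A quick way to spot such rules in a dump is to filter for the commented-out match. This is only a sketch: the helper name find_untranslated_matches is made up for illustration, and it flags what nft cannot display natively, which does not by itself prove that the rule is inactive.

```shell
# Print the rule lines of an nft dump (read from stdin) that show a
# commented-out "match-set" expression. Prints nothing if there are none.
find_untranslated_matches() {
    grep -E '#[[:space:]]*match-set[[:space:]]' || true
}

# Demo on two sample lines; only the first one is printed.
find_untranslated_matches <<'EOF'
meta l4proto tcp # match-set SUBMARINER-LOCALCIDRS src # match-set SUBMARINER-REMOTECIDRS dst tcp flags & (syn|rst) == syn counter packets 114 bytes 6840 tcp option maxseg size set rt mtu
tcp flags syn tcp option maxseg size set rt mtu
EOF
```

On a node, the same filter could be fed from the listing command used above, for example: nft -a list table mangle | find_untranslated_matches (requires root).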

sridhargaddam commented 2 years ago

Just for testing, I deleted the iptables rules in the SUBMARINER-POSTROUTING chain of the mangle table and added the following generic nft TCP MSS clamp rule. With that change, the e2e tests run fine.

nft add rule ip mangle POSTROUTING tcp flags syn tcp option maxseg size set rt mtu

sridhargaddam commented 2 years ago

To fix the issue for Globalnet deployments, we require two changes.

  1. Ensure that the SUBMARINER-LOCALCIDRS ipset is updated with the local PodCIDRs, because SNAT (to the globalIP) is performed in the NAT POSTROUTING chain, which is hit after the mangle POSTROUTING chain.
    
    Currently, the IPsets have only GlobalCIDRs:

    [root@asuryanarhos2-d5v94-worker-0-tjkr5 submariner]# ipset list
    Name: SUBMARINER-REMOTECIDRS
    Type: hash:net
    Revision: 6
    Header: family inet hashsize 1024 maxelem 65536
    Size in memory: 440
    References: 2
    Number of entries: 1
    Members:
    242.0.0.0/16

    Name: SUBMARINER-LOCALCIDRS
    Type: hash:net
    Revision: 6
    Header: family inet hashsize 1024 maxelem 65536
    Size in memory: 504
    References: 2
    Number of entries: 2
    Members:
    242.1.0.0/16

    Note: The above change is necessary only for Globalnet deployments.


  2. Identify the implementation used on the node and, if it is nftables, use the nft binary to program the TCP MSS clamp rules.
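
One possible way to identify the implementation, sketched here under the assumption that the version string printed by iptables -V is a good enough signal: iptables 1.8.x prints "(nf_tables)" for the nft backend and "(legacy)" for the legacy one. The helper name iptables_backend is hypothetical and this is not how the route agent currently decides.

```shell
# Classify the iptables backend from the text printed by `iptables -V`.
# Prints "nft", "legacy", or "unknown" (e.g. for pre-1.8 version strings).
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo nft ;;
        *"(legacy)"*)    echo legacy ;;
        *)               echo unknown ;;
    esac
}

# Version string as reported on the affected node in this issue.
iptables_backend 'iptables v1.8.7 (nf_tables)'   # prints nft
```

On a live node this would be called as iptables_backend "$(iptables -V)", and the nft binary would be used for the clamp rule only when the result is nft.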

yboaron commented 2 years ago

Two more observations about the current implementation:

  1. The current implementation enables PLPMTUD for all connections on the node by setting /proc/sys/net/ipv4/tcp_mtu_probing = 2.

    • This might conflict with another component that manages this file (such as MCO in OCP).
    • It forces PLPMTUD for all TCP connections originating on this node.
  2. If ICMP net_unreach messages are blocked by transit routers, tcp_mtu_probing=2 plus the MSS clamping rule configured on the GW node won't help for traffic generated on non-GW nodes.
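
For reference, a small sketch of what the three tcp_mtu_probing values mean according to the kernel's ip-sysctl documentation. The helper name describe_mtu_probing is made up; the snippet only reads the setting and never writes it.

```shell
# Map a tcp_mtu_probing value to its documented meaning.
describe_mtu_probing() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "enabled only after an ICMP black hole is detected" ;;
        2) echo "always enabled" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Read-only check of the current node setting (skipped if the file is absent).
if [ -r /proc/sys/net/ipv4/tcp_mtu_probing ]; then
    describe_mtu_probing "$(cat /proc/sys/net/ipv4/tcp_mtu_probing)"
fi
```

Checking the value before changing it would at least make a conflict with another component that manages this file visible.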

sridhargaddam commented 2 years ago

@mkimuram FYI

mkimuram commented 2 years ago

I haven't caught up completely yet. However, I tested on a clean installation of RHEL 9 beta and confirmed that translation fails for the rule at https://github.com/submariner-io/submariner/blob/devel/pkg/routeagent_driver/handlers/mtu/mtuhandler.go#L79-L94.

# dnf install -y iptables-nft
# iptables-translate -V
iptables-translate v1.8.7 (nf_tables)

# modprobe xt_set

# dnf install -y ipset
# ipset create SUBMARINER-LOCALCIDRS  nethash
# ipset add SUBMARINER-LOCALCIDRS 242.1.0.0/16
# ipset create SUBMARINER-REMOTECIDRS  nethash
# ipset add SUBMARINER-REMOTECIDRS 242.0.0.0/16

# iptables-translate -t mangle -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu 
nft # -t mangle -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu 

On the other hand, the example from https://fossies.org/linux/iptables/extensions/libxt_TCPMSS.txlate seems to work in the same environment:

# iptables-translate -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft add rule ip filter FORWARD tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu

Is it due to failure in translating rules including ipset?

Because,

Note that both libxt_SET.c and libxt_TCPMSS.c seem to exist, but I'm not sure they work correctly when used together. https://git.netfilter.org/iptables/tree/extensions
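
To compare rule variants quickly, the two outputs above can be told apart mechanically. This is a sketch based only on the behaviour shown in this comment, where a failed translation is echoed back behind "nft #"; the helper name translation_ok is hypothetical.

```shell
# Succeed if a line of iptables-translate output is a real nft command,
# fail if the tool only echoed the original arguments back behind "nft #".
translation_ok() {
    case "$1" in
        "nft #"*) return 1 ;;
        "nft "*)  return 0 ;;
        *)        return 1 ;;
    esac
}

# The two outputs captured above: the FORWARD rule translates, the ipset rule does not.
translation_ok 'nft add rule ip filter FORWARD tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu' && echo translated
translation_ok 'nft # -t mangle -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -j TCPMSS --clamp-mss-to-pmtu' || echo "not translated"
```

Running the same rule with and without the -m set matches through this check would show whether the ipset match alone is what defeats the translation.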

mkimuram commented 2 years ago

In my CentOS 8.2 environment, the command below also succeeds, so I guess the rule with --clamp-mss-to-pmtu can be translated even on RHEL 8 (not sure whether this is true for all minor releases, though).

# iptables-translate -V
iptables-translate v1.8.4 (nf_tables)
# iptables-translate -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft add rule ip mangle POSTROUTING tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu

sridhargaddam commented 2 years ago

> Is it due to failure in translating rules including ipset?

Thanks @mkimuram for looking into it and sharing your observations. As you know, we use ipsets in Globalnet as well, so today, in another similar setup, I verified a Globalnet use case that uses ipsets in its rules. The Globalnet egress rules for a Namespace (where we use ipsets) are also shown as commented out on the node [iptables v1.8.7 (nf_tables)] when listed with nft, but the rules are getting hit and work as expected.

Output of nft list ruleset

    chain SM-GN-EGRESS-NS {
        # match-set SM-GN-J3JGDJY5EHAINM244WDEGL64S src mark and 0xc0000 == 0xc0000 counter packets 0 bytes 0 snat to 242.0.255.252
    }

Output from iptables list (you can see the packet count of 4). Our e2e test cases are passing too.

Chain SM-GN-EGRESS-NS (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        4   240 SNAT       all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set SM-GN-J3JGDJY5EHAINM244WDEGL64S src mark match 0xc0000/0xc0000 to:242.0.255.252

In the following link, it is mentioned that "Whatever nft displays back here is only for the display and mustn't be taken as the actual rules." It also mentions the following:

While there's no automatic translation [available](https://manpages.debian.org/nft.8#EXTENSION_HEADER_EXPRESSIONS) currently for -m tcpmss --mss , the feature is available: tcp option maxseg size which can be used either as an expression (the equivalent of the match -m tcpmss --mss) or with set as a statement (the equivalent of the target -j TCPMSS). Below would be the result of such translation (and might be in the future once the translation engine is improved):

yboaron commented 2 years ago

I checked the iptables counters (OCP on OSP environment) with a route-agent image that includes the SUBMARINER-LOCALCIDRS ipset fix [1].

The counter values of the outbound iptables rule grow when I curl an IP address from the remote CIDR range, although match-set in the relevant nft rule is still commented out [2].

Do you think the PMTU rule is being applied?

[1] https://github.com/submariner-io/submariner/pull/1805

[2]

        chain SUBMARINER-POSTROUTING { # handle 21
                meta l4proto tcp # match-set SUBMARINER-LOCALCIDRS src # match-set SUBMARINER-REMOTECIDRS dst tcp flags & (syn|rst) == syn counter packets 74 bytes 4440 tcp option maxseg size set rt mtu # handle 36
                meta l4proto tcp # match-set SUBMARINER-REMOTECIDRS src # match-set SUBMARINER-LOCALCIDRS dst tcp flags & (syn|rst) == syn counter packets 20 bytes 1200 tcp option maxseg size set rt mtu # handle 37
        }
}

sridhargaddam commented 2 years ago

> The counter values of the outbound iptables rule grow when I curl an IP address from the remote CIDR range, although match-set in the relevant nft rule is still commented out [2]. Do you think the PMTU rule is being applied?

When a rule has multiple match expressions and an action, sometimes one of the match expressions might be commented out. The counters are one way for us to know if the rule is getting applied (even partially). Please check if e2e is passing.
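
A minimal sketch of the counter check, assuming the rule line is taken from nft list output in the format shown earlier in this thread. The helper name rule_packets is illustrative.

```shell
# Extract the packet counter from a single nft rule line.
# Prints nothing if the line carries no "counter packets N bytes M" part.
rule_packets() {
    printf '%s\n' "$1" | sed -n 's/.*counter packets \([0-9][0-9]*\) bytes.*/\1/p'
}

# Rule line as dumped in the earlier comment (handle 36).
rule_packets 'meta l4proto tcp # match-set SUBMARINER-LOCALCIDRS src # match-set SUBMARINER-REMOTECIDRS dst tcp flags & (syn|rst) == syn counter packets 74 bytes 4440 tcp option maxseg size set rt mtu # handle 36'   # prints 74
```

Sampling this value before and after a test connection shows whether the rule was hit. A growing counter alone does not confirm that the ipset match is evaluated, which is why the e2e result is still the deciding check.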

skitt commented 2 years ago

This is only partially fixed, reopening.