sridhargaddam closed 2 years ago
Just for testing, I deleted the iptables rules in the SUBMARINER-POSTROUTING chain of the mangle table and added the following generic nft TCP MSS clamp rule; with that change, the e2e tests run fine.
nft add rule ip mangle POSTROUTING tcp flags syn tcp option maxseg size set rt mtu
To fix the issue for Globalnet deployments, we require two changes.
Currently, the ipsets contain only the GlobalCIDRs:
[root@asuryanarhos2-d5v94-worker-0-tjkr5 submariner]# ipset list
Name: SUBMARINER-REMOTECIDRS
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 440
References: 2
Number of entries: 1
Members:
242.0.0.0/16
Name: SUBMARINER-LOCALCIDRS
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 504
References: 2
Number of entries: 2
Members:
242.1.0.0/16
Note: The above change is necessary only for Globalnet deployments.
2. Identify the implementation used on the node and, if it's nftables, use the nft binary to program the tcp-mss-clamp rules.
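A minimal sketch of the detection step in item 2, assuming iptables 1.8 or newer, which appends "(nf_tables)" or "(legacy)" to the version string (as in the `iptables-translate -V` output quoted in this thread). The function name `iptables_backend` is hypothetical and is not part of Submariner.

```shell
#!/bin/sh
# Classify the iptables backend from the text printed by `iptables -V`.
# Assumption: iptables >= 1.8 reports "(nf_tables)" or "(legacy)" in that text.
# iptables_backend is a hypothetical helper name used only for illustration.
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo "nft" ;;
        *"(legacy)"*)    echo "legacy" ;;
        *)               echo "unknown" ;;
    esac
}

# Possible usage on a node (needs the iptables and nft binaries and root):
#   if [ "$(iptables_backend "$(iptables -V)")" = "nft" ]; then
#       nft add rule ip mangle POSTROUTING tcp flags syn tcp option maxseg size set rt mtu
#   fi
```

Older iptables releases print no backend suffix, so the sketch reports "unknown" for them rather than guessing.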
Two more observations about the current implementation:
The current implementation enables PLPMTUD (packetization layer path MTU discovery) for all connections on the node by setting /proc/sys/net/ipv4/tcp_mtu_probing = 2.
If ICMP net_unreach messages are blocked by transit routers, tcp_mtu_probing=2 plus the MSS clamping rule configured on the GW node won't help for traffic generated on non-GW nodes.
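For reference, a small sketch that spells out what each tcp_mtu_probing value means according to the kernel's ip-sysctl documentation. The helper name `describe_tcp_mtu_probing` is hypothetical.

```shell
#!/bin/sh
# Map a tcp_mtu_probing value to its meaning (per the kernel ip-sysctl docs).
# describe_tcp_mtu_probing is a hypothetical helper name used for illustration.
describe_tcp_mtu_probing() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "enabled when an ICMP black hole is detected" ;;
        2) echo "always enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Possible usage on a node:
#   describe_tcp_mtu_probing "$(cat /proc/sys/net/ipv4/tcp_mtu_probing)"
```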
@mkimuram FYI
I haven't caught up completely yet. However, I tested on a clean installation of RHEL 9 beta and confirmed that translation fails for the rule at https://github.com/submariner-io/submariner/blob/devel/pkg/routeagent_driver/handlers/mtu/mtuhandler.go#L79-L94.
# dnf install -y iptables-nft
# iptables-translate -V
iptables-translate v1.8.7 (nf_tables)
# modprobe xt_set
# dnf install -y ipset
# ipset create SUBMARINER-LOCALCIDRS nethash
# ipset add SUBMARINER-LOCALCIDRS 242.1.0.0/16
# ipset create SUBMARINER-REMOTECIDRS nethash
# ipset add SUBMARINER-REMOTECIDRS 242.0.0.0/16
# iptables-translate -t mangle -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft # -t mangle -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
On the other hand, the example from https://fossies.org/linux/iptables/extensions/libxt_TCPMSS.txlate seems to work in the same environment:
# iptables-translate -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft add rule ip filter FORWARD tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu
Is it due to a failure in translating rules that include ipset matches?
I ask because:
# iptables-translate -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft # -A POSTROUTING -m set --match-set SUBMARINER-LOCALCIDRS src -m set --match-set SUBMARINER-REMOTECIDRS dst -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# iptables-translate -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft add rule ip mangle POSTROUTING tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu
Note that both libxt_SET.c and libxt_TCPMSS.c seem to exist, but I'm not sure they work correctly when used together. https://git.netfilter.org/iptables/tree/extensions
In my CentOS 8.2 environment, the command below also succeeds, so I guess the rule with --clamp-mss-to-pmtu could be translated even in RHEL 8 (not sure if that is true for all minor releases, though).
# iptables-translate -V
iptables-translate v1.8.4 (nf_tables)
# iptables-translate -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
nft add rule ip mangle POSTROUTING tcp flags & (syn|rst) == syn counter tcp option maxseg size set rt mtu
Is it due to failure in translating rules including ipset?
Thanks @mkimuram for looking into it and sharing your observations. As you know, we use ipsets in Globalnet as well, so today, in another similar setup, I verified the Globalnet use case that uses ipsets in the rules. I see that the Globalnet egress rules for Namespace (where we use ipsets) are also shown commented out on the node [iptables v1.8.7 (nf_tables)] when listed with nft, but the rules are getting hit and working as expected.
Output of nft list ruleset
chain SM-GN-EGRESS-NS {
# match-set SM-GN-J3JGDJY5EHAINM244WDEGL64S src mark and 0xc0000 == 0xc0000 counter packets 0 bytes 0 snat to 242.0.255.252
}
Output from iptables list (you can see the packet count of 4). Our e2e test cases are passing too.
Chain SM-GN-EGRESS-NS (1 references)
num pkts bytes target prot opt in out source destination
1 4 240 SNAT all -- * * 0.0.0.0/0 0.0.0.0/0 match-set SM-GN-J3JGDJY5EHAINM244WDEGL64S src mark match 0xc0000/0xc0000 to:242.0.255.252
In the following link, it is mentioned that "Whatever nft displays back here is only for the display and mustn't be taken as the actual rules". Along with this, it also mentions the following:
While there's no automatic translation [available](https://manpages.debian.org/nft.8#EXTENSION_HEADER_EXPRESSIONS) currently for -m tcpmss --mss , the feature is available: tcp option maxseg size which can be used either as an expression (the equivalent of the match -m tcpmss --mss) or with set as a statement (the equivalent of the target -j TCPMSS). Below would be the result of such translation (and might be in the future once the translation engine is improved):
I checked the iptables counters (OCP on OSP env) with a route-agent image that includes the SUBMARINER-LOCALCIDRS ipset fix [1].
The counter values of the outbound iptables rule seem to grow when I curl an IP address from the remote CIDR range, even though match-set in the relevant nft rule is still commented out [2].
Do you think the pmtu rule is being applied?
[1] https://github.com/submariner-io/submariner/pull/1805
[2]
chain SUBMARINER-POSTROUTING { # handle 21
meta l4proto tcp # match-set SUBMARINER-LOCALCIDRS src # match-set SUBMARINER-REMOTECIDRS dst tcp flags & (syn|rst) == syn counter packets 74 bytes 4440 tcp option maxseg size set rt mtu # handle 36
meta l4proto tcp # match-set SUBMARINER-REMOTECIDRS src # match-set SUBMARINER-LOCALCIDRS dst tcp flags & (syn|rst) == syn counter packets 20 bytes 1200 tcp option maxseg size set rt mtu # handle 37
}
}
The counter values of the outbound iptables rule seem to grow when I curl an IP address from the remote CIDR range, even though match-set in the relevant nft rule is still commented out [2]. Do you think the pmtu rule is being applied?
When a rule has multiple match expressions and an action, sometimes one of the match expressions might be commented out. The counters are one way for us to know if the rule is getting applied (even partially). Please check if e2e is passing.
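A rough sketch of the counter check described above: it pulls the packet count out of one rule line as printed by nft, so two readings taken before and after a test connection can be compared. The helper name `nft_rule_packets` is hypothetical, and the sketch assumes the rule carries a `counter packets N bytes M` expression like the rules quoted in this thread.

```shell
#!/bin/sh
# Print the packet counter found in one line of nft rule listing output.
# Prints nothing when the line has no "counter packets N bytes M" expression.
# nft_rule_packets is a hypothetical helper name used for illustration.
nft_rule_packets() {
    printf '%s\n' "$1" | sed -n 's/.*counter packets \([0-9][0-9]*\) bytes.*/\1/p'
}

# Possible usage: read the value, run curl against a remote address,
# read it again, and check whether the count increased.
```

A growing count only shows that packets reached the rule; it does not by itself prove that every match expression in the rule was evaluated.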
This is only partially fixed, reopening.
While running e2e tests in an OCP setup where one of the clusters is on OpenStack (OnPrem) and the other is on a public cloud (AWS), it was seen that the initial connection goes through, but there are errors during data transmission.
As can be seen above, the client was able to connect to the service but the data transmission failed. Submariner e2e tests validate that the connection is successfully established and also verify that the data transmitted includes the unique UUID from the client, hence the e2e tests are failing.
Setup details:
OpenShift version: 4.10.2
Cloud platform: OnPrem (OSP) vs Public Cloud (AWS)
CNI: OpenShift SDN
Submariner Globalnet Enabled
Submariner version: 0.12.0
Additional notes:
So it is clear that it's an MTU-related issue, and on debugging it further, it turns out that it's because of iptables vs nft differences. The Submariner MTU driver in route-agent programs iptables rules in the mangle table to clamp the TCP MSS to the PMTU. Since the underlying platform is running nftables, the automatic translation layer should ideally translate them to the corresponding nftables rules, but it appears that this conversion is not happening properly.
IPTable rules programmed on the Gateway node (mangle table):
Translated nft rules dumped using the nft binary (nft -a list table mangle) on the same node:
If we look at the nft rules carefully, the ....# match-set SUBMARINER-REMOTECIDRS... is commented out, as the iptables mtu target/expressions are not supported by the automatic translation. The problem is described further in this issue: https://unix.stackexchange.com/questions/672742/why-mss-clamping-in-iptables-nft-seems-to-take-no-effect-in-nftables
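To spot such rules quickly, here is a small sketch that reads nft listing text on stdin and prints the names of the ipsets whose match is displayed as a `# match-set ...` comment. The helper name `commented_set_names` is hypothetical, and the sketch relies on `grep -o`, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Read nft listing text on stdin and print each ipset name that appears
# in a "# match-set NAME ..." comment, one per line, without duplicates.
# commented_set_names is a hypothetical helper name used for illustration.
commented_set_names() {
    grep -o '# match-set [^ ]*' | awk '{print $3}' | sort -u
}

# Possible usage on a node (needs the nft binary and root):
#   nft -a list table mangle | commented_set_names
```

This only flags rules for inspection; as noted elsewhere in this thread, how nft displays a rule is not proof that the match is inactive.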