kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] NatGateway ceases working after being restarted #4193

Open · SkalaNetworks opened this issue 1 month ago

SkalaNetworks commented 1 month ago

Kube-OVN Version

v1.13.0 and v1.12.x

Kubernetes Version

v1.28.3 on k0s, with kube-proxy in IPVS mode (iptables seemed to make kube-ovn not work correctly)

Operation-system/Kernel Version

Debian GNU/Linux 12 (bookworm) 6.1.0-18-amd64

Description

The NatGateway ceases to ingress any traffic after being restarted. Pods using the NatGateway lose external connectivity.

Steps To Reproduce

  1. Deploy the following YAML on a cluster with kube-ovn and Multus installed:

    ---
    kind: Vpc
    apiVersion: kubeovn.io/v1
    metadata:
      name: test-vpc-2
    spec:
      staticRoutes:
      - cidr: 0.0.0.0/0
        nextHopIP: 10.0.1.70
        policy: policyDst
    ---
    kind: Subnet
    apiVersion: kubeovn.io/v1
    metadata:
      name: net2
    spec:
      vpc: test-vpc-2
      cidrBlock: 10.0.1.0/24
      protocol: IPv4
    ---
    kind: VpcNatGateway
    apiVersion: kubeovn.io/v1
    metadata:
      name: gw1
    spec:
      vpc: test-vpc-2
      subnet: net2
      lanIp: 10.0.1.70
      selector:
      - "kubernetes.io/os: linux"
      externalSubnets:
      - ovn-vpc-external-network
    ---
    kind: IptablesEIP
    apiVersion: kubeovn.io/v1
    metadata:
      name: eip-static
    spec:
      natGwDp: gw1
    ---
    kind: IptablesSnatRule
    apiVersion: kubeovn.io/v1
    metadata:
      name: snat01
    spec:
      eip: eip-static
      internalCIDR: 10.0.1.0/24
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        ovn.kubernetes.io/logical_switch: net2
      namespace: ns2
      name: vpc1-pod
    spec:
      containers:
      - name: vpc1-pod
        image: docker.io/library/nginx:alpine
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        ovn.kubernetes.io/logical_switch: net2
      namespace: ns2
      name: vpc2-pod
    spec:
      containers:
      - name: vpc2-pod
        image: docker.io/library/nginx:alpine
  2. Ping 1.1.1.1 from the GW (screenshot).

  3. Ping 1.1.1.1 from one of the two pods (screenshot).

  4. Observe (tcpdump) the traffic on the gateway while pinging (screenshot).

  5. Delete the nat-gateway pod, wait for the STS to terminate it, and then wait for the pod to restart. Pinging continues to work while the pod is Terminating (as expected).

  6. Observe the ping stop on the pod that was running it within the VPC

  7. Observe the tcpdump on the gateway (screenshot): no response is received. Pinging 1.1.1.1 from the gateway directly still works.

  8. SSH to the K8S node hosting the gateway pod and run a tcpdump (screenshot).

  9. Wait a variable amount of time (10 minutes? sometimes 1 hour?) and rerun the ping from one of the test pods (screenshot).

  10. Look at the (normal) traffic on the node hosting the gateway (screenshot; see the command sketch below).
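
For reference, here is a rough command-level sketch of steps 2 through 10 above. The pod names come from the YAML manifests; the gateway pod name and its kube-system namespace are assumptions based on the usual vpc-nat-gw-<name>-0 StatefulSet naming, and the ssh target is a placeholder.

    # Step 2: ping 1.1.1.1 from inside the NAT gateway pod
    kubectl -n kube-system exec -it vpc-nat-gw-gw1-0 -- ping -c 4 1.1.1.1

    # Step 3: ping 1.1.1.1 from one of the test pods inside the VPC
    kubectl -n ns2 exec -it vpc1-pod -- ping 1.1.1.1

    # Step 4: watch the ICMP traffic on the gateway while the ping is running
    kubectl -n kube-system exec -it vpc-nat-gw-gw1-0 -- tcpdump -ni any icmp

    # Step 5: delete the gateway pod and let the StatefulSet recreate it
    kubectl -n kube-system delete pod vpc-nat-gw-gw1-0

    # Steps 7-8: tcpdump again inside the recreated pod, then on the node hosting it
    kubectl -n kube-system exec -it vpc-nat-gw-gw1-0 -- tcpdump -ni any icmp
    ssh <node-hosting-the-gateway> sudo tcpdump -ni any "icmp and host 1.1.1.1"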

Current Behavior

The NatGateway ceases to work for random amounts of time (sometimes indefinitely) when restarted. Deleting everything (subnet, vpc, pods, gateway...) doesn't always fix the problem.

My tests are done WITHOUT the gateway moving from one node to another on restart (it's pinned to a node)

Expected Behavior

Restarting the pod leads to a connection downtime equal to the downtime of the pod.

SkalaNetworks commented 1 month ago

Has anyone been able to reproduce the issue? I can't pinpoint what's causing it, the behaviour is extremely strange and variable. If anyone has got any clues of where I could look for problems during the ping outages, I'll gladly take it.

bobz965 commented 1 month ago

Does your nat gw pod have the iptables rules after being restarted?

SkalaNetworks commented 1 month ago

Hi @bobz965

These are the iptables rules before the restart of the GW. Note there is one SNAT rule and one floating IP, therefore 3 rules.

[screenshot]

I kill the gateway pod and wait for it to restart; the ping ceases to work.

Here are the rules after:

[screenshot]
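
Since the screenshots don't carry over here, the same information can be pulled in text form with a dump of the NAT table inside the gateway pod (pod name and kube-system namespace are the same assumptions as in the sketch above):

    # list the SNAT / floating-IP rules programmed in the gateway pod
    kubectl -n kube-system exec vpc-nat-gw-gw1-0 -- iptables -t nat -S
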
bobz965 commented 1 month ago

Do you have any other nat gw pod which has nothing (DNAT/SNAT/FIP) configured in it? If so, just delete that pod.

SkalaNetworks commented 3 weeks ago

Hi @bobz965

The entire cluster only has one NAT GW. I checked with crictl ps for "zombie" containers not tracked by K8S, and I don't see any other gateway present.

SkalaNetworks commented 3 weeks ago

I also just checked for zombie processes by running ps aux | grep "sleep 10000" (because that's what the GW is doing all day long), and none came up apart from my single and only NAT-GW.

Right now I have a test cluster where that gateway simply doesn't route any traffic.

Here's the iptables:

[screenshot]

zhangzujian commented 3 weeks ago

iptables seemed to make kube-ovn not work correctly

@SkalaNetworks What's the problem? Could you please provide some more details?

SkalaNetworks commented 3 weeks ago

@zhangzujian I honestly didn't dig further, but when my cluster (k0s) was installed with kube-proxy in iptables mode (it is now in IPVS mode), I couldn't get ANY of the pods to ping each other; the CNI was entirely broken, and I switched just to see if IPVS would work better. That immediately resolved my issues, except for the one I'm writing about right now. I don't know if it might be some kind of symptom.

zhangzujian commented 3 weeks ago

I couldn't get ANY of the pods to ping each other

What address did you use? Did you ping the service ip or pod ip?

SkalaNetworks commented 3 weeks ago

I tried pinging the IPs of each pod in a custom VPC; basically they couldn't reach each other in any direction. I vaguely remember a lot of the kube-ovn components not being ready. I could try switching kube-proxy to iptables again and see if it all breaks; it doesn't cost me much to do, since the NAT gateways are broken anyway.

zhangzujian commented 3 weeks ago

Observe the ping stop on the pod that was running it within the VPC

The problem is related to conntrack:

vpc-nat-gw-gw1-0:/kube-ovn# conntrack -p icmp -L
icmp     1 12 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=63 [UNREPLIED] src=192.168.73.1 dst=10.0.1.2 type=0 code=0 id=63 mark=0 use=1
icmp     1 29 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=64 src=192.168.73.1 dst=172.19.0.11 type=0 code=0 id=36062 mark=0 use=1

Packets of the stopped ping hit the first conntrack entry, which does not do SNAT.

Once you start a new ping, a new conntrack entry with SNAT (the second one above) will be created and work as expected.

There are two possible methods to fix it:

  1. Prevent serving traffic before routes and iptables rules are configured;
  2. Flush conntrack entries without SNAT/DNAT after routes and iptables rules are configured.

@bobz965 What do you think?
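
As a manual stopgap in the spirit of option 2, the stale entry can also be cleared by hand from inside the gateway pod once the routes and iptables rules are back in place. A minimal sketch, using the pod source address from the conntrack output above:

    # list the ICMP entries and spot the stale one ([UNREPLIED], no SNAT)
    conntrack -p icmp -L

    # delete the entries whose original source is the VPC pod, so the next
    # packet creates a fresh conntrack entry that goes through SNAT
    conntrack -D -p icmp -s 10.0.1.2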

SkalaNetworks commented 3 weeks ago

Do you think this might be related to my problem with the NAT-GW?

[screenshot]

zhangzujian commented 3 weeks ago

Do you think this might be related to my problem with the NAT-GW?

[screenshot]

Seems the conntrack entry is performing SNAT. Does the ping still not receive replies?

SkalaNetworks commented 3 weeks ago
[screenshot]

I wish it did. Out of 1892 packets, 27 were sent; the behaviour is extremely erratic. If you've got commands to debug what OVN is doing, I would be glad to run them.

NOTE: Ping works great directly from the GW
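
For what it's worth, kube-ovn ships a kubectl plugin that can follow a packet through the OVN pipeline; a rough example for the test pod above (subcommand syntax may vary between kube-ovn versions):

    # trace an ICMP packet from the test pod towards 1.1.1.1
    kubectl ko trace ns2/vpc1-pod 1.1.1.1 icmp

    # inspect the logical routers and the static routes of the VPC
    kubectl ko nbctl lr-list
    kubectl ko nbctl lr-route-list test-vpc-2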

bobz965 commented 3 weeks ago
[screenshot]

I wish it did. Out of 1892 packets, 27 were sent; the behaviour is extremely erratic. If you've got commands to debug what OVN is doing, I would be glad to run them.

NOTE: Ping works great directly from the GW

How do you know the source packets to 1.1.1.1 are lost in the vpc-nat-gw pod?

SkalaNetworks commented 3 weeks ago

I can see them in the tcpdump on the gw (see before); something is happening between the gateway and the node. I don't know what, whether it's bridge related, OVN related, or iptables related, but the packets coming back from 1.1.1.1 are either:

I doubt the problem can be somewhere else, as pinging directly from the GW works all the time. So I fail to see how it could be a connectivity issue beyond kube-ovn.

bobz965 commented 3 weeks ago

Observe the ping stop on the pod that was running it within the VPC

The problem is related to conntrack:

vpc-nat-gw-gw1-0:/kube-ovn# conntrack -p icmp -L
icmp     1 12 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=63 [UNREPLIED] src=192.168.73.1 dst=10.0.1.2 type=0 code=0 id=63 mark=0 use=1
icmp     1 29 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=64 src=192.168.73.1 dst=172.19.0.11 type=0 code=0 id=36062 mark=0 use=1

Packets of the stopped ping hit the first conntrack entry, which does not do SNAT.

Once you start a new ping, a new conntrack entry with SNAT (the second one above) will be created and work as expected.

There are two possible methods to fix it:

  1. Prevent serving traffic before routes and iptables rules are configured;
  2. Flush conntrack entries without SNAT/DNAT after routes and iptables rules are configured.

@bobz965 What do you think?

[screenshot]

Regarding the image (step 10): I think the SNAT and DNAT of the packets are normal, so SNAT and DNAT in the vpc nat gw pod are normal.

If the DNAT happened, why could you not tcpdump the packets inside the pod? After all, we saw the DNAT and SNAT packets.

bobz965 commented 3 weeks ago

@zhangzujian , I think you are right.

SkalaNetworks commented 3 weeks ago

Observe the ping stop on the pod that was running it within the VPC

The problem is related to conntrack:

vpc-nat-gw-gw1-0:/kube-ovn# conntrack -p icmp -L
icmp     1 12 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=63 [UNREPLIED] src=192.168.73.1 dst=10.0.1.2 type=0 code=0 id=63 mark=0 use=1
icmp     1 29 src=10.0.1.2 dst=192.168.73.1 type=8 code=0 id=64 src=192.168.73.1 dst=172.19.0.11 type=0 code=0 id=36062 mark=0 use=1

Packets of the stopped ping hit the first conntrack entry, which does not do SNAT. Once you start a new ping, a new conntrack entry with SNAT (the second one above) will be created and work as expected. There are two possible methods to fix it:

  1. Prevent serving traffic before routes and iptables rules are configured;
  2. Flush conntrack entries without SNAT/DNAT after routes and iptables rules are configured.

@bobz965 What do you think?

[screenshot]

Regarding the image (step 10): I think the SNAT and DNAT of the packets are normal, so SNAT and DNAT in the vpc nat gw pod are normal.

If the DNAT happened, why could you not tcpdump the packets inside the pod? After all, we saw the DNAT and SNAT packets.

The screenshot you are quoting is from AFTER the SNAT starts working again, which happens after some time for seemingly no specific reason. That is why it is prefixed by this text:

Look at the (normal) traffic on the node hosting the gateway

bobz965 commented 3 weeks ago

OK, the nat gw pod disables its ARP until its routes and EIP are ready.

SkalaNetworks commented 3 weeks ago

The thing is, it sometimes takes several minutes, and sometimes it never starts working again. Could there be faulty logic in the ARP enabling mechanism that makes it deadlock?

Also, what do you mean by "routes ready"? Does it wait for the iptables rules to be appended in the pod and for the EIP to be added on the interface before enabling ARP?

bobz965 commented 3 weeks ago

After the nat-gw pod is deleted and restarted:

"Routes ready" means the default routes on the net1 NIC are configured. Then the EIP is appended to net1, and then ARP is turned on, at which point the net1 EIP is reachable via arping.

The purpose of turning off ARP on net1 at first is to make sure there is no ARP proxying.
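
A rough way to check the state described above from inside the gateway pod: the net1 interface name comes from the comment above; the /proc path and the arping interface/address are generic Linux checks and placeholders, not a statement of how kube-ovn implements the ARP toggle.

    # is ARP currently suppressed on net1? (arp_ignore=8 means "do not reply for any local address")
    kubectl -n kube-system exec vpc-nat-gw-gw1-0 -- cat /proc/sys/net/ipv4/conf/net1/arp_ignore

    # are the default route and the EIP present on net1?
    kubectl -n kube-system exec vpc-nat-gw-gw1-0 -- ip route show default
    kubectl -n kube-system exec vpc-nat-gw-gw1-0 -- ip addr show net1

    # from another machine on the external network, check that the EIP answers ARP
    arping -I <external-interface> <eip-address>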