Closed nerdalert closed 2 years ago
The aforementioned epic is around uninstalling submariner. By "switch to the ipsec driver" I assume you re-ran `subctl join`? If so then that will delete and recreate the `Submariner` resource, which will invoke the uninstall process in the gateway engine, which should include cable driver cleanup.
Hi @tpantelis yeah, switching meant rejoining with a new driver type; tearing down the entire k8s stack would also leave ghost interface(s) behind. Looks like you're all over it. Feel free to close the bug, or keep it open to track with your PR. Thanks.
I feel we should keep this open so that we can ensure we test this use case (after adding the necessary support). Once we find that it's working, we can close it.
@tpantelis We normally want the `subctl join ...` commands to be idempotent. If the user is re-running the command with the same flags/configuration, we should not trigger the cleanup. Hope we are on the same page?
Ideally we shouldn't. `subctl join` does `CreateAnew` with a foreground delete, and I don't see anywhere that it checks whether the old and new `Submariner` differ, so it looks like `subctl join` will always recreate the daemonsets (@skitt). This is regardless of the dataplane cleanup being added.
That’s right, we always re-deploy the operator. The other deployments shouldn’t be affected though, and the new operator will re-parent them.
I'm not sure I follow. By "other deployments", I assume you mean the gateway and route-agent `DaemonSets` et al? `subctl join` always ensures the `Submariner` CR exists, which does `CreateAnew` instead of the usual `CreateOrUpdate`, so every time it will delete and recreate the `Submariner` CR, which in turn will delete and recreate the gateway and route-agent `DaemonSets` et al, regardless of whether or not anything actually changed in the `Submariner` CR. Now with the dataplane cleanup coming, it will also run those `DaemonSets`. Or am I missing something?
Yes, when `subctl join ...` is executed, all the DaemonSets get restarted. So the question is: do we want the uninstallation to be triggered as part of this operation, or should it be triggered ONLY when explicitly specified by the user?
Right, I got things mixed up, and my previous comment was incorrect. The operator doesn’t get redeployed, but everything else does.
I'm not sure there's any way the operator can tell the difference. Besides, I think the bigger issue is that we're re-installing the regular DaemonSets and possibly causing disruption. I think we should have `CreateAnew` compare the new and previous and only recreate if they changed.
In some of the components (like gateway, globalnet etc) we don't perform cleanup when the Pod gets restarted as the Pod might be restarted for various reasons and as much as possible, we want to avoid disrupting the data-path. So, we should find a way to instruct the pod explicitly that a cleanup is necessary. Can we do this?
The plan is to create a `DaemonSet` that passes an `UNINSTALL` env var to the pods.
Cool.
@tpantelis @sridhargaddam so are we thinking this issue will get resolved as part of the uninstall work?
I think it should be, as part of the dataplane cleanup - I'll let @sridhargaddam verify.
@sridhargaddam can you comment on this issue? Are you going to address it as part of the cleanup work?
Yes @nyechiel, I will look into it as part of dataplane cleanup.
Should be covered by the recent datapath cleanup work, but this needs to be tested.
There seems to be a valid bug here. Related Slack thread: https://kubernetes.slack.com/archives/C010RJV694M/p1646061293066379
When switching between the cable-drivers, submariner-operator is scheduling the uninstall and the new Pods in parallel. Steps to reproduce: In Submariner repo
The operator waits for the uninstall `DaemonSet` to be ready, then deletes it, so the pods may linger for a bit after the `Submariner` deletion completes, depending on how long it takes K8s to GC them. I assume this is what you're seeing, or are you actually seeing an issue?
The uninstall pods are running forever. If you look at the top-right side of the image, you can see an age of 4m18s for the uninstall pods and 4m17s for the normal pod. So the operator scheduled the regular pods immediately after scheduling the uninstall pods, without waiting (or honoring the 2-minute max threshold) for them to complete their execution.
Does the uninstall daemonset still exist? If not, then the pods should be cleaned up by K8s. The operator only removes the finalizer after the uninstall daemonsets are deleted. Until then, a new `Submariner` instance cannot be created in order to create new regular daemonsets. I suggest looking at the operator log. This process is working correctly and verified in the E2E test, so I don't know what `make deploy` is doing to cause this strange behavior.
BTW, the uninstall pods don't do anything in the `Running` state, ie the cleanup is done in the initializing phase, which has to complete before the regular `pause` container runs and transitions to the `Running` state. But even so, the pods should eventually terminate and get cleaned up.
> Does the uninstall daemonset still exist?

Yes. The uninstall DaemonSets are never deleted.

> If not, then the pods should be cleaned up by K8s.

Agree, but that's not the case here.

> The operator only removes the finalizer after the uninstall daemonsets are deleted.

I tried to spend some time today looking at the flow, and some observations are:

- removeFinalizer returns success, but in the meantime the Reconcile method is invoked again and the deletion timestamp is still present (`instance.GetDeletionTimestamp().IsZero()` returns false), so the cleanup code gets triggered once again, which spawns the uninstall DaemonSets. The next invocation of Reconcile (which, as I mentioned, happens very frequently) does not see the DeletionTimestamp on the Submariner CR, so it goes ahead and spawns the regular pods.
- I think after we removed the finalizer, it took a few milliseconds for K8s to delete the object, but in the meantime, when the Reconcile method queried the Submariner CR, it found it along with the DeletionTimestamp on it, so it invoked the uninstall once again.

I'm frequently hitting the Docker rate limit while testing this on my host, so I thought of sharing my observations so far.

> Until then a new `Submariner` instance cannot be created in order to create new regular daemonsets. I suggest looking at the operator log. This process is working correctly and verified in the E2E test so I don't know what `make deploy` is doing to cause this strange behavior.

The behavior I described is seen even when we manually run `subctl join ...` on an existing deployment.

> BTW, the uninstall pods don't do anything in the `Running` state, ie the cleanup is done in the initializing phase which has to complete before the regular `pause` container runs and transitions to the `Running` state.

True, but this is causing some race conditions, as the pods are spawned within a few milliseconds of each other.

> But, even so, the pods should eventually terminate and get cleaned up.

This is not happening even if I leave the setup for an hour ;-)
> - removeFinalizer returns success, but in the meantime the Reconcile method is invoked again and the deletion timestamp is still present, so the cleanup code gets triggered once again, spawning the uninstall DaemonSets. The next invocation of Reconcile does not see the DeletionTimestamp on the Submariner CR, so it goes ahead and spawns the regular pods.
> - I think after we removed the finalizer, it took a few milliseconds for K8s to delete the object, but in the meantime, when the Reconcile method queried the Submariner CR, it found it along with the DeletionTimestamp on it, so it invoked the uninstall once again.
OK thanks - that points to the issue. I think we should not invoke `runComponentCleanup` if the finalizer doesn't exist - that should close that timing window. I do have things going on the next couple of days on PTO, but I can try to push a PR.
Submitted https://github.com/submariner-io/submariner-operator/pull/1899.
I verified a couple of scenarios, switching the cable-drivers and also re-running `make deploy` with the same drivers, and can confirm that the issue is fixed. Thanks @tpantelis
What happened:
Looks like the vxlan cable driver is binding the `vxlan-tunnel` interface to the well-known port `4500`, which is used for NAT-Traversal and IKE negotiation when two peers are NATted in an IPsec deployment. The problem this creates is that if a user switches from the vxlan cable driver to the ipsec cable driver without rebooting, the `vxlan-tunnel` interface is still bound by the kernel to 4500, and libreswan is not able to bind to the default 4500/udp port.

`vxlan-tunnel` configuration:

How to reproduce it (as minimally and precisely as possible):
Deploy submariner using the vxlan driver, then switch to the ipsec driver without cleaning up `vxlan-tunnel`.

A workaround is to specify `--nattport=xxx` (e.g. `--nattport=4700`) in the `subctl join`, or to delete the `vxlan-tunnel` interface with `sudo ip link del vxlan-tunnel` to unbind 4500/udp.

@sridhargaddam mentioned there is an epic around a cleanup from deployments, which is awesome. Not sure if this is in the planned cleanup but it's worth flagging. Happy to help with that work. cc/ @anfredette Thanks!