denis-tingaikin opened this issue 4 months ago
Provided by @szvincze
It looks like vpp reuses indices for interfaces. This somehow affects the deletion of tap interfaces on Closes. As a result, NSM shows no created tap interfaces, but vpp still has them.
Estimation:
- govpp or vpp - 12h
- govpp doesn't delete some interfaces - 3h
Total: 35h. The estimation may be increased.
/cc @szvincze
forwarder-vpp has two controlplane problems and one dataplane problem.
Controlplane problems:
- The timeout chain element doesn't call Close and doesn't delete vpp interfaces.
- forwarder-vpp calls Close, but some vxlan interfaces are not deleted anyway. Issue: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1129
Dataplane problems:
- forwarder-vpp calls Close, but some tap interfaces are not deleted, just switched to state: down. After investigation I found out that the problem is inside vpp.
Hi @NikitaSkrynnik, could you please help our investigation of the dataplane problems you mentioned above that point to VPP? I loaded the VPP (v23.10-rc0 local build) API with vxlan tunnel creates, tap creates, and ACL xConnect creates in batches for ~24 hours and was not able to reproduce such an issue. I tried to use the same Go API calls that are used by cmd-forwarder-vpp. If you have any suggestions on how to reproduce the issue, or what direction could be followed, please don't hesitate to add them here. Thanks in advance!
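For reference, a batch load along these lines exercises the same create/delete paths. This is only a minimal sketch using vppctl rather than the Go binary API calls I actually used; the addresses, IDs, and iteration count are made up for illustration:
for i in $(seq 1 1000); do
  # create one tap and one vxlan tunnel, then delete both again
  vppctl create tap id "$i"
  vppctl create vxlan tunnel src 192.168.1.1 dst 192.168.1.2 vni "$i"
  vppctl delete tap "tap$i"
  vppctl create vxlan tunnel src 192.168.1.1 dst 192.168.1.2 vni "$i" del
done
# after the loop only the base interfaces (local0, host-eth0, ...) should remain
vppctl show interface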
Hi, @elajkat! Unfortunately we didn't have enough time to reproduce the problem on VPP. You can try to reproduce it using NSM and check the VPP which runs inside a forwarder. Here are the steps:
After repeated scaling up/down I can observe several tap interfaces with state: DOWN in VPP inside the forwarder. The forwarder issues a request for deletion through govpp, but the interfaces are still there with state: DOWN.
Hi, with the suggested scale-up/scale-down process I got some hanging interfaces, but those are in the up state, e.g.:
=== pod/forwarder-vpp-vpf55 ===
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
host-eth0 1 up 1500/1500/1500/1500 rx packets 671059
rx bytes 335612875
tx packets 14067
tx bytes 1526406
drops 657122
ip4 342905
ip6 79
local0 0 down 0/0/0/0
tap25 42 up 1446/1446/1446/1446 rx packets 12
rx bytes 936
drops 12
ip6 12
vxlan_tunnel14 41 up 1446/1446/1446/1446 rx packets 55
rx bytes 3362
tx packets 50
tx bytes 4860
drops 5
Next week I will check this in detail with the VPP API trace, forwarder logs, etc., and try to reproduce this without the forwarder, using pure VPP and API calls.
You may find the vpp api trace helpful for capturing the sequence of API calls that leads to the issue for handoff to VPP folks for reproduction.
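For example (a sketch; the pod name is a placeholder, and api trace save writes the file under /tmp inside the pod, if I remember correctly):
kubectl exec <forwarder-pod> -n nsm-system -- vppctl api trace on
# ...run the reproduction until interfaces leak...
kubectl exec <forwarder-pod> -n nsm-system -- vppctl api trace save leak.api
kubectl exec <forwarder-pod> -n nsm-system -- vppctl api trace dump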
I also played around with a slightly modified reproduction. I was able to reliably reproduce the leftover tun/vxlan_tunnel interfaces. At this time I only focused on identifying the suggested vpp bug. However, I found no indication in vpp's api trace that we have a vpp bug here. Here's what I did and saw.
The environment:
The reproduction loop:
In the above repro steps it slightly bothered me (due to my weak kubernetes-fu) that we do not wait for the completion of a scale operation (other than sleeping for a hardcoded period) before we start a new scale operation. Therefore I replaced the sleeps with kubectl wait commands (and a final sleep just to make sure) as you can see below:
kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=5
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/nse-kernel
kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=10
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/alpine
kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=alpine
kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=nse-kernel
sleep 5
(note: please correct me if you think these kubectl wait commands are incorrect.)
With this method I could reliably produce leftover interfaces, usually already in the 1st or 2nd iteration of the reproduction loop. There are always as many leftover taps as vxlan_tunnels, most of the time with the same suffix (like tap3 and vxlan_tunnel3), but sometimes with different suffixes (like tap0 and vxlan_tunnel2). These interfaces are always up.
In all 20 runs that left interfaces around, the vpp api trace contained no indication that anybody ever tried to delete the leftover interfaces:
For tap interfaces I could grep for the tap_delete message with a concrete sw_if_index, e.g.:
$ kubectl exec forwarder-vpp-r2mz8 -n nsm-system -- vppctl api trace dump | egrep --no-group-separator -A1 tap_delete | egrep -w 'sw_if_index: 2'
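(The same idea can be extended to list every tap sw_if_index that was created but never deleted - a sketch, assuming the dumped messages are named tap_create_v2_reply / tap_delete_v2 and print their sw_if_index field within a few lines, as in the grep above:)
# dump the trace once, then diff created vs. deleted tap sw_if_index values
trace=$(kubectl exec forwarder-vpp-r2mz8 -n nsm-system -- vppctl api trace dump)
created=$(echo "$trace" | grep -A8 tap_create_v2_reply | grep -o 'sw_if_index: [0-9]*' | sort -u)
deleted=$(echo "$trace" | grep -A8 tap_delete_v2 | grep -o 'sw_if_index: [0-9]*' | sort -u)
# lines only present among the creates are taps nobody ever tried to delete
comm -23 <(echo "$created") <(echo "$deleted")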
For vxlan_tunnel interfaces, unfortunately, the vpp api trace contains only identifiers that are harder to pair with sw_if_index, so I only counted the number of vxlan_tunnel creates and deletes, looking for fewer deletes than creates:
$ kubectl exec forwarder-vpp-2jfxd -n nsm-system -- vppctl api trace dump | agrep -d ^vl_api vxlan_add_del_tunnel | egrep is_add | sort -r | uniq -c
The difference between creates and deletes matched the number of vxlan_tunnel interfaces left around in all 20 test runs.
In summary I could not confirm the presence of a vpp bug - or at least my reproduction method does not reproduce it.
By the way, the fact that the leftover interfaces are always present in tap+vxlan_tunnel pairs also makes me think the cause is more likely to be in nsm than in vpp.
If anybody is interested I'm happy to share the scripts I used. Let me know if you have other ideas to refine the reproduction steps.
Let me know what you think!
Thanks @rubasov. I actually used a similar setup to @rubasov's, and I am able to run the same with some well-placed kubectl waits and an extra sleep 10 after scale-up (alpine 40, nse 5) and after scale-down (alpine 0, nse 0), and I have no leftover interfaces.
Hello @rubasov, thank you for testing it.
We're planning to continue working on this issue soon. As I recall, the main issue relates only to leaked vpp interfaces with state up. At this moment, it could be helpful if you shared the scripts, so that we could verify the steps to reproduce.
UPD: meant state: DOWN
Hi @denis-tingaikin,
I made some modifications and now I believe I have a better reproduction - hopefully reproducing every bug symptom Nikita described. I'm attaching the scripts I used.
My earlier approach (as in my previous comment) always stopped after the first leftover interface was detected (no matter if it was up or down). However, these interfaces were always up, and after a few minutes they got deleted, so it's not likely they could contribute to a long-term memory leak. The current approach does not stop at the first leftover interface, but continues scaling the nsc deployment up and down. Usually after many more iterations (up to 100, sometimes 150) this approach produces leftover interfaces both up and down (many up, a few down), out of which a few remain indefinitely (the down interfaces always remain).
With this complex reproduction I'm far from having a proper understanding of what's happening, however the following factors seem to be relevant:
Regarding the scripts. Please read them before running.
First run:
./setup-1.sh ; ./setup-2.sh ; ./scale.sh 3
setup-1.sh creates a kind cluster; I used it without having other kind clusters around. scale.sh's parameter is the iteration count; when no parameter is passed, it continues until there's a leftover interface in down (a sketch of its loop follows below). setup-2.sh depends on https://github.com/rubasov/deployments-k8s/commit/54f2d2de5fbacefac2980cff8e5f3f4a7c27468b
Later runs:
./teardown-2.sh ; ./setup-2.sh ; ./scale.sh | tee scale.sh.log
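In case the attachment gets lost, here is a minimal sketch of what scale.sh's main loop does. The nsm-system namespace and the app=forwarder-vpp label selector are assumptions based on my cluster; the real script is in the attachment:
i=0
while :; do
  i=$((i+1)); echo "iteration $i"
  # one scale-up/scale-down cycle, as in the kubectl wait variant above
  kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=10
  kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/alpine
  kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=0
  kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=alpine
  sleep 5
  # stop once a tap or vxlan_tunnel interface is stuck in state down
  fwd=$(kubectl -n nsm-system get pods -l app=forwarder-vpp -o name | head -n1)
  kubectl -n nsm-system exec "$fwd" -- vppctl show interface \
    | awk '$1 ~ /^(tap|vxlan_tunnel)/ && $3 == "down"' | grep -q . && break
done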
Hi @rubasov, @elajkat. I collected the traces for the issue with state: DOWN interfaces. Also added a small description: traces.zip. It took five up/down scalings to get two tap interfaces with state: DOWN.
Here are the scripts I use: iface-leaks.zip. Just run ./full-setup.sh and then ./scale.sh.
Hi @denis-tingaikin, @NikitaSkrynnik,
While @ljkiraly is out of office we're trying to track the status of this interface leak. We have seen https://github.com/networkservicemesh/sdk/pull/1650 merged. The pull request description says it should fix all interface leaks. However Nikita's comment in https://github.com/networkservicemesh/sdk/issues/1649 reads to me as if the merged patch was only part of a planned fix. We also built a new forwarder with the sdk patch merged and tested it with the reproduction method we discussed earlier. Unfortunately we still see leftover interfaces.
Can you please help us understand what exactly the merged patch fixes? Did we maybe miss another patch we should have used for a full fix?
Hi, @rubasov. We found out that we need two more patches to fix all interface leaks:
@rubasov Many thanks for testing it. Currently, we are also looking for other solutions; it would be perfect if you could add these two patches https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1120#issuecomment-2285725537 (note: requires an SDK with commit 301631365421a9d916a5ab7224fa9fbd847320b2) and test it.
Hi @NikitaSkrynnik, @denis-tingaikin,
I did some more testing and wanted to get back to you with the results:
I started with sdk commit 78bedfb4 (which already includes commit 30163136 "Add a timeout for Closes in begin.Server (#1650)"). Then cherry-picked all commits of https://github.com/networkservicemesh/sdk/pull/1656 and https://github.com/networkservicemesh/sdk/pull/1655 on top.
I had to resolve one merge conflict between https://github.com/networkservicemesh/sdk/pull/1656/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R45 and https://github.com/networkservicemesh/sdk/pull/1655/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R46.
Then I also noticed that the s/closeTimeout/contextTimeout/ rename had to be applied here too: https://github.com/networkservicemesh/sdk/pull/1655/files#diff-d03d72993c11636268d2d3264a65f9888845d7e1f052b40a1aa1c2c503b10814R91
Using this sdk I built a new forwarder and ran the usual reproduction steps. Unfortunately I still saw leftover interfaces as before.
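For completeness, the build itself was nothing special - roughly the following, with the local paths and image tag being placeholders:
# assume ./sdk is the patched checkout, next to ./cmd-forwarder-vpp
cd cmd-forwarder-vpp
go mod edit -replace github.com/networkservicemesh/sdk=../sdk
go mod tidy
docker build -t <registry>/cmd-forwarder-vpp:leak-test .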
However seeing these conflicts between PR 1656 and 1655 I'm wondering if I'm really testing with the same code as you are. It feels like I'm still missing something.
Let me know what you think.
@rubasov Could you try this forwarder image: nikitaxored/cmd-forwarder-vpp:no-leaks? It contains all the patches. Could you also send the script you use to reproduce the problem?
Hi @NikitaSkrynnik,
Thanks for the image! I started the repro script (which is still basically the same as I posted earlier) using your forwarder image. It seems to be working. Earlier, without the fixes, I believe the maximum iteration count I ever saw needed for an interface left in down was between 160 and 170. With the fixes included, my repro is at iteration 670 and still no interface left in down. So I believe this at least significantly reduced the frequency, or hopefully fully fixed it.
I'm interested in how this image was built differently from my last attempt. If you have the time, please share!
Thanks, Bence
@rubasov, we merged all the fixes into sdk main two days ago; you can check them there. I just used them to build the forwarder image.
Hi @NikitaSkrynnik,
I wanted to get back after we had done some longer test runs using the main branch with all your fixes merged. We still occasionally see some leftover interfaces, but it usually takes 500-1000 nsc scale-up/down iterations (with the usual repro scripts) until the error happens once. I believe this is 1-1.5 orders of magnitude better than without the fixes. At this moment we hope this is good enough even for long-lived deployments. Of course, if we learn anything new, we'll get back to you.
Thank you for your help! Bence
Statistics from rc2.
Interfaces:
2024-09-05 14:13:03 - Node: control-plane-control-seliics06748-n1 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:04 - Node: control-plane-control-seliics06749-n2 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:05 - Node: control-plane-control-seliics06750-n3 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:07 - Node: pool1-ceph-seliics06744-n7 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:08 - Node: pool1-ceph-seliics06745-n4 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:09 - Node: pool1-ceph-seliics06746-n5 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:10 - Node: pool1-ceph-seliics06747-n6 | number of ^tap interface: 3 | number of ^vxlan_tunnel interface: 0 |
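(For reference, per-node counts like the above can be collected with something along these lines - a sketch; the nsm-system namespace and the app=forwarder-vpp label selector are assumptions:)
for pod in $(kubectl -n nsm-system get pods -l app=forwarder-vpp -o name); do
  node=$(kubectl -n nsm-system get "$pod" -o jsonpath='{.spec.nodeName}')
  taps=$(kubectl -n nsm-system exec "$pod" -- vppctl show interface | grep -c '^tap')
  vxlans=$(kubectl -n nsm-system exec "$pod" -- vppctl show interface | grep -c '^vxlan_tunnel')
  echo "$(date '+%Y-%m-%d %H:%M:%S') - Node: $node | number of ^tap interface: $taps | number of ^vxlan_tunnel interface: $vxlans |"
done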
We plan to revert the begin chain element https://github.com/networkservicemesh/sdk/issues/1668 when Netlink is released: https://github.com/vishvananda/netlink/issues/1019
forwarder-vpp_goroutineprofiles_20240527074108.tar.gz forwarder-vpp_memprofiles_20240527074153.tar.gz
TODO