networkservicemesh / cmd-forwarder-vpp

Apache License 2.0

Dataplane part of `forwarder-vpp` leaks #1120

Open denis-tingaikin opened 4 months ago

denis-tingaikin commented 4 months ago


forwarder-vpp_goroutineprofiles_20240527074108.tar.gz forwarder-vpp_memprofiles_20240527074153.tar.gz

TODO

denis-tingaikin commented 4 months ago

logs from fwd.tar.gz

Provided by @szvincze

NikitaSkrynnik commented 4 months ago

Current state

It looks like vpp reuses indices for interfaces, which somehow affects the deletion of tap interfaces on Close. As a result, NSM shows no created tap interfaces, but vpp still has them

Next steps

Total: 35h. The estimate may increase.

denis-tingaikin commented 4 months ago

/cc @szvincze

NikitaSkrynnik commented 3 months ago

Current status

forwarder-vpp has two controlplane problems and one dataplane problem.

Controlplane problems:

  1. The timeout chain element doesn't call Close and doesn't delete vpp interfaces
  2. Even when forwarder-vpp calls Close, some vxlan interfaces are not deleted anyway. Issue: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1129

Dataplane problems:

  1. When forwarder-vpp calls Close, some tap interfaces are not deleted, just switched to state: down. After investigation, I found out that the problem is inside vpp

elajkat commented 2 months ago

Hi, @NikitaSkrynnik! Could you please help our investigation of the Dataplane Problems you mentioned above and pointed to VPP? I loaded the VPP (v23.10-rc0 local build) API with vxlan tunnel creates, tap creates, and ACL xConnect creates in batch for ~24 hours and was not able to reproduce such an issue. I tried to use the same go API calls that are used by cmd-forwarder-vpp. If you have any suggestions on how to reproduce the issue, or what direction could be followed, please don't hesitate to add them here. Thanks in advance!

NikitaSkrynnik commented 2 months ago

Hi, @elajkat! Unfortunately, we didn't have enough time to reproduce the problem on VPP. You can try to reproduce it using NSM and check the VPP that runs inside a forwarder. Here are the steps:

  1. Deploy a basic NSM setup
  2. Start scaling clients and endpoints (repeat at least 10 times):
     2.1. Scale the clients and endpoints up to 20 pods each
     2.2. Wait for 40 seconds
     2.3. Scale the clients and endpoints down to 0
     2.4. Wait for 20 seconds
     2.5. Scale to 20 again
     2.6. ...

After repeated scaling up/down I can observe several tap interfaces with state: DOWN in VPP inside the forwarder. The forwarder issues a request for deletion through govpp, but the interfaces are still there with state: DOWN
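The scaling loop from the steps above can be sketched as a dry-run script. Note the namespace (`ns-scale`) and deployment names (`nsc`, `nse`) are assumptions, not taken from a real NSM setup; `kc` and `pause` only echo the commands here, so the sketch runs instantly — swap in real `kubectl` and `sleep` to run it against a cluster.

```shell
#!/bin/sh
# Dry-run sketch of the scale-up/scale-down reproduction loop.
# "kc" and "pause" are stubs that echo; the namespace and deployment
# names are assumptions -- adjust them to your NSM deployment.
kc() { echo kubectl "$@"; }
pause() { echo sleep "$1"; }

scale_iteration() {
  kc -n ns-scale scale deployment nsc --replicas=20   # 2.1 clients up to 20
  kc -n ns-scale scale deployment nse --replicas=20   #     endpoints up to 20
  pause 40                                            # 2.2 wait 40 seconds
  kc -n ns-scale scale deployment nsc --replicas=0    # 2.3 clients down to 0
  kc -n ns-scale scale deployment nse --replicas=0    #     endpoints down to 0
  pause 20                                            # 2.4 wait 20 seconds
}

for i in $(seq 1 10); do                              # repeat at least 10 times
  scale_iteration
done
```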

elajkat commented 2 months ago

Hi, with the suggested scale-up/scale-down process I got some hanging interfaces, but those are in the up state, e.g.:

=== pod/forwarder-vpp-vpf55 ===
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count     
host-eth0                         1      up     1500/1500/1500/1500 rx packets                671059
                                                                    rx bytes               335612875
                                                                    tx packets                 14067
                                                                    tx bytes                 1526406
                                                                    drops                     657122
                                                                    ip4                       342905
                                                                    ip6                           79
local0                            0     down          0/0/0/0       
tap25                             42     up     1446/1446/1446/1446 rx packets                    12
                                                                    rx bytes                     936
                                                                    drops                         12
                                                                    ip6                           12
vxlan_tunnel14                    41     up     1446/1446/1446/1446 rx packets                    55
                                                                    rx bytes                    3362
                                                                    tx packets                    50
                                                                    tx bytes                    4860
                                                                    drops                          5

Next week I'll check this in detail with VPP API trace, forwarder logs, etc., and try to reproduce this without the forwarder, with pure VPP and API calls.
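Hanging tap/vxlan interfaces can be flagged mechanically in a `show interface` dump like the one above. A minimal sketch (the sample lines below are trimmed from the output above; the filter itself is an assumption about the dump format, where continuation lines with counters start with whitespace):

```shell
#!/bin/sh
# Print name and state of tap/vxlan_tunnel rows from a
# "vppctl show interface" dump read on stdin.
list_leftovers() {
  # Only interface rows have the name in column 1; counter
  # continuation lines start with spaces and don't match.
  awk '$1 ~ /^(tap|vxlan_tunnel)/ { print $1, $3 }'
}

cat <<'EOF' | list_leftovers
host-eth0                         1      up     1500/1500/1500/1500 rx packets                671059
                                                                    rx bytes               335612875
local0                            0     down          0/0/0/0
tap25                             42     up     1446/1446/1446/1446 rx packets                    12
vxlan_tunnel14                    41     up     1446/1446/1446/1446 rx packets                    55
EOF
```

On the sample above this prints `tap25 up` and `vxlan_tunnel14 up`, matching the hanging interfaces elajkat reported.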

edwarnicke commented 2 months ago

You may find the vpp api trace helpful for capturing the sequence of API calls that leads to the issue for handoff to VPP folks for reproduction.

rubasov commented 2 months ago

I also played around with a slightly modified reproduction. I was able to reliably reproduce the leftover tun/vxlan_tunnel interfaces. At this time I only focused on identifying the suggested vpp bug. However, I found no indication in vpp's api trace that we have a vpp bug here. Here's what I did and saw.

The environment:

The reproduction loop:

In the above repro steps it slightly bothered me (due to my weak kubernetes-fu) that we do not wait for the completion of a scale operation (other than sleeping for a hardcoded period) before starting a new one. Therefore I replaced the sleeps with kubectl wait commands (and a final sleep just to make sure), as you can see below:

kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=5
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/nse-kernel

kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=10
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for=condition=Available=True deployment.apps/alpine

kubectl -n ns-kernel2ethernet2kernel scale deployment alpine --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=alpine

kubectl -n ns-kernel2ethernet2kernel scale deployment nse-kernel --replicas=0
kubectl -n ns-kernel2ethernet2kernel wait --timeout=1m --for delete pod --selector=app=nse-kernel

sleep 5

(note: Please correct me if you think these kubectl wait commands are incorrect.)

With this method I could reliably produce leftover interfaces, usually already in the 1st or 2nd iteration of the reproduction loop. There are always the same number of taps as vxlan_tunnels, most of the time with the same suffix (like tap3 and vxlan_tunnel3), but sometimes with different suffixes (like tap0 and vxlan_tunnel2). These interfaces are always up.

Out of 20 runs leaving interfaces around, I have seen 20 in which the vpp api trace contains no indication that anybody ever tried to delete the leftover interfaces:

For tap interfaces I could grep for the tap_delete message with a concrete sw_if_index, e.g.:

$ kubectl exec forwarder-vpp-r2mz8 -n nsm-system -- vppctl api trace dump | egrep --no-group-separator -A1 tap_delete | egrep -w 'sw_if_index: 2'

For vxlan_tunnel interfaces unfortunately the vpp api trace contains only identifiers that are harder to pair with sw_if_index, therefore I only counted the number of vxlan_tunnel creates and deletes - looking for having fewer deletes than creates:

$ kubectl exec forwarder-vpp-2jfxd -n nsm-system -- vppctl api trace dump | agrep -d ^vl_api vxlan_add_del_tunnel | egrep is_add | sort -r | uniq -c

The difference between creates and deletes matched the number of vxlan_tunnel interfaces left around in all 20 test runs.
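The create/delete comparison can be scripted on top of the `uniq -c` output of the pipeline shown above. A sketch (the counts piped in below are made-up sample data, not from a real api trace):

```shell
#!/bin/sh
# Given "uniq -c"-style counts of vxlan_add_del_tunnel messages,
# compute creates minus deletes, i.e. the expected number of
# leftover vxlan_tunnel interfaces.
leftover_count() {
  awk '/is_add: 1/ { add = $1 }    # count of create messages
       /is_add: 0/ { del = $1 }    # count of delete messages
       END { print add - del }'
}

# Made-up sample: 12 creates, 10 deletes -> 2 leftover tunnels.
printf '     12 is_add: 1\n     10 is_add: 0\n' | leftover_count
```

A nonzero result means vpp was asked to create more tunnels than it was asked to delete, i.e. the missing deletes never reached vpp.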

In summary I could not confirm the presence of a vpp bug - or at least my reproduction method does not reproduce it.

By the way, the fact that the leftover interfaces always come in tap+vxlan_tunnel pairs also makes me think the cause is more likely to be in nsm than in vpp.

If anybody is interested, I'm happy to share the scripts I used. Let me know if you have other ideas to refine the reproduction steps.

Let me know what you think!

elajkat commented 2 months ago

Thanks @rubasov. I actually used a setup similar to @rubasov's, and I am able to run the same with some well-placed kubectl waits and an extra sleep 10 after scale up (alpine 40, nse 5) and after scale down (alpine 0, nse 0), and I have no leftover interfaces.

denis-tingaikin commented 2 months ago

Hello @rubasov, thank you for testing it.

We're planning to continue working on this issue soon. As I recall, the main issue manifests only as leaked vpp interfaces with state up. At this moment, it would be helpful if you shared the scripts, so we could verify the steps to reproduce.

UPD: meant state: DOWN

rubasov commented 2 months ago

Hi @denis-tingaikin,

I made some modifications and now I believe I have a better reproduction - hopefully reproducing every bug symptom Nikita described. I'm attaching the scripts I used.

My earlier approach (as in my previous comment) always stopped after the first leftover interface was detected (no matter if it was up or down). However, these interfaces were always up, and after a few minutes they got deleted, therefore it's not likely they could contribute to a long-term memory leak. The current approach does not stop at the first leftover interface, but continues scaling the nsc deployment up and down. Usually after many more iterations (up to 100, sometimes 150) this approach produces leftover interfaces both up and down (many up, a few down), out of which a few remain indefinitely (the down interfaces always remain).
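The stop condition of this improved approach (keep iterating until a tap interface is stuck in down) can be sketched as follows. This is a dry-run: `has_down_tap` is a stub with canned output, and against a real cluster it would wrap something like `kubectl exec <forwarder-pod> -n nsm-system -- vppctl show interface` (pod name, namespace, and the placeholder scaling step are all assumptions):

```shell
#!/bin/sh
# Sketch of the "scale until a tap stays in down" loop.
i=0
has_down_tap() {
  # Stub: pretend the leak shows up on the 3rd iteration.
  # Real version: dump "vppctl show interface" from the forwarder pod.
  [ "$i" -ge 3 ] && echo 'tap3   7   down'
}

while :; do
  i=$((i + 1))
  echo "iteration $i: scale nsc up, wait, scale nsc down, wait"   # placeholder for real scaling
  leak=$(has_down_tap | awk '$1 ~ /^tap/ && $3 == "down"')
  if [ -n "$leak" ]; then
    echo "leftover down tap found: $leak"
    break
  fi
done
```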

With this complex reproduction I'm far from having a proper understanding of what's happening, however the following factors seem to be relevant:

Regarding the scripts. Please read them before running.

First run:

./setup-1.sh ; ./setup-2.sh ; ./scale.sh 3

setup-1.sh creates a kind cluster. I used it without having other kind clusters around. scale.sh's parameter is the iteration count. When no parameter is passed it continues until there's a leftover interface in down. setup-2.sh depends on https://github.com/rubasov/deployments-k8s/commit/54f2d2de5fbacefac2980cff8e5f3f4a7c27468b

Later runs:

./teardown-2.sh ; ./setup-2.sh ; ./scale.sh | tee scale.sh.log

nsm-scale-iface-leftover.tar.gz

NikitaSkrynnik commented 2 months ago

Hi @rubasov, @elajkat. I collected the traces for the issue with state: DOWN interfaces and added a small description: traces.zip. It took five up/down scalings to get two tap interfaces with state: DOWN.

Here are the scripts I use: iface-leaks.zip. Just run ./full-setup.sh and then ./scale.sh

rubasov commented 1 month ago

Hi @denis-tingaikin, @NikitaSkrynnik,

While @ljkiraly is out of office we're trying to track the status of this interface leak. We have seen https://github.com/networkservicemesh/sdk/pull/1650 merged. The pull request description says it should fix all interface leaks. However Nikita's comment in https://github.com/networkservicemesh/sdk/issues/1649 reads to me as if the merged patch was only part of a planned fix. We also built a new forwarder with the sdk patch merged and tested it with the reproduction method we discussed earlier. Unfortunately we still see leftover interfaces.

Can you please help us understand what exactly the merged patch fixes? Did we maybe miss another patch we should have used for a full fix?

NikitaSkrynnik commented 1 month ago

Hi, @rubasov. We found out that we need two more patches to fix all interface leaks:

  1. https://github.com/networkservicemesh/sdk/pull/1656
  2. https://github.com/networkservicemesh/sdk/pull/1655

denis-tingaikin commented 1 month ago

@rubasov Many thanks for testing it. Currently, we are also looking for other solutions. It would be perfect if you could apply these two patches https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/1120#issuecomment-2285725537 (note: requires an SDK with commit 301631365421a9d916a5ab7224fa9fbd847320b2) and test them.

rubasov commented 1 month ago

Hi @NikitaSkrynnik, @denis-tingaikin,

I did some more testing and wanted to get back to you with the results:

I started with sdk commit 78bedfb4 (which already includes commit 30163136 "Add a timeout for Closes in begin.Server (#1650)"). Then cherry-picked all commits of https://github.com/networkservicemesh/sdk/pull/1656 and https://github.com/networkservicemesh/sdk/pull/1655 on top.

I had to resolve one merge conflict between https://github.com/networkservicemesh/sdk/pull/1656/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R45 and https://github.com/networkservicemesh/sdk/pull/1655/files#diff-963d51540e41ec0a05680ad81a593e4195fb8092eceb003e3f53778ec502fb74R46.

Then also noticed that the s/closeTimeout/contextTimeout/ rename had to be applied here too: https://github.com/networkservicemesh/sdk/pull/1655/files#diff-d03d72993c11636268d2d3264a65f9888845d7e1f052b40a1aa1c2c503b10814R91

Using this sdk I built a new forwarder and ran the usual reproduction steps. Unfortunately I still saw leftover interfaces as before.

However seeing these conflicts between PR 1656 and 1655 I'm wondering if I'm really testing with the same code as you are. It feels like I'm still missing something.

Let me know what you think.

NikitaSkrynnik commented 1 month ago

@rubasov Could you try this forwarder image: nikitaxored/cmd-forwarder-vpp:no-leaks? It contains all the patches. Could you also send the script you use to reproduce the problem?

rubasov commented 1 month ago

Hi @NikitaSkrynnik,

Thanks for the image! I started the repro script (which is still basically the same as I posted earlier) using your forwarder image. It seems to be working. Earlier, without the fixes, the maximum iteration count I ever saw needed for an interface left in down was between 160 and 170. With the fixes included, my repro is at iteration 670, and still no interface left in down. So I believe this at least significantly reduced the frequency, or hopefully fully fixed the leak.

I'm interested in how this image was built differently from my last attempt. If you have the time, please share!

Thanks, Bence

NikitaSkrynnik commented 1 month ago

@rubasov, we merged all the fixes into sdk main two days ago; you can check them there. I just used them to build the forwarder image.

rubasov commented 1 month ago

Hi @NikitaSkrynnik,

I wanted to get back after we have done some longer test runs on the main branch with all your fixes merged. We still occasionally see some leftover interfaces, but it usually takes 500-1000 nsc scale-up/down iterations (with the usual repro scripts) until the error happens once. I believe this is 1-1.5 orders of magnitude better than without the fixes. At this moment we hope this is good enough even for long-lived deployments. Of course, if we learn anything new, we'll get back to you.

Thank you for your help! Bence

denis-tingaikin commented 3 weeks ago

Statistics from rc2.

Interfaces:
2024-09-05 14:13:03 - Node: control-plane-control-seliics06748-n1 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:04 - Node: control-plane-control-seliics06749-n2 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:05 - Node: control-plane-control-seliics06750-n3 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:07 - Node: pool1-ceph-seliics06744-n7 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:08 - Node: pool1-ceph-seliics06745-n4 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:09 - Node: pool1-ceph-seliics06746-n5 | number of ^tap interface: 0 | number of ^vxlan_tunnel interface: 0 |
2024-09-05 14:13:10 - Node: pool1-ceph-seliics06747-n6 | number of ^tap interface: 3 | number of ^vxlan_tunnel interface: 0 |
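
Per-node statistics in the format above could be generated by a small counter over each forwarder's `show interface` dump. A sketch (the node name and the sample dump piped in below are made up; only the output format follows the log lines above):

```shell
#!/bin/sh
# Count ^tap and ^vxlan_tunnel interfaces in a "show interface" dump
# read on stdin and print one stats line per node ($1 = node name).
count_ifaces() {
  node="$1"
  awk -v node="$node" '
    $1 ~ /^tap/          { taps++ }   # leftover tap rows
    $1 ~ /^vxlan_tunnel/ { vx++ }     # leftover vxlan rows
    END { printf "Node: %s | number of ^tap interface: %d | number of ^vxlan_tunnel interface: %d |\n", node, taps, vx }'
}

# Made-up sample dump: two taps, one vxlan tunnel.
printf 'local0 0 down\ntap3 7 down\ntap9 8 down\nvxlan_tunnel2 9 up\n' \
  | count_ifaces pool1-ceph-sample-n7
```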
NikitaSkrynnik commented 2 weeks ago

We plan to revert the begin chain element (https://github.com/networkservicemesh/sdk/issues/1668) when a new Netlink version is released: https://github.com/vishvananda/netlink/issues/1019