Interface is not restored after restarting the forwarder

glazychev-art commented 10 months ago

Steps to reproduce

Create kind cluster with 2 nodes
Deploy spire
Deploy basic NSM setup
Deploy 2 NSEs on the same node: with IPv4 and IPv6 CIDRs
Deploy 16 NSCs with the liveness checker disabled. Each one connects to both NSEs.
Delete the nsmgr located on the NSEs' node. After each restart, check connectivity.
Catch a case where the ping does not work.
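
The restart-and-check loop in the steps above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this report: the `nsm-system` namespace, the `app=nsmgr` label, the node name, and the (namespace, pod, address) targets are all assumptions to be replaced with values from the real cluster, and it assumes the NSC image ships a `ping` binary that accepts both IPv4 and IPv6 addresses.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[List[str]], int]


def run_cmd(argv: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(argv, check=False).returncode


def restart_nsmgr_and_ping(
    node: str,
    targets: Sequence[Tuple[str, str, str]],  # (NSC namespace, NSC pod, NSE address)
    run: Runner = run_cmd,
    nsm_namespace: str = "nsm-system",  # assumption: namespace of the NSM components
    nsmgr_label: str = "app=nsmgr",     # assumption: label carried by nsmgr pods
) -> List[Tuple[str, str]]:
    """Delete the nsmgr pod on `node`, wait for a Ready pod, then ping every target.

    Returns the (pod, address) pairs whose ping exited with a non-zero code.
    """
    selector = ["-n", nsm_namespace, "-l", nsmgr_label,
                "--field-selector", f"spec.nodeName={node}"]
    run(["kubectl", "delete", "pod", *selector])
    # This may fail if the replacement pod has not been created yet;
    # a real script would retry here.
    run(["kubectl", "wait", "--for=condition=Ready", "pod", *selector,
         "--timeout=120s"])

    failed = []
    for namespace, pod, address in targets:
        code = run(["kubectl", "exec", "-n", namespace, pod, "--",
                    "ping", "-c", "4", address])
        if code != 0:
            failed.append((pod, address))
    return failed
```

Calling this in a loop and stopping at the first non-empty result is one way to catch the failing case.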

Issue: https://github.com/networkservicemesh/sdk-vpp/issues/802
Issue: https://github.com/networkservicemesh/sdk/issues/1586

glazychev-art commented 10 months ago

This is probably related: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/664

glazychev-art commented 9 months ago

@szvincze Could you please provide additional logs once this problem is reproduced for you? We found a couple of interesting behaviors, but I don't see them in your logs, so it would be helpful to have more information. Thank you.

szvincze commented 9 months ago

Hi @glazychev-art, Let me check when we can schedule these tests again. I will come back with the logs as soon as I get them.

glazychev-art commented 9 months ago

@szvincze Thanks! And if possible, please change the logging level from INFO to TRACE for all NSM components

szvincze commented 9 months ago

Hi @glazychev-art,

In this case the issue was reproduced after process restarts in multiple pods. Note that a large number of robustness tests had been run before it occurred. Below are the important timestamps. The forwarder process in pod forwarder-vpp-ffjzv and the spire-agent process in pod spire-agent-sb8cp were killed around 2024-02-20T06:58:22.385Z.

The affected pods became ready at 2024-02-20T06:58:29.160Z but the traffic only recovered at 2024-02-20T07:01:18.759Z. The problematic connection was between nse-ipv6-6f976dd8df-688hn and nsc-c58b69c55-sfvc2 [100:100::7]:5003.
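
For reference, the intervals implied by these timestamps can be computed directly. The sketch below uses only the three timestamps quoted in this comment; the variable and function names are illustrative.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp such as 2024-02-20T06:58:22.385Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


killed = parse_ts("2024-02-20T06:58:22.385Z")     # processes killed
ready = parse_ts("2024-02-20T06:58:29.160Z")      # pods reported ready
recovered = parse_ts("2024-02-20T07:01:18.759Z")  # traffic recovered

ready_after_kill = (ready - killed).total_seconds()
traffic_after_ready = (recovered - ready).total_seconds()
traffic_after_kill = (recovered - killed).total_seconds()

print(ready_after_kill)     # 6.775
print(traffic_after_ready)  # 169.599
print(traffic_after_kill)   # 176.374
```

So the pods were ready about 7 seconds after the kill, while traffic stayed down for almost 3 more minutes.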

I have attached the logs and this time the logging level was set to TRACE.

glazychev-art commented 9 months ago

Thanks @szvincze. As I see from the logs, you are using the previous NSM v1.11.2. This problem is already fixed in v1.12, I think.

Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

szvincze commented 9 months ago

Hi @glazychev-art, The intention was to test it on the latest RC. So, let me double-check what happened.

szvincze commented 9 months ago

Here I attach the logs for the case I mentioned.

This time an NSMgr pod (nsmgr-vlnnw) was deleted at 2024-02-21T22:37:23, and almost all connections broke immediately; only two traffic instances were unaffected. When the new pod (nsmgr-w2l6d) came up, the connections were restored quite soon. At the start of the next traffic iteration, one traffic instance was unable to connect at all during the monitored period, which was longer than 10 minutes. The affected pods were nse-ipv6-7c8cd797b5-p98x5 and nsc-6d5476bfbf-ss2rv.

glazychev-art commented 8 months ago

Problem area

We found a problem that is most likely related to VPP tap interfaces.

To prove it:

Reproduce the problem
Go to the forwarder-vpp that serves the endpoint involved in the problem (nse-ipv6 for example).

Please run and share the output:

# vppctl show int

# vppctl show hardware-interfaces

# vppctl show errors
...
49367508            virtio-input                    buffer alloc error           error
...

# vppctl show buffers
Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used  
default-numa-0         0     0   2304     2048    16808    0       0     16808

# vppctl show acl-plugin acl

If you don't see any problems here, try getting similar information from the forwarder that serves the problematic NSC.
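
The sample output above shows `Avail` at 0 for the buffer pool together with `buffer alloc error` counts on `virtio-input`, which appears to be the pattern to look for. As a rough aid, the sketch below parses `vppctl show buffers` output and lists pools with no available or cached buffers. It assumes every data row ends with the same eight integer columns as the sample in this thread; other VPP versions may print a different layout, and the function names are hypothetical.

```python
from typing import Dict, List

# Column order taken from the sample `show buffers` output in this thread.
COLUMNS = ["index", "numa", "size", "data_size", "total", "avail", "cached", "used"]


def parse_show_buffers(output: str) -> List[Dict[str, object]]:
    """Parse the data rows of `vppctl show buffers` into dictionaries."""
    pools = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < len(COLUMNS) + 1:
            continue  # too short to be a pool row
        tail = tokens[-len(COLUMNS):]
        if not all(t.isdigit() for t in tail):
            continue  # header or unrelated line
        row: Dict[str, object] = {"name": " ".join(tokens[:-len(COLUMNS)])}
        row.update(zip(COLUMNS, map(int, tail)))
        pools.append(row)
    return pools


def exhausted_pools(output: str) -> List[str]:
    """Return the names of pools with no available and no cached buffers."""
    return [str(p["name"]) for p in parse_show_buffers(output)
            if p["avail"] == 0 and p["cached"] == 0]


sample = """\
Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used
default-numa-0         0     0   2304     2048    16808    0       0     16808
"""
print(exhausted_pools(sample))  # ['default-numa-0']
```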

denis-tingaikin commented 8 months ago

cc @VitalyGushin

networkservicemesh / cmd-forwarder-vpp

Interface is not restored after restarting the forwarder #1047