networkservicemesh / cmd-forwarder-vpp

Apache License 2.0
Interface is not restored after restarting the forwarder #1047

Open glazychev-art opened 10 months ago

glazychev-art commented 10 months ago

Steps to reproduce

  1. Create kind cluster with 2 nodes
  2. Deploy spire
  3. Deploy basic NSM setup
  4. Deploy 2 NSEs on the same node: with IPv4 and IPv6 CIDRs
  5. Deploy 16 NSCs with the liveness checker disabled. Each NSC connects to both NSEs.
  6. Delete the nsmgr on the node that hosts the NSEs. After each restart, check connectivity.
  7. Repeat until you catch a case where the ping does not work.
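
The restart-and-check loop from steps 6 and 7 could be automated roughly as in the Python sketch below. It is only an illustration: the `nsm-system` namespace, the `app=nsmgr` label, the `default` namespace for NSCs, and all helper names are assumptions, not details of this setup. It also assumes `ping` is available inside the NSC containers.

```python
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], int]  # runs a command, returns its exit code


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def restart_nsmgr_cmd(node: str, namespace: str = "nsm-system") -> list:
    # Assumption: nsmgr pods carry the label app=nsmgr in this namespace.
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", "app=nsmgr",
            "--field-selector", f"spec.nodeName={node}"]


def ping_cmd(nsc_pod: str, target_ip: str, namespace: str = "default") -> list:
    # Assumption: the NSC image ships a ping that accepts IPv4 and IPv6 literals.
    return ["kubectl", "exec", "-n", namespace, nsc_pod, "--",
            "ping", "-c", "1", target_ip]


def find_failing_pings(node: str,
                       checks: Sequence[tuple],
                       iterations: int,
                       run: Runner = run_command,
                       settle: Callable[[], None] = lambda: None) -> Optional[tuple]:
    """Restart nsmgr up to `iterations` times.

    `checks` is a list of (nsc_pod, target_ip) pairs. Returns
    (iteration, failed_checks) for the first iteration in which at least
    one ping fails, or None if every ping succeeded every time.
    """
    for iteration in range(1, iterations + 1):
        run(restart_nsmgr_cmd(node))
        settle()  # e.g. wait until the new nsmgr is ready and healing is done
        failed = [(pod, ip) for pod, ip in checks if run(ping_cmd(pod, ip)) != 0]
        if failed:
            return iteration, failed
    return None
```

Passing a `settle` callable (for example one that sleeps or polls pod readiness) keeps the waiting policy separate from the loop itself.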

Issue: https://github.com/networkservicemesh/sdk-vpp/issues/802
Issue: https://github.com/networkservicemesh/sdk/issues/1586

glazychev-art commented 10 months ago

This is probably related: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/664

glazychev-art commented 9 months ago

@szvincze Could you please provide additional logs once this problem is reproduced on your side? We found a couple of interesting behaviors, but I don't see them in your logs, so it would be great to have more information. Thank you.

szvincze commented 9 months ago

Hi @glazychev-art, Let me check when we can schedule these tests again. I will come back with the logs as soon as I get them.

glazychev-art commented 9 months ago

@szvincze Thanks! And if possible, please change the logging level from INFO to TRACE for all NSM components

szvincze commented 9 months ago

Hi @glazychev-art,

In this case the issue was reproduced after process restarts in multiple pods. Note that a massive amount of robustness testing had been run before it occurred. The important timestamps are below. The forwarder process in pod forwarder-vpp-ffjzv and the spire-agent process in pod spire-agent-sb8cp were killed around 2024-02-20T06:58:22.385Z.

The affected pods became ready at 2024-02-20T06:58:29.160Z but the traffic only recovered at 2024-02-20T07:01:18.759Z. The problematic connection was between nse-ipv6-6f976dd8df-688hn and nsc-c58b69c55-sfvc2 [100:100::7]:5003.
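
For reference, the gaps between these timestamps can be computed with a short Python sketch (the helper and variable names are illustrative, and the kill time is approximate). It gives roughly 6.8 s from the kill to readiness and roughly 169.6 s (about 2 min 50 s) from readiness to traffic recovery.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"  # UTC timestamps with millisecond precision


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end`."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


killed = "2024-02-20T06:58:22.385Z"     # processes killed (approximate)
ready = "2024-02-20T06:58:29.160Z"      # affected pods became ready
recovered = "2024-02-20T07:01:18.759Z"  # traffic recovered

print(seconds_between(killed, ready))     # time until the pods were ready
print(seconds_between(ready, recovered))  # time traffic stayed broken after readiness
```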

I have attached the logs and this time the logging level was set to TRACE.

glazychev-art commented 9 months ago

Thanks @szvincze. As I see from the logs, you are using the previous NSM release, v1.11.2. I think this problem is already fixed in v1.12.

Is it possible to get logs from v1.12 (for example, from our latest v1.12.1-rc.1)?

szvincze commented 9 months ago

> As I see from the logs, you are using the previous NSM v1.11.2. This problem is already fixed in v1.12 I think.
>
> Is it possible to get logs from v1.12? (for example, from our latest v1.12.1-rc.1)

Hi @glazychev-art, The intention was to test it on the latest RC. So, let me double-check what happened.

szvincze commented 9 months ago

Here I attach the logs for the case I mentioned.

This time an NSMgr pod (nsmgr-vlnnw) was deleted at 2024-02-21T22:37:23, and almost all connections broke immediately; only two traffic instances survived. When the new pod (nsmgr-w2l6d) came up, the connections were restored quite soon. At the start of the next traffic iteration, however, one traffic instance was unable to connect at all during the monitored period, which was longer than 10 minutes. The affected pods were nse-ipv6-7c8cd797b5-p98x5 and nsc-6d5476bfbf-ss2rv.

glazychev-art commented 8 months ago

Problem area

We found a problem that is most likely related to VPP tap interfaces.

To confirm it:

  1. Reproduce the problem
  2. Go to the forwarder-vpp that serves the endpoint involved in the problem (nse-ipv6 for example).
  3. Please run and share the output:
    # vppctl show int
    # vppctl show hardware-interfaces
    # vppctl show errors
    ...
    49367508            virtio-input                    buffer alloc error           error
    ...
    # vppctl show buffers
    Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used  
    default-numa-0         0     0   2304     2048    16808    0       0     16808 
    # vppctl show acl-plugin acl

If you don't see any problems here, try collecting the same information from the forwarder that serves the problematic NSC.
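
To make the two symptoms in the sample output above easier to spot in saved logs, a small Python sketch like the one below could scan the text of `vppctl show buffers` and `vppctl show errors`. It assumes the column layout shown in this comment (VPP versions may print it differently), and the function names are made up for illustration.

```python
def exhausted_buffer_pools(show_buffers_output: str) -> list:
    """Return the names of buffer pools whose 'Avail' column is 0.

    Expected data row layout (as in the sample above):
    name, index, NUMA, size, data size, total, avail, cached, used
    """
    exhausted = []
    for line in show_buffers_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 9-column data row.
        if len(fields) != 9 or not all(f.isdigit() for f in fields[1:]):
            continue
        if int(fields[6]) == 0:
            exhausted.append(fields[0])
    return exhausted


def buffer_alloc_errors(show_errors_output: str) -> dict:
    """Map node name to its 'buffer alloc error' counter.

    Expected row layout: count, node, reason..., severity
    """
    counters = {}
    for line in show_errors_output.splitlines():
        fields = line.split()
        if "buffer alloc error" in line and len(fields) >= 2 and fields[0].isdigit():
            counters[fields[1]] = int(fields[0])
    return counters
```

With the sample output above, the first function would report `default-numa-0` as exhausted and the second would report the `virtio-input` counter.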

denis-tingaikin commented 8 months ago

cc @VitalyGushin