glazychev-art opened this issue 10 months ago
This is probably related: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/664
@szvincze Could you please provide additional logs once this problem is reproduced on your side? We found a couple of interesting behaviors, but we don't see them in your logs, so it would be great to have more information. Thank you!
Hi @glazychev-art, Let me check when we can schedule these tests again. I will come back with the logs as soon as I get them.
@szvincze Thanks! And if possible, please change the logging level from INFO to TRACE for all NSM components
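For reference, a minimal sketch of how the level can be switched with kubectl, assuming the components read the NSM_LOG_LEVEL environment variable and are deployed as daemonsets in the nsm-system namespace (both are assumptions, please adjust to your deployment):
kubectl -n nsm-system set env daemonset/forwarder-vpp NSM_LOG_LEVEL=TRACE
kubectl -n nsm-system set env daemonset/nsmgr NSM_LOG_LEVEL=TRACE
The pods are restarted by the rolling update after this change, so please reproduce the problem once they are ready again.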
Hi @glazychev-art,
In this case the issue was reproduced after process restarts in multiple pods. Note that a massive amount of robustness tests had been run before it occurred. Below I provide the important timestamps. The forwarder process in pod forwarder-vpp-ffjzv and the spire-agent process in pod spire-agent-sb8cp were killed around 2024-02-20T06:58:22.385Z.
The affected pods became ready at 2024-02-20T06:58:29.160Z, but the traffic only recovered at 2024-02-20T07:01:18.759Z. The problematic connection was between nse-ipv6-6f976dd8df-688hn and nsc-c58b69c55-sfvc2 [100:100::7]:5003.
I have attached the logs and this time the logging level was set to TRACE.
Thanks @szvincze, As I see from the logs, you are using the previous NSM release, v1.11.2. I think this problem is already fixed in v1.12.
Is it possible to get logs from v1.12 (for example, from our latest v1.12.1-rc.1)?
Hi @glazychev-art, The intention was to test it on the latest RC. So, let me double-check what happened.
Here I attach the logs for the case I mentioned.
This time an NSMgr pod (nsmgr-vlnnw) was deleted at 2024-02-21T22:37:23, and almost all connections were broken immediately, except for two traffic instances. When the new pod (nsmgr-w2l6d) came up, the connections were restored fairly quickly. However, at the start of the next traffic iteration one traffic instance was not able to connect at all during the monitored period, which was longer than 10 minutes. The affected pods were nse-ipv6-7c8cd797b5-p98x5 and nsc-6d5476bfbf-ss2rv.
We found a problem that is most likely related to VPP tap interfaces.
To confirm it, please collect the following from the affected forwarder (a kubectl sketch for running these commands is shown below the list):
# vppctl show int
# vppctl show hardware-interfaces
# vppctl show errors
...
  49367508        virtio-input        buffer alloc error        error
...
Here the buffer alloc error counter on virtio-input is very high.
# vppctl show buffers
Pool Name         Index  NUMA  Size  Data Size  Total  Avail  Cached   Used
default-numa-0        0     0  2304       2048  16808      0       0  16808
And here the default buffer pool is fully used (Avail is 0), i.e. VPP has run out of buffers.
# vppctl show acl-plugin acl
If you don't see any problems there, try getting the same information from the forwarder that serves the problematic NSC.
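In case it is easier, here is a sketch of how to collect this from outside the pod, assuming the forwarder runs in the nsm-system namespace and the vppctl binary is available in its container (the pod name below is the one from your earlier reproduction, so substitute the current one):
kubectl -n nsm-system exec forwarder-vpp-ffjzv -- vppctl show int
kubectl -n nsm-system exec forwarder-vpp-ffjzv -- vppctl show hardware-interfaces
kubectl -n nsm-system exec forwarder-vpp-ffjzv -- vppctl show errors
kubectl -n nsm-system exec forwarder-vpp-ffjzv -- vppctl show buffers
kubectl -n nsm-system exec forwarder-vpp-ffjzv -- vppctl show acl-plugin acl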
cc @VitalyGushin
Steps to reproduce
Issue: https://github.com/networkservicemesh/sdk-vpp/issues/802
Issue: https://github.com/networkservicemesh/sdk/issues/1586