Closed glazychev-art closed 1 year ago
We have used the _artgl/cmd-forwarder-vpp:vppc3f505fe7b7f image in our tests. The AF_PACKET related issue described in VPP-2081 occurred again but this time right after deployment. Forwarder restart has not improved the situation but made it worse since more forwarders started suffering from the same problem.
Do you have SELinux enabled? Could you try to disable it and test if the issue appears again?
@szvincze I think we might need to check other versions of forwarder-vpp. It could help to detect problems.
v1.9.0
v1.8.0
v1.7.1
v1.6.2
If you have a chance, please check the problem with these versions ☝️☝️☝️
@szvincze @ljkiraly I have a few questions:
I'm not sure that SLES is fully supported in VPP - https://s3-docs.fd.io/vpp/23.10/aboutvpp/supported.html Is it possible to check the problem on openSUSE?
I have rebuilt the forwarder-vpp with af-packet v3 - artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_af_packet_v3
Please take a look
Can you please tell if the problems are observed on the NSC interfaces? How many clients (approximately) are required?
UPDATE:
Can the issue be reproduced on other OSes (e.g. Ubuntu?)
Could the problem be related to the data size? Maybe there is a problem with MTU...
Additional questions:
Could you please share your configurations? How do you run multiple forwarders? Are they on the same node?
Could you share your environment? Is SLES a node or container operating system?
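The environment details asked about above (OS, kernel, SELinux state, MTU) could be gathered with something like the sketch below. It is only an illustration: the function name `env_report` is hypothetical, and it assumes a Linux host with `/sys/class/net` available. `getenforce` is used only if it is installed.

```shell
#!/bin/sh
# env_report: hypothetical helper that prints basic environment facts.
env_report() {
    # Kernel release, e.g. to compare against known-affected kernels.
    echo "kernel: $(uname -r)"

    # OS name from os-release, if the file is readable.
    os=""
    if [ -r /etc/os-release ]; then
        os=$(sed -n 's/^PRETTY_NAME=//p' /etc/os-release | tr -d '"')
    fi
    echo "os: ${os:-unknown}"

    # SELinux mode, only when the getenforce tool is present.
    if command -v getenforce >/dev/null 2>&1; then
        echo "selinux: $(getenforce)"
    else
        echo "selinux: getenforce not found"
    fi

    # MTU of every network interface visible in this namespace.
    for dev in /sys/class/net/*; do
        if [ -r "$dev/mtu" ]; then
            echo "mtu ${dev##*/}: $(cat "$dev/mtu")"
        fi
    done
    return 0
}
```

Running `env_report` once on the node and once inside the forwarder container would show whether SLES is the node OS, the container OS, or both.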
Hello, SELinux is disabled on the impacted cluster. Various versions cannot be easily tested due to limitations on hardware and manpower (packaging a given version is not necessarily trivial, and the testing cycle is roughly one build per day, sometimes two).
@glazychev-art
Best Regards, Gábor Barna
Hello,
The issue was reproduced on a relatively new kernel: 5.14.21-150400.24.63-default. Search for 5.14.21-150400.24.63 on the page.
Best Regards,
Gábor Barna
Thanks @robaganrab !
If you have a chance, could you also check this forwarder image: artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3
Also, when you see the problem, please attach the command output from vpp: show hardware-interfaces
Hello, @robaganrab , @szvincze
I'd like to suggest three ways to diagnose the problem.
2.1. vppctl show int
2.2. vppctl show hardware
2.3. vppctl trace add af-packet-input 1000 (do it once)
2.4. vppctl show trace (after reproducing)
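Steps 2.1 to 2.4 could be wrapped in a small script so the output is captured the same way on every run. This is a rough sketch, not an official tool: the function names `arm_trace` and `collect_diag`, the `VPPCTL` override variable, and the output file names are all hypothetical. Only the `vppctl` commands themselves come from the list above.

```shell
#!/bin/sh
# VPPCTL (hypothetical override) lets the caller prefix the command,
# e.g. with a "kubectl exec ... --" wrapper; it defaults to plain vppctl.
VPPCTL="${VPPCTL:-vppctl}"

# Run one vppctl command; word splitting of $VPPCTL is intentional
# so that a multi-word prefix works.
vpp() {
    $VPPCTL "$@"
}

# Step 2.3: arm the af-packet-input trace once, before reproducing.
arm_trace() {
    vpp trace add af-packet-input 1000
}

# Steps 2.1, 2.2 and 2.4: dump state after the problem was reproduced.
collect_diag() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    vpp show int      > "$out_dir/show_int.txt"
    vpp show hardware > "$out_dir/show_hardware.txt"
    vpp show trace    > "$out_dir/show_trace.txt"
}
```

A typical sequence would be: call `arm_trace`, reproduce the problem, then call `collect_diag` with a target directory and attach the three files to the issue.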
Hello,
We managed to get everything built and set up for a test run with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3. The test will include the vppctl commands and hopefully will provide useful output.
Best Regards,
Gábor Barna
Hello @denis-tingaikin and @glazychev-art, Test results with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 are quite good: the AF_PACKET issue wasn't seen during the test run. What is the way forward? Which govpp is the base of this image? Or is the sha c3f505fe7b7f a vpp commit ID? We would like to build a SUSE based image from this version of vpp/govpp. Regards, Laszlo
Hello @ljkiraly
That's good news!
We used this patch: https://github.com/networkservicemesh/govpp/pull/11
It's based on this commit from vpp: https://gerrit.fd.io/r/gitweb?p=vpp.git;a=commit;h=c3f505fe7b7fbecb35494863a6c9de3cad6e6d2d
@denis-tingaikin, @glazychev-art: There were more attempts with the image that Artem provided and the issue did not occur at all. So, it seems to be fine.
However, as @ljkiraly mentioned above, there was a test with a newly built image based on SUSE and the issue happened immediately. We will give it another try with an openSUSE based image as well, to see whether SUSE is the culprit or openSUSE also suffers from the same thing. We will keep you informed.
@szvincze Got it, thanks!
Did you use this PR to build your image - https://github.com/networkservicemesh/govpp/pull/11?
Now there are two main suspects:
Therefore, it would be very cool to see vppctl show hardware-interfaces output from your forwarder (on failed tests)
@glazychev-art: Yes, we used that PR for building the image. For the next trial I requested the output you mentioned.
The fact is that for the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 image, I locally removed calico-patches from this PR https://github.com/networkservicemesh/govpp/pull/11, and also removed the tap support (as Ed said and Denis described here)
Hi @glazychev-art,
By calico patches you mean: patch/0004-capo-Calico-Policies-plugin.patch?
Can you elaborate on how the tap support was removed? This would help us to build an openSUSE/SUSE based image similar to yours.
BR/Laszlo
@ljkiraly I mean these patches: https://github.com/networkservicemesh/govpp/blob/5482a9ac8fba3ab4cfc26b2142c67b8da0d671b8/patch/patch.sh#L15-L22 There is 1 cherry_pick and 5 files.
By the way, I've prepared new PRs today (they contain the latest main vpp version) and removed the calico-patches in one of them: https://github.com/networkservicemesh/govpp/pull/12 https://github.com/networkservicemesh/govpp/pull/13 You can use them.
To disable tap interfaces just delete these lines: https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/client.go#L35-L37 and https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/server.go#L35-L37
I think we can close the issue, because we merged all PRs. The discussion was moved to https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/927
https://github.com/networkservicemesh/govpp/blob/main/Dockerfile#L1
TODO
git.fd.io/govpp.git
to the latest release (https://github.com/FDio/govpp). This can help with unstable calico-vpp tests. (Calico-vpp updates govpp regularly)