networkservicemesh / govpp

Update vpp version #9

Closed glazychev-art closed 1 year ago

glazychev-art commented 1 year ago

https://github.com/networkservicemesh/govpp/blob/main/Dockerfile#L1

TODO

  1. Update vpp to the latest main version, because it contains commits that were not merged into the latest release (af_xdp, for example).
  2. The work with memif interfaces (abstract sockets) has been changed. We need to figure out how to configure it now.
  3. It makes sense to update git.fd.io/govpp.git to the latest release (https://github.com/FDio/govpp). This can help with unstable calico-vpp tests. (Calico-vpp updates govpp regularly)
  4. Pass cmd-forwarder-vpp docker-tests
  5. Pass integration-tests
szvincze commented 1 year ago

We have used the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f image in our tests. The AF_PACKET related issue described in VPP-2081 occurred again but this time right after deployment. Forwarder restart has not improved the situation but made it worse since more forwarders started suffering from the same problem.

bellycat77 commented 1 year ago

> We have used the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f image in our tests. The AF_PACKET related issue described in VPP-2081 occurred again but this time right after deployment. Forwarder restart has not improved the situation but made it worse since more forwarders started suffering from the same problem.

Do you have SELinux enabled? Could you try to disable it and test if the issue appears again?

denis-tingaikin commented 1 year ago

@szvincze I think we might need to check other versions of forwarder-vpp. It could help to detect problems.

v1.9.0
v1.8.0
v1.7.1
v1.6.2

If you get a chance, please check whether the problem occurs with these versions ☝️☝️☝️

glazychev-art commented 1 year ago

@szvincze @ljkiraly I have a few questions:

  1. I'm not sure that SLES is fully supported in VPP: https://s3-docs.fd.io/vpp/23.10/aboutvpp/supported.html. Is it possible to check the problem on openSUSE?

  2. I have rebuilt the forwarder-vpp with af-packet v3: artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_af_packet_v3. Please take a look.

  3. Can you please tell if the problems are observed on the NSC interfaces? How many clients (approximately) are required?

UPDATE:

  1. Can the issue be reproduced on other OSes (e.g. Ubuntu?)

  2. Could the problem be related to the data size? Maybe there is a problem with MTU...

glazychev-art commented 1 year ago

Additional questions:

  1. Could you please share your configurations? How do you run multiple forwarders? Are they on the same node?

  2. Could you share your environment? Is SLES a node or container operating system?

robaganrab commented 1 year ago

Hello, SELinux is disabled on the impacted cluster. Various versions cannot be easily tested due to limitations on hardware and manpower (packaging a given version is not necessarily trivial, and the testing cycle is roughly one build per day, sometimes two).

@glazychev-art

  1. Yes, the issue was there with OpenSUSE as well.
  2. We are working on testing this.
  3. I think we have observed it on both the nse and nsc sides. Up to 10 clients.
  4. There was a reproduction with an Ubuntu base a month or so ago.
  5. We need to check the MTU settings. I will let you know what we find.
  6. VPP is deployed as a DaemonSet, with one pod on each eligible k8s node. On top of NSM/VPP we use Meridio, which utilizes the extra networking and creates a complete NSC-NSE cross connect to handle higher level load balancing (e.g. a sticky-session-like setup).
  7. The issue manifests on a "CNIS" cluster. This is a bare metal k8s setup using Ericsson's CCD. SLES is used as the operating system on the bare metal compute nodes, and those are joined into the k8s cluster. By default, the NSM/VPP/Spire components are packaged with SLES as the base Docker image.

Best Regards, Gábor Barna

robaganrab commented 1 year ago

Hello, The issue was reproduced on a relatively new kernel: 5.14.21-150400.24.63-default. Search for 5.14.21-150400.24.63 on the page. Best Regards, Gábor Barna

glazychev-art commented 1 year ago

Thanks @robaganrab !

If you have a chance, could you also check this forwarder image: artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3

Also, when you see the problem, please attach the command output from vpp: show hardware-interfaces

denis-tingaikin commented 1 year ago

Hello, @robaganrab , @szvincze

I'd like to suggest three ways to diagnose the problem.

  1. Test artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3
  2. I think we could also collect some diagnostics (before/after reproducing):
     2.1. vppctl show int
     2.2. vppctl show hardware
     2.3. vppctl trace add af-packet-input 1000 (do it once)
     2.4. vppctl show trace (after reproducing)
  3. Also, as @edwarnicke pointed out, we could try to disable tap and use veth interfaces instead (just cut this and rebuild the forwarder: https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/client.go#L36)
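
The vppctl diagnostics listed above (2.1 to 2.4) could be collected with a small helper along the following lines. This is only a sketch under stated assumptions: the function names (`snapshot`, `arm_trace`, `after_reproduction`, `diff_snapshots`) and the `prefix` parameter are hypothetical and not part of NSM or VPP; only the vppctl commands themselves come from this thread. The `prefix` is whatever is needed to reach vppctl in your environment (for example a `kubectl exec ... --` invocation into the forwarder pod), which is left to the caller.

```python
import difflib
import subprocess
from typing import Callable, Dict, List, Sequence

# Commands taken from the list above; everything else here is illustrative.
SNAPSHOT_COMMANDS: List[List[str]] = [["show", "int"], ["show", "hardware"]]
TRACE_ARM = ["trace", "add", "af-packet-input", "1000"]
TRACE_SHOW = ["show", "trace"]

Runner = Callable[[Sequence[str]], str]


def run_command(argv: Sequence[str]) -> str:
    """Run one command and return stdout plus stderr (no exception on non-zero exit)."""
    result = subprocess.run(list(argv), capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def snapshot(prefix: Sequence[str], runner: Runner = run_command) -> Dict[str, str]:
    """Collect 'show int' and 'show hardware' output, keyed by command."""
    return {" ".join(cmd): runner([*prefix, *cmd]) for cmd in SNAPSHOT_COMMANDS}


def arm_trace(prefix: Sequence[str], runner: Runner = run_command) -> str:
    """Enable af-packet-input tracing; do this once, before reproducing."""
    return runner([*prefix, *TRACE_ARM])


def after_reproduction(prefix: Sequence[str], runner: Runner = run_command) -> Dict[str, str]:
    """Collect the same snapshot again, plus the packet trace."""
    out = snapshot(prefix, runner)
    out[" ".join(TRACE_SHOW)] = runner([*prefix, *TRACE_SHOW])
    return out


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> str:
    """Unified diff of every command present in the 'before' snapshot."""
    lines: List[str] = []
    for name, text in before.items():
        lines.extend(difflib.unified_diff(
            text.splitlines(), after.get(name, "").splitlines(),
            fromfile=f"before: {name}", tofile=f"after: {name}", lineterm=""))
    return "\n".join(lines)
```

A possible usage, assuming vppctl is directly on the PATH: call `before = snapshot(["vppctl"])` and `arm_trace(["vppctl"])` right after deployment, then `after = after_reproduction(["vppctl"])` once the problem appears, and attach `diff_snapshots(before, after)` together with `after["show trace"]` to the issue.
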
robaganrab commented 1 year ago

Hello, We managed to get everything built and set up for a test run with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3. The test will include the vppctl commands and hopefully will provide useful output. Best Regards, Gábor Barna

ljkiraly commented 1 year ago

Hello @denis-tingaikin and @glazychev-art, Test results with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 are quite good: the AF_PACKET issue wasn't seen during the test run. What is the way forward? Which govpp is the base of this image? Or is the SHA c3f505fe7b7f a vpp commit ID? We would like to build a SUSE based image from this version of vpp/govpp. Regards, Laszlo

denis-tingaikin commented 1 year ago

Hello @ljkiraly

That's good news!

We used this patch: https://github.com/networkservicemesh/govpp/pull/11

It's based on this commit from vpp: https://gerrit.fd.io/r/gitweb?p=vpp.git;a=commit;h=c3f505fe7b7fbecb35494863a6c9de3cad6e6d2d

szvincze commented 1 year ago

@denis-tingaikin, @glazychev-art: There were more attempts with the image that Artem provided and the issue did not occur at all. So, it seems to be fine.

However as @ljkiraly mentioned above there was a test with a newly built image based on SUSE and the issue happened immediately. We will give it another try with an openSUSE based image as well to see if SUSE is the culprit or openSUSE is also suffering from the same thing. We will keep you informed.

glazychev-art commented 1 year ago

@szvincze Got it, thanks!

Did you use this PR to build your image - https://github.com/networkservicemesh/govpp/pull/11?

Now there are two main suspects:

  1. Calico-vpp patches
  2. You are using tap interfaces, not af_packet.

Therefore, it would be very cool to see vppctl show hardware-interfaces output from your forwarder (on failed tests)

szvincze commented 1 year ago

@glazychev-art: Yes, we used that PR for building the image. For the next trial I requested the output you mentioned.

glazychev-art commented 1 year ago

The fact is that for artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 image, I locally removed calico-patches from this PR https://github.com/networkservicemesh/govpp/pull/11, and also removed the tap support (as Ed said and Denis described here)

ljkiraly commented 1 year ago

Hi @glazychev-art, By calico patches you mean: patch/0004-capo-Calico-Policies-plugin.patch? Can you elaborate on how the tap support was removed? This would help us to build an openSUSE/SUSE based image similar to yours. BR/Laszlo

glazychev-art commented 1 year ago

@ljkiraly I mean these patches: https://github.com/networkservicemesh/govpp/blob/5482a9ac8fba3ab4cfc26b2142c67b8da0d671b8/patch/patch.sh#L15-L22 There is 1 cherry_pick and 5 files.

By the way, I've prepared new PRs today (they contain the latest main vpp version) and removed the calico-patches in one of them: https://github.com/networkservicemesh/govpp/pull/12 https://github.com/networkservicemesh/govpp/pull/13 You can use them.

To disable tap interfaces just delete these lines: https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/client.go#L35-L37 and https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/server.go#L35-L37

glazychev-art commented 1 year ago

I think we can close the issue, because we merged all PRs. The discussion was moved to https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/927