networkservicemesh / integration-k8s-kind

Apache License 2.0

Calico/VPP NSM integration #325

Open edwarnicke opened 3 years ago

edwarnicke commented 3 years ago

Calico allows for a choice of dataplanes. VPP is one of them.

Normally, cmd-forwarder-vpp starts its own instance of VPP in its own Pod.

In response to a request for integration between NSM and Calico/VPP, the process for integration was described.

This issue is about actually trying (and shaking the bugs out of) such integration.

This breaks down into a number of steps:

edwarnicke commented 3 years ago

This fix is needed for Calico/VPP to work in Kind: https://github.com/projectcalico/vpp-dataplane/pull/204

Even though that PR has not been merged, the docker images have been pushed and these steps should work: https://github.com/projectcalico/vpp-dataplane/pull/204/files#diff-9004d08acd588e7b7e93a8ff6fbe357d4eba3adc003d48ab4b7bed0186af1a11R1

AloysAugustin commented 3 years ago

Hi @edwarnicke, just a note that testing the integration in GKE will be complex at this stage, because we haven't found a way to override the default CNI in GKE, so Calico/VPP doesn't work there for now. Also, why are you making a difference in the last two steps between Calico/VPP owning or not owning the main interface? Calico/VPP has relatively strong assumptions that it owns the main interface (= the interface that has the k8s Node address). Giving Calico/VPP another interface will likely result in a non-functional cluster. As a side note, we are starting to look at giving more than one interface to VPP in a Calico/VPP deployment, but that isn't supported yet.

edwarnicke commented 3 years ago

Also, why are you making a difference in the last two steps between Calico/VPP owning or not owning the main interface? Calico/VPP has relatively strong assumptions that it owns the main interface (= the interface that has the k8s Node address). Giving Calico/VPP another interface will likely result in a non-functional cluster. As a side note, we are starting to look at giving more than one interface to VPP in a Calico/VPP deployment, but that isn't supported yet.

@AloysAugustin You are correct, I should have phrased the last one differently. I was thinking in terms of 'attaches to the interface with vfio' vs 'attaches to the interface with AF_XDP'... the idea being to attach with the highest performance option.

AloysAugustin commented 3 years ago

Ah, sounds good then :+1:

Bolodya1997 commented 3 years ago

Calico uses VPP v21.xx, so we probably need to update the VPP version used in the Forwarder first and then try again: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/284

Bolodya1997 commented 3 years ago

@edwarnicke We are still facing issues with the VPP/govpp versions used in Calico and in the VPP Forwarder. Currently, to make it work I need to:

  1. Build a special Calico version with all our VPP patches applied - https://github.com/Bolodya1997/vpp-dataplane/blob/nsm/vpplink/binapi/vpp_clone_current.sh
  2. Replace github.com/edwarnicke/govpp/binapi with github.com/projectcalico/vpp-dataplane/vpplink/binapi/vppapi in VPP Forwarder.

Step [2] is probably not actually needed and could be avoided by changing the govpp version used in [1] - needs to be tested.

But it looks like, if we want to support such an integration, we need to provide Calico images and k8s configuration files. Is that OK?

Bolodya1997 commented 3 years ago

There are 2 issues that need to be fixed to make it work:

  1. Use Calico VPP in Client - https://github.com/networkservicemesh/cmd-nsc-vpp/pull/236.
  2. Make memif and memifproxy socket files shared with Calico VPP pod - https://github.com/networkservicemesh/sdk-vpp/issues/357.

Currently there is another issue - Calico and NSM use different VPP versions with different additional patches, so for testing I am currently using a Calico VPP fork with the NSM VPP patches added: https://github.com/Bolodya1997/vpp-dataplane/blob/nsm-new/vpplink/binapi/vpp_clone_current.sh. I am also using a govpp fork with the Calico VPP patches added (only to the generated part, not to the generator): https://github.com/Bolodya1997/govpp/tree/calico.

Update: Memif2Memif test case is not currently working - https://github.com/networkservicemesh/sdk-vpp/issues/362.

Bolodya1997 commented 3 years ago

Failing to start a k8s cluster with Calico on Packet, so I created an issue for the Calico team - https://github.com/projectcalico/vpp-dataplane/issues/217.

Update: succeeded in setting up a cluster, currently working on tests.

Update: All basic scenarios except Memif2Memif currently work - networkservicemesh/sdk-vpp#362.

Bolodya1997 commented 3 years ago

Vladimir Popov Yesterday at 5:38 PM
---
Hi, I am trying to use vpp-calico with different cloud providers: [AKS, GKE, AWS].
On project wiki I have found page only for AWS integration. Does it mean that [AKS, GKE] currently can’t
be configured to use vpp-calico?

Aloys Augustin  2 hours ago
---
Hi Vladimir, at this point only EKS is officially supported. We're working on AKS support which may come
in the near future. GKE is less likely to be supported soon because GKE doesn't allow to swap the CNI,
however there is always the option to deploy a self-managed cluster on google cloud as well.

@edwarnicke It looks like it is hardly possible to test NSM with Calico VPP on GKE or AKS.

Bolodya1997 commented 3 years ago

Used the "Using DPDK" -> "With available hugepages" option from https://docs.projectcalico.org/reference/vpp/uplink-configuration. @edwarnicke is that exactly what you mean by binding the interface with vfio?

All basic scenarios except Memif2Memif currently work. Additionally tested the Vfio2Noop scenario to make sure that there is no problem with VFIO - it also works well.

Bolodya1997 commented 3 years ago

Tested with the abstract sockets solution. All basic scenarios except Memif2Memif currently work.

Bolodya1997 commented 3 years ago

@edwarnicke Do we want to have any CI running for this issue?

edwarnicke commented 3 years ago

Yes

Bolodya1997 commented 3 years ago

@edwarnicke Please, take a look at the following schemes and algorithms. Are all of them OK or do we need something to be implemented in some other way?

Node scheme

  1. VPP Forwarder uses node VPP instance aka Calico VPP.
  2. NSC, NSE use their own VPP instances.

memif to xxx

  1. NSC requests Forwarder for a memif connection.
    • netns file
  2. Forwarder requests NSE (probably remote, over a remote Forwarder) and creates an NSE-side connection.
  3. Forwarder requests VPP to create a memif server socket.
    • abstract socket path
    • netns file
  4. Forwarder creates xconnect with VPP.
  5. Forwarder responds back to NSC.
    • abstract socket path
  6. NSC requests VPP to create a memif client socket.
    • abstract socket path

xxx to memif

  1. NSC (probably remote, over a remote Forwarder) requests Forwarder for some connection.
  2. Forwarder requests NSE for a memif connection.
  3. NSE requests VPP to create a memif server socket.
    • abstract socket path
  4. NSE responds back to Forwarder.
    • abstract socket path
    • netns file
  5. Forwarder requests VPP to create a memif client socket.
    • abstract socket path
    • netns file
  6. Forwarder creates NSC-side connection.
  7. Forwarder creates xconnect with VPP.
  8. Forwarder responds back to NSC.

memif to memif

  1. NSC requests Forwarder for a memif connection.
    • NSC netns file
  2. Forwarder requests NSE for a memif connection.
  3. NSE requests VPP to create a memif server socket.
    • NSE abstract socket path
  4. NSE responds back to Forwarder.
    • NSE abstract socket path
    • NSE netns file
  5. Forwarder creates a memif proxy socket on the proxy abstract socket path in the NSC netns and starts transferring all data between it and the NSE abstract socket path in the NSE netns.
  6. Forwarder responds back to NSC.
    • proxy abstract socket path
  7. NSC requests VPP to create a memif client socket.
    • proxy abstract socket path
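
The proxy idea in step 5 can be illustrated with a minimal Go sketch. It only shows relaying bytes between two abstract Unix sockets and is not the actual memifproxy implementation. It assumes Linux (where a leading `@` in a Unix socket address selects the abstract namespace), uses stream sockets for simplicity, and keeps both sockets in a single network namespace. A real memif proxy would also have to switch between the NSC and NSE netns and forward the file descriptors that memif passes as ancillary data; none of that is shown. The names `relay`, `roundTrip`, `@nse-demo-...` and `@proxy-demo-...` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// relay accepts one connection on the proxy listener and copies bytes in both
// directions between that connection and the target abstract socket.
func relay(proxy net.Listener, targetAddr string) error {
	client, err := proxy.Accept()
	if err != nil {
		return err
	}
	defer client.Close()

	target, err := net.Dial("unix", targetAddr)
	if err != nil {
		return err
	}
	defer target.Close()

	done := make(chan struct{})
	go func() {
		io.Copy(target, client) // client -> target
		// Signal end of input to the target once the client is done.
		target.(*net.UnixConn).CloseWrite()
		close(done)
	}()
	io.Copy(client, target) // target -> client
	<-done
	return nil
}

// roundTrip starts a stand-in "NSE" echo server on one abstract socket, a
// proxy on another, sends msg through the proxy and returns the echoed reply.
func roundTrip(msg string) (string, error) {
	// Hypothetical names; the leading "@" means abstract socket on Linux.
	nseAddr := fmt.Sprintf("@nse-demo-%d", os.Getpid())
	proxyAddr := fmt.Sprintf("@proxy-demo-%d", os.Getpid())

	nse, err := net.Listen("unix", nseAddr)
	if err != nil {
		return "", err
	}
	defer nse.Close()
	go func() {
		c, err := nse.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		io.Copy(c, c) // echo everything back
	}()

	proxy, err := net.Listen("unix", proxyAddr)
	if err != nil {
		return "", err
	}
	defer proxy.Close()
	go relay(proxy, nseAddr)

	conn, err := net.Dial("unix", proxyAddr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	reply, err := roundTrip("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reply via proxy:", reply)
}
```

Abstract sockets have no file on disk, which is why the scheme above passes an abstract socket path plus a netns file instead of sharing socket files between pods.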

edwarnicke commented 3 years ago

This looks about right yes :)

Bolodya1997 commented 3 years ago

@edwarnicke Calico has integrated all the patches NSM needs into their VPP, but we still have a different VPP version, so cmd-forwarder-vpp cannot be used directly with Calico VPP. Should we create a new cmd-forwarder-vpp-calico with govpp generated for the Calico VPP version? Or should we use the latest Calico VPP release as a base for the NSM VPP applications and just update govpp?

edwarnicke commented 3 years ago

@Bolodya1997 I'll spin a new image with their patches, test it, and we can look at upgrading.

Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

Bolodya1997 commented 3 years ago

Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

I am working on the abstract sockets memif implementation; it will be clear once I finish and test it. Currently it is still not clear whether there is an issue with LinkUP events, because it could be caused by the old solution.

Bolodya1997 commented 3 years ago

Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

@edwarnicke We have a problem with the Calico VPP setup on Packet - the internet is not accessible from pods without hostNetwork: true. It looks like I am missing something in the configuration; I filed an issue for this in the Calico repo - https://github.com/projectcalico/vpp-dataplane/issues/263.

This issue affects the DNS test, but we are planning to rework it anyway: it would make more sense for the test to nslookup something like kubernetes.default instead of google.com, in which case we don't need internet access.

All other basic/feature tests are working; I am currently working on CI.

edwarnicke commented 2 years ago

@Mixaster995 You will probably need this: https://github.com/networkservicemesh/cmd-forwarder-vpp/pull/421

glazychev-art commented 2 years ago

@edwarnicke

We have tested the Calico integration PR and have the following suggestions:

1. Cluster on which we will do the integration

1.1 Packet

This PR integrates Calico on Packet. Problems:

2. Forwarder configuration

Currently, we have 2 versions of the tests - the usual one and one for Calico. We should consider using only one version. We can try to use the external VppAPISocket as the default (https://github.com/networkservicemesh/cmd-forwarder-vpp/blob/main/internal/config/config.go#L49) and mount this socket from the host to a specific folder in the forwarder (vpp-ext, for example). The forwarder will check for the default VPP API socket on startup: if it exists, use it (Calico case); if not, create a new VPP instance (current behavior).
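
A minimal Go sketch of that startup check, assuming the socket is mounted at a hypothetical path under a vpp-ext folder. The names `vppMode`, `vppModeFor` and the path are illustrative only and are not taken from cmd-forwarder-vpp; the sketch just decides between the two cases and does not start or connect to VPP.

```go
package main

import (
	"fmt"
	"os"
)

// vppMode says which VPP instance the forwarder would use.
type vppMode int

const (
	ownVPP      vppMode = iota // start a private VPP instance (current behavior)
	externalVPP                // attach to an existing VPP, e.g. Calico's
)

func (m vppMode) String() string {
	if m == externalVPP {
		return "external VPP"
	}
	return "own VPP"
}

// vppModeFor returns externalVPP only if socketPath exists and is a Unix
// socket; a missing path or any other file type falls back to ownVPP.
func vppModeFor(socketPath string) vppMode {
	fi, err := os.Stat(socketPath)
	if err != nil || fi.Mode()&os.ModeSocket == 0 {
		return ownVPP
	}
	return externalVPP
}

func main() {
	// Hypothetical mount point for the host VPP API socket.
	const defaultSocket = "/var/run/vpp-ext/api.sock"
	fmt.Println("using:", vppModeFor(defaultSocket))
}
```

Checking the file type as well as existence avoids treating a stale regular file or an empty mounted directory as a live VPP socket, although it still cannot tell whether a VPP process is actually listening.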

3. Healing.

As I remember, there are many chain elements that (explicitly or not) assume that forwarder death == VPP death. That does not hold for the Calico case. We need to come up with a correct VPP cleanup for when the forwarder is restarted:

Questions

  1. What do you think about Kind integration?
  2. Is it fine to use external VPP socket as default for Forwarder?
  3. Any thoughts about Healing solutions?

glazychev-art commented 2 years ago

Description

There is a problem with the forwarder configuration. It is related to network namespaces - Calico-VPP doesn't have grpcfd. For example, when we connect to the Endpoint, the forwarder receives the network namespace fd using grpcfd. But Calico-VPP doesn't have grpcfd and therefore knows nothing about the NSE's network namespace, so when we try to create a network interface we receive an error.

Solutions

  1. The simplest solution - Use hostPID:true for the forwarder by default - see comment - https://github.com/networkservicemesh/sdk-vpp/issues/354#issuecomment-904665828
  2. Use shared directory between Forwarder and Calico, where we can create namespace fds. But here we need to know at the stage of creating the network interface whether we use our own VPP or from Calico.
  3. Create a proxy sidecar for Calico. This sidecar will handle some VPP API calls differently. We can send the inode to the sidecar and create a unix connection between the forwarder and the sidecar to send the fd.

We think that 1 is the preferred solution at the moment. We can create an issue to use a different approach in future releases.
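
To illustrate why option 1 could help, here is a small Go sketch. It rests on an assumption not confirmed in this thread: that with hostPID:true the forwarder and Calico-VPP see the same PID numbering, so a netns fd held by the forwarder can be referred to by a `/proc/<pid>/fd/<fd>` path that the other process can open. The helper name `netnsPathForFD` is hypothetical, and the sketch uses the process's own netns as a stand-in for the fd received over grpcfd.

```go
package main

import (
	"fmt"
	"os"
)

// netnsPathForFD builds the /proc path under which a file descriptor held by
// process pid is visible to other processes in the same PID namespace.
func netnsPathForFD(pid, fd int) string {
	return fmt.Sprintf("/proc/%d/fd/%d", pid, fd)
}

func main() {
	// Stand-in for a netns fd received via grpcfd: open our own netns file.
	f, err := os.Open("/proc/self/ns/net")
	if err != nil {
		fmt.Println("cannot open netns file:", err)
		return
	}
	defer f.Close()

	// This path could be handed to a VPP running in another pod, provided
	// both share the host PID namespace (assumption, see lead-in).
	fmt.Println(netnsPathForFD(os.Getpid(), int(f.Fd())))
}
```

Without a shared PID namespace the same path would point at a different process or at nothing, which is the gap options 2 and 3 try to close in other ways.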

@edwarnicke What do you think? Is https://github.com/networkservicemesh/sdk-vpp/issues/354#issuecomment-904665828 still relevant, and can we use hostPID:true by default?