projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Document for how to configure pod communication across Kubernetes clusters #1942

Open gyliu513 opened 6 years ago

gyliu513 commented 6 years ago

Seems many people are interested in this feature; we should have a document in Calico for how to set it up.

There has been some discussion, and I still have some trouble setting up such clusters, but I can work on the document once cross-cluster communication works.

/cc @caseydavenport @tmjd @karstensi

caseydavenport commented 6 years ago

@gyliu513 I think this would be a great thing to document and I'd be happy to help out with any gotchas. I'll read through the other thread.

The main thing is that there are a LOT of different ways this could be set up, so a one-size-fits-all solution probably doesn't exist. I think we'll need a doc which covers the main things you need to consider when setting up, and then some examples would be great.

gyliu513 commented 6 years ago

Thanks @caseydavenport , I want to start with the Calico route reflector; it seems route reflector 0.5.0 will not work based on the discussion here https://github.com/projectcalico/calico/issues/908#issuecomment-385160303

I'm now trying with Calico 2.6.6 and RR 0.4.2; I will add more details later if I hit any issues.

gyliu513 commented 6 years ago

Found an issue with the 0.4.2 setup and opened an issue for my question at https://github.com/projectcalico/calico/issues/1948

Camsteack commented 6 years ago

Hey guys, any update on this? We are looking at enabling pod-to-pod communication across Kubernetes clusters and it's difficult to find documentation on it.

Cheers

gyliu513 commented 6 years ago

@Camsteack Please take a look at this document https://medium.com/ibm-cloud/multi-cluster-support-for-service-mesh-with-ibm-cloud-private-d7d791f9b778 for how to configure pod-to-pod communication across Kubernetes clusters with node-to-node mesh.

redbaron commented 5 years ago

It is fairly straightforward. Here are my notes; some terminology might not be accurate (I am not a networking guy), but it does work.

Goal

Ensure pod-pod communication between clusters A (pod cidr 172.22.8.0/21) and B (pod cidr 172.22.0.0/21).

Existing setup

Planned implementation

Use bird to establish peering and route exchange between clusters using eBGP (BGP between different ASes). Given that master nodes have fixed IPs, they are good candidates to become BGP edge nodes and have a peering mesh set up between them.

The default behaviour for eBGP peering is to export (announce) all routes learned from iBGP (the in-cluster mesh), so all we need is to configure each Calico cluster with its own AS.

All nodes within a cluster continue to use the BGP mesh; there is no need for route reflectors.

Implementation details

Make calico use IPIP when reaching remote pod network

Calico uses a set of custom patches to Bird which make it install IPIP routes into the Linux routing table. A precondition is that Calico needs to know upfront which routes are to be used over IPIP; it can't learn this from BGP peering alone. So to some extent the dynamic nature of BGP is gimped: it needs to know upfront what it is going to learn from BGP.

Known subnets are configured with the IPPool custom resource. Despite what the description says, it is used even if Calico IPAM (IP address assignment) is disabled.

Create the following in each cluster (the example is given for cluster A):

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: b-ippool
spec:
  cidr: 172.22.0.0/21  # podCIDR used in cluster B
  ipipMode: Always   # crucial - configures Bird to use IPIP for learned routes within given CIDR
  natOutgoing: false
  disabled: true          # 'disabled' option has effect for IP assignment purposes only
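
A quick sanity check that the pool landed (a sketch; the file name is just whatever you saved the manifest as, and it assumes calicoctl is installed and pointed at this cluster's datastore):

# apply the pool for cluster B's podCIDR and list all pools
kubectl apply -f b-ippool.yaml
# the output should list the new pool with its CIDR, IPIP mode and disabled flag
calicoctl get ippools -o wide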

Configure calico AS number

In each cluster, create a BGPConfiguration custom resource with a distinct AS number per cluster:

# calico-bgpconfig.yaml
apiVersion: crd.projectcalico.org/v1
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  # MUST BE UNIQUE PER CLUSTER:
  # - default (no config): 64512
  # - A: 64513
  # - B: 64514
  # - C: 64515
  asNumber: 64513
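
To confirm the AS number was picked up (a sketch, again assuming calicoctl is available for this cluster):

# the default BGPConfiguration in cluster A should report asNumber: 64513
calicoctl get bgpconfiguration default -o yaml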

Configure peering between master nodes

Peering configuration is driven by the BGPPeer custom resource. There must be a single BGPPeer resource for each remote peer node. Create the following in each cluster (the example is given for cluster A):

# calico-bgppeers.yaml
#
# Naming convention is: 'edge-peer-$REMOTE_NODE_NAME'
#
# Config below makes use of the label 'edge'
#  on Calico Node objects.
# IMPORTANT: for peering to work, you must:
#  1. ensure that TCP port 179 works between
#     peering nodes
#  2. keep every BGPPeer config symmetric:
#     if you peer with a node, that node
#     MUST peer with you
#
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: edge-peer-b-node1
spec:
  nodeSelector: has(edge)
  peerIP: 172.25.5.212  # IP of node1 in cluster B  (remote cluster)
  asNumber: 64514      # AS used in cluster B (remote cluster)
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: edge-peer-b-node2
spec:
  nodeSelector: has(edge)  # apply this BGPPeer config to Calico Nodes with label 'edge'
  peerIP: 172.25.5.213
  asNumber: 64514
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: edge-peer-b-node3
spec:
  nodeSelector: has(edge)
  peerIP: 172.25.5.214
  asNumber: 64514
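
For completeness, the symmetric configuration in cluster B points back at cluster A's edge nodes using cluster A's AS number. A sketch of one such peer (the peer IP is an assumption for illustration; 172.24.5.211 is the cluster A node that appears in the birdcl example further down; repeat for each cluster A edge node):

# calico-bgppeers.yaml (cluster B)
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: edge-peer-a-node1
spec:
  nodeSelector: has(edge)
  peerIP: 172.24.5.211  # IP of node1 in cluster A (assumed for illustration)
  asNumber: 64513       # AS used in cluster A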

Assign Calico edge label to master nodes

Calico has its own concept of a Node. When using the Kubernetes datastore, a Calico Node is tied to a Kubernetes Node. Calico uses annotations on the Kubernetes Node to "configure" the Calico Node.

Earlier we created a BGPPeer configuration which selects nodes with the label edge. This label is a Calico Node label, not a Kubernetes Node label. A bit confusingly, to set a Calico Node label we need to set a Kubernetes Node annotation.

The snippet below sets the Calico edge label on all Kubernetes Nodes with the master role. It overwrites the full set of Calico labels; adjust accordingly if you are already using Calico labels for other purposes:

 kubectl get node -l node-role.kubernetes.io/master='' -o name \
 | xargs --no-run-if-empty -I'{}' kubectl annotate --overwrite {} projectcalico.org/labels='{"edge":"true"}'

Once done, the BGPPeer configuration created earlier kicks in and the master nodes will attempt to peer with the remote cluster's master nodes. This can (and will) briefly render calico-node pods "NOT READY", because the default readiness check expects all peers to be up, and for that you need to complete the setup of all clusters you are peering with. So you might want to silence alerts or reconfigure the readiness check to stop checking bird status.
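
To check whether the sessions actually came up (a sketch; calicoctl node status must run on the node itself, and birdcl is the same tool described in the Debugging section below):

# on an edge (master) node: the remote peers should show state "Established"
sudo calicoctl node status

# or, from inside the calico-node container on that node:
birdcl -s /var/run/calico/bird.ctl show protocols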

Make calico accept IPIP traffic

There is an environment variable FELIX_EXTERNALNODESCIDRLIST which, according to its description, is required for Calico to accept IPIP traffic. I didn't check whether it is actually needed, and I set it before any of the other steps in this guide, so it might be redundant, but this is what I added to the calico DaemonSets (example is for cluster A):

      containers:
        # Runs calico/node container on each Kubernetes node.  This
        # container programs network policy and routes on each
        # host.
        - name: calico-node
          image: quay.io/calico/node:v3.4.0
          env:
            # POD-POD cross cluster connectivity
            - name: FELIX_EXTERNALNODESCIDRLIST
            #           cluster B     cluster C
              value: "172.25.5.0/24,10.1.44.0/24"

Optimize traffic flow

If everything is done correctly, you should be able to ping any pod from any pod (or node), even across clusters (mind the NetworkPolicies in both clusters; they might block it).

Upon closer inspection, however, traffic goes via the master nodes. That is not what we want: we want master nodes to exchange routes and propagate learned routes further into their own cluster mesh, but they shouldn't be routing the whole cross-cluster traffic; we want it to go directly to the node responsible for a given pod IP.

For that we need to make bird export (advertise) routes without replacing BGP.next_hop (see details in my comment on the PR that supposedly attempted to solve it: https://github.com/projectcalico/node/pull/55#issuecomment-451733992).

calico-node generates bird.cfg from a template file. Add the following snippet to Calico's DaemonSet YAML file:

        - name: calico-node
          command:
          - /bin/sh
          - -c
          # adds `next hop keep;` option to bird configuration.
          # WARNING: quite fragile and version-specific, recheck on every calico version update
          - sed -ie "/multihop/a next hop keep;" /etc/calico/confd/templates/bird.cfg.template && exec /sbin/start_runit
          image: quay.io/calico/node:v3.4.0
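
To confirm the tweak survived into the rendered config (a sketch; the calico-node namespace and the path of the generated file are assumptions and may differ between versions):

# exec into a calico-node pod and check the generated bird config;
# `next hop keep;` should appear right after the multihop line
kubectl -n kube-system exec <calico-node-pod> -- grep -A1 multihop /etc/calico/confd/config/bird.cfg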

Debugging

There are mainly two tools you can use for debugging: ip route and birdcl (bird remote control).

ip route

To see effective routes on a node, use ip route and its sidekick ip monitor. Every node has a small slice of the total podCIDR (configured in kube-controller-manager) used to assign IPs to pods running on that node. It is stored in the node's .spec.podCIDR; you can get the current values for all nodes with the following:

kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.podCIDR}{"\n"}{end}'

Each node's podCIDR is advertised by Calico's bird daemon on that node. If everything works correctly, you should see dev tunl0 proto bird onlink in the ip route show output for each node in the local and remote clusters you configured peering with.
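
For illustration only (not real output; the addresses reuse the example numbers from this guide), a route to one of cluster B's per-node blocks as seen from a cluster A node would look roughly like:

172.22.0.64/26 via 172.25.5.213 dev tunl0 proto bird onlink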

birdcl

In the calico-node container there is an invaluable Bird remote control binary. To use it, exec into the calico-node container and run:

birdcl -s /var/run/calico/bird.ctl

Your best friend here is show route all (see bird remote control documentation for comprehensive help on this command).

Example use and output:

#                        peer name we announce to
#                              vvvvvvvvvvvv
bird> show route all export Node_172_25_5_212

# route we announce                            learned from peer name (node from our cluster in this case)
#       |               route as bird know it      |                                         
#       |             (not necessary match one     |
#       |               installed into kernel)     |
#       |                   |                      |
#       v                   v                      v
172.22.8.192/26    via 172.24.5.211 on ens192 [Mesh_172_24_5_211 2019-01-03] * (100/0) [AS64513i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path: 64513            # <-- our AS number
    BGP.next_hop: 172.24.5.211    # <-- we tell the remote peer to install the route via this IP
    BGP.local_pref: 100

Confusingly, BGP peers are called "protocols" in bird (BGP?) terminology.

kesor commented 5 years ago

@redbaron I was able to follow your guide and get the routes published in both clusters, but then the pods cannot actually reach each other, because by default (in Calico 3.7) IP-in-IP traffic from nodes that are not part of this cluster is automatically filtered using iptables.

There is an ipset called cali40all-hosts-net that includes the FELIX_EXTERNALNODESCIDRLIST and the server IPs of the current cluster, but it does not include the node IPs of the other cluster - thus traffic is dropped.

ffilippopoulos commented 5 years ago

@kesor you need to define the remote cluster ip pool like:

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: dev-gcp-pods
spec:
  cidr: 10.4.0.0/16
  ipipMode: CrossSubnet
  disabled: true

in each cluster, and you also need to let felix know about the remote hosts' subnet via FELIX_EXTERNALNODESCIDRLIST:

    env:
      - name: FELIX_EXTERNALNODESCIDRLIST
        value: '10.10.0.0/24'

in order for calico/bird to accept the routes and insert the correct paths into your route tables. After that you might need network policies to allow the whole remote cluster pod IP range. We are taking a similar approach to @redbaron's solution, but we are using gobgp instances because we need the flexibility of dynamic IP addresses for our cluster nodes. We created a small blog post describing it here: https://uw-labs.github.io/blog/kubernetes/2019/05/01/cross-cluster-comms.html
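
A minimal sketch of such a policy (the namespace and the catch-all pod selector are placeholders; 10.4.0.0/16 is the remote pod CIDR from the pool above):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-remote-cluster-pods
  namespace: my-namespace        # placeholder - apply per namespace as needed
spec:
  podSelector: {}                # all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.4.0.0/16    # remote cluster pod CIDR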

kesor commented 5 years ago

@ffilippopoulos following the guide by @redbaron worked well for us, the only missing piece was FELIX_EXTERNALNODESCIDRLIST which we wrongly assumed was a list of POD network CIDRs and not NODE networks :/

tushar00jain commented 5 years ago

@redbaron's guide also worked for me; it would also be good to have documentation on how to set this up using route reflectors! Is it also possible to propagate service IP addresses so those could be routed among clusters as well?

rgarcia89 commented 5 years ago

@redbaron is this still working for you with k8s v1.15.0? In my test, using the exact same manifests that I used for v1.14.4, only the nodes that are listed as the primary path in the BGP routing table are able to reach the remote cluster... Seems like something has changed.

redbaron commented 5 years ago

I haven't updated to 1.15 yet, thanks for the heads up. If you find what changed, please post it here.


rgarcia89 commented 5 years ago

I just ran another test. It seems like with 1.15 it only works when applying the "next hop keep" tweak that you mentioned above. Otherwise, nodes that are not the next hop are not able to reach the remote cluster. This was working without any issues in 1.14.4, even without the "next hop keep" tweak. Have you been able to reproduce it? I can send you the manifests I used if you want. They basically follow your notes from above, without the traffic optimization part.

rgarcia89 commented 5 years ago

@redbaron how do you advertise service ip addresses when using the

 sed -ie "/multihop/a next hop keep;" /etc/calico/confd/templates/bird.cfg.template && exec /sbin/start_runit

tweak?

redbaron commented 5 years ago

I don't understand your question. A Service with externalTrafficPolicy: Local works just fine with this setup.


rgarcia89 commented 5 years ago

Yeah, it was my fault - I totally forgot that I had to add the service address CIDR to the Calico IPPool...
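
For anyone hitting the same thing: the pool in question mirrors the disabled remote pod pools earlier in this thread, just with the remote cluster's service CIDR. A sketch (the CIDR is a placeholder; use your cluster's actual service range):

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: b-services
spec:
  cidr: 10.96.0.0/12    # placeholder - remote cluster's service CIDR
  ipipMode: Always
  natOutgoing: false
  disabled: true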

dimm0 commented 3 years ago

I just ran another test. It seems like with 1.15 it only works when applying the "next hop keep" tweak that you mentioned above. Otherwise, nodes that are not the next hop are not able to reach the remote cluster. This was working without any issues in 1.14.4, even without the "next hop keep" tweak. Have you been able to reproduce it? I can send you the manifests I used if you want. They basically follow your notes from above, without the traffic optimization part.

Is there any better way to do this in 3.17?

rgarcia89 commented 3 years ago

@dimm0 I have not seen any other way than to use this hack. It is currently working on v3.15 clusters.

Maybe @caseydavenport or @redbaron know more

patjlm commented 3 years ago

As far as I understand, BGP and IPIP encapsulation cannot be used on Microsoft Azure, since their vnet entirely controls packet routing. We used VXLAN for the intra-cluster setup. Is Calico cluster mesh working on Azure? If there is any doc on this, that'd be perfect!

laxmanvallandas commented 2 years ago

@redbaron, is this guide still valid for v3.19.1? We tried it on k8s 1.19. Connections are established but pods do not communicate. We noticed that for all pods deployed in cluster 2, routes on cluster 1 nodes show the gateway as master1, although we have 3 master nodes. And a packet sent pod1 -> node1 -> cluster1 reaches cluster2 -> master1, gets decapsulated, and is dropped. All firewalls and security groups are taken care of. @tushar00jain, were there any tweaks in your setup besides the above guide (in case you are still running the setup)?

george-angel commented 2 years ago

Hello @laxmanvallandas

We are running a cross-cluster setup that heavily leverages Calico; we have a write-up here: https://uw-labs.github.io/blog/kubernetes,/multicluster/2021/07/21/kube-semaphore-intro.html

Hopefully that will be useful to you.

laxmanvallandas commented 2 years ago

@george-angel, that's exactly where I started ;). But I found that pod-to-pod communication is the prerequisite, as mentioned here.

ffilippopoulos commented 2 years ago

@laxmanvallandas In this: https://uw-labs.github.io/blog/kubernetes,/multicluster/2021/07/21/kube-semaphore-intro.html pod-to-pod connectivity is achieved via a WireGuard mesh network created with: https://github.com/utilitywarehouse/semaphore-wireguard. This will configure local wg peers on all nodes that can route traffic for the respective remote pod subnets (assuming that you have node-to-node connectivity between cluster1 and cluster2). It is an alternative to the BGP route reflector approach described in this issue, with additional encryption provided by WireGuard. The service mirroring and policies can be used regardless of the method you select to implement cross-cluster pod communication.

laxmanvallandas commented 2 years ago

@ffilippopoulos, thanks a ton. I think I kind of lost track of this while experimenting with multiple ideas from across the globe ;). I will give it a try and keep you all posted.