gyliu513 opened this issue 6 years ago
@gyliu513 I think this would be a great thing to document and I'd be happy to help out with any gotchas. I'll read through the other thread.
The main thing is that there are a LOT of different ways this could be set up, so a one-size-fits-all solution probably doesn't exist. I think we'll need a doc which covers the main things you need to consider when setting up, and then some examples would be great.
Thanks @caseydavenport, I want to start with the Calico route reflector. It seems route reflector 0.5.0 will not work, based on the discussion here: https://github.com/projectcalico/calico/issues/908#issuecomment-385160303
I'm now trying with Calico 2.6.6 and RR 0.4.2; I will append more detail later for any issues.
Found a problem with the 0.4.2 setup and opened an issue for my question at https://github.com/projectcalico/calico/issues/1948
Hey guys, any update on that? We are looking at enabling pod-to-pod communication across Kubernetes clusters and it's difficult to find documentation on that.
Cheers
@Camsteack Please take a look at this document https://medium.com/ibm-cloud/multi-cluster-support-for-service-mesh-with-ibm-cloud-private-d7d791f9b778 for how to configure pod-to-pod communication across Kubernetes clusters with node-to-node mesh.
It is fairly straightforward. Here are my notes; some terminology might not be accurate (I am not a networking guy), but it does work.
Goal: ensure pod-to-pod communication between clusters A (pod CIDR 172.22.8.0/21) and B (pod CIDR 172.22.0.0/21).
Use bird to establish peering and route exchange between clusters using eBGP (BGP between different AS numbers). Given that master nodes have fixed IPs, they are good candidates to become BGP edge nodes and have a peering mesh set up between them.
The default behaviour for eBGP peering is to export (announce) all routes learned from iBGP (the in-cluster mesh), so all we need is to configure each Calico cluster with its own AS number.
All nodes within a cluster continue to use the BGP mesh; there is no need for route reflectors. A rough sketch of the topology is below.
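Roughly, the topology these notes aim for looks like this (the AS numbers, pod CIDRs and node IPs are the example values used later in this write-up; substitute your own):

Cluster A (AS 64513)                          Cluster B (AS 64514)
pod CIDR 172.22.8.0/21                        pod CIDR 172.22.0.0/21
nodes    172.24.5.x                           nodes    172.25.5.212-214

 master nodes (edge)  <==== eBGP peering ====>  master nodes (edge)
        |                                              |
 iBGP node-to-node mesh                         iBGP node-to-node mesh
 within cluster A                               within cluster B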
Calico uses a set of custom patches to Bird which make it install IPIP routes into the Linux routing table. The precondition is that Calico needs to know upfront which routes are to be used over IPIP; it can't learn that from BGP peering alone. So to some extent the dynamic nature of BGP is gimped: Calico needs to know upfront what it is going to learn from BGP.
Known subnets are configured with the IPPool custom resource. Despite what the description says, it is used even if Calico IPAM (IP address assignment) is disabled.
Create the following in each cluster (the example is given for cluster A):
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: b-ippool
spec:
cidr: 172.22.0.0/21 # podCIDR used in cluster B
ipipMode: Always # crucial - configures Bird to use IPIP for learned routes within given CIDR
natOutgoing: false
disabled: true # 'disabled' option has effect for IP assignment purposes only
In each cluster, create a BGPConfiguration custom resource with a distinct AS number per cluster:
# calico-bgpconfig.yaml
apiVersion: crd.projectcalico.org/v1
kind: BGPConfiguration
metadata:
name: default
spec:
logSeverityScreen: Info
nodeToNodeMeshEnabled: true
# MUST BE UNIQUE PER CLUSTER:
# - default (no config): 64512
# - A: 64513
# - B: 64514
# - C: 64515
asNumber: 64513
Peering configuration is driven by the BGPPeer custom resource. There must be a single BGPPeer resource for each remote peer node. Create the following in each cluster (the example is given for cluster A):
# calico-bgppeers.yaml
#
# Naming convention is: 'edge-peer-$REMOTE_NODE_NAME'
#
# Config below makes use of label 'edge'
# on Calico Node objects.
# IMPORTANT: for peering to work, you must:
# 1. ensure that TCP port 179 works between
# peering nodes
# 2. keep any BGPPeer config symmetric:
#    if you peer with a node, that node
#    MUST peer with you
#
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: edge-peer-b-node1
spec:
nodeSelector: has(edge)
peerIP: 172.25.5.212 # IP of node1 in cluster B (remote cluster)
asNumber: 64514 # AS used in cluster B (remote cluster)
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: edge-peer-b-node2
spec:
nodeSelector: has(edge) # apply this BGPPeer config to Calico Nodes with label 'edge'
peerIP: 172.25.5.213
asNumber: 64514
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: edge-peer-b-node3
spec:
nodeSelector: has(edge)
peerIP: 172.25.5.214
asNumber: 64514
Calico has its own concept of a Node. When using the Kubernetes datastore, a Calico Node is tied to a Kubernetes Node, and Calico uses annotations on the Kubernetes Node to "configure" the Calico Node.
Earlier we created a BGPPeer configuration which selects nodes with the label edge. This label is a Calico Node label, not a Kubernetes Node label. A bit confusingly, to set a Calico Node label we need to set a Kubernetes Node annotation.
The snippet below sets the Calico edge label on all Kubernetes Nodes with the master role. It overwrites the full set of Calico labels, so adjust accordingly if you are already using Calico labels for other purposes:
kubectl get node -l node-role.kubernetes.io/master='' -o name \
| xargs --no-run-if-empty -I'{}' kubectl annotate --overwrite {} projectcalico.org/labels='{"edge":"true"}'
Once done, the BGPPeer configuration created earlier kicks in and the master nodes will attempt to peer with the remote cluster's master nodes. This can (and will) briefly render calico-node pods "NOT READY", because the default readiness check expects all peers to be up, and for that you need to complete the setup of all the clusters you are peering with. So you might want to silence alerts, or reconfigure the readiness check to stop checking bird status.
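For reference, a minimal sketch of what relaxing that readiness check might look like in the calico-node DaemonSet, assuming the stock probe that checks both Felix and bird (verify against the probe actually present in your manifest before changing it):
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            # - -bird-ready   # dropped so not-yet-established BGP peers don't flip the pod to NOT READY
          periodSeconds: 10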
There is an environment variable FELIX_EXTERNALNODESCIDRLIST which, according to its description, is required for Calico to accept IPIP traffic. I didn't check whether it is actually needed, and I set it before any of the other steps in this guide, so it might be redundant, but this is what I added to the Calico DaemonSets (the example is for cluster A):
containers:
# Runs calico/node container on each Kubernetes node. This
# container programs network policy and routes on each
# host.
- name: calico-node
image: quay.io/calico/node:v3.4.0
env:
# POD-POD cross cluster connectivity
- name: FELIX_EXTERNALNODESCIDRLIST
# cluster B cluster C
value: "172.25.5.0/24,10.1.44.0/24"
If everything is done correctly, you should be able to ping any pod from any pod (or node), even across clusters (mind NetworkPolicies in both clusters; they might block it).
Upon closer inspection, however, traffic goes via the master nodes. That is not what we want: we want the master nodes to exchange routes and propagate learned routes further into their own cluster mesh, but they shouldn't be routing all cross-cluster traffic; we want it to go directly to the node responsible for a given pod IP.
For that we need to make bird export (advertise) routes without replacing BGP.next_hop (see details in my comment on the PR which supposedly attempted to solve it: https://github.com/projectcalico/node/pull/55#issuecomment-451733992).
calico-node generates bird.cfg from a template file. Add the following snippet to Calico's DaemonSet YAML file:
- name: calico-node
command:
- /bin/sh
- -c
# adds `next hop keep;` option to bird configuration.
# WARNING: quite fragile and version-specific, recheck on every calico version update
- sed -ie "/multihop/a next hop keep;" /etc/calico/confd/templates/bird.cfg.template && exec /sbin/start_runit
image: quay.io/calico/node:v3.4.0
There are mainly two tools you can use to debug: ip route and birdcl (bird control).
To see the effective routes on a node, use ip route and its sidekick ip monitor. Every node has a small slice of the total podCIDR (configured in kube-controller-manager) used to assign IPs to pods running on that node. It is stored in the node's .spec.podCIDR; you can get the current values for all nodes with the following:
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.podCIDR}{"\n"}{end}'
Each node's podCIDR is advertised by Calico's bird daemon on that node. If all works correctly, you should see dev tunl0 proto bird onlink in the ip route show output for each node in the local and remote cluster you configured peering with.
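For illustration, a healthy route towards one of cluster B's per-node blocks, as seen from a node in cluster A, would look roughly like this (the block and next hop are illustrative values in the spirit of the addresses above):
172.22.0.64/26 via 172.25.5.213 dev tunl0 proto bird onlink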
In the calico-node container there is an invaluable Bird remote control binary. To use it, exec into the calico-node container and run:
birdcl -s /var/run/calico/bird.ctl
Your best friend here is show route all (see the bird remote control documentation for comprehensive help on this command).
Example use and output:
# "Node_172_25_5_212" is the peer we announce to:
bird> show route all export Node_172_25_5_212
# In the output below, the first column (172.22.8.192/26) is the route we
# announce, "via 172.24.5.211 on ens192" is the route as bird knows it (not
# necessarily the one installed into the kernel), and "[Mesh_172_24_5_211 ...]"
# is the peer the route was learned from (a node from our own cluster in this case).
172.22.8.192/26 via 172.24.5.211 on ens192 [Mesh_172_24_5_211 2019-01-03] * (100/0) [AS64513i]
Type: BGP unicast univ
BGP.origin: IGP
BGP.as_path: 64513 # <-- our AS number
BGP.next_hop: 172.24.5.211 # <-- we tell the remote peer that it should install the route via this IP
BGP.local_pref: 100
Confusingly, BGP peers are called "protocols" in bird terminology.
@redbaron I was able to follow your guide and get the routes published in both clusters, but then the pods cannot actually reach each other, because by default (in Calico 3.7) IP-in-IP traffic from nodes that are not part of the cluster is automatically filtered using iptables.
There is an ipset called cali40all-hosts-net that includes the FELIX_EXTERNALNODESCIDRLIST and the server IPs of the current cluster, but it does not include the node IPs of the other cluster - thus traffic is dropped.
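One way to see what a node actually allows is to dump that ipset on the node (assuming the set name above; requires the ipset tool):
# run on a node in the local cluster; lists the members of the allowed-hosts set
sudo ipset list cali40all-hosts-net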
@kesor you need to define the remote cluster IP pool like this:
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: dev-gcp-pods
spec:
cidr: 10.4.0.0/16
ipipMode: CrossSubnet
disabled: true
in each cluster, and you also need to let Felix know about the remote hosts' subnet in FELIX_EXTERNALNODESCIDRLIST:
env:
- name: FELIX_EXTERNALNODESCIDRLIST
value: '10.10.0.0/24'
in order for calico/bird to accept the routes and insert the correct paths into your route tables.
After that you might need network policies to allow the whole remote cluster pod IP range; a sketch is below.
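As a sketch (the namespace is an assumption and the 10.4.0.0/16 remote pod CIDR matches the IPPool example above; adjust both to your own clusters), a plain Kubernetes NetworkPolicy allowing ingress from the remote cluster's pods could look like:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-remote-cluster-pods
  namespace: default          # apply per namespace as needed
spec:
  podSelector: {}             # all pods in this namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16     # remote cluster pod CIDR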
We are taking a similar approach to @redbaron's solution, but we are using gobgp instances because we need the flexibility of dynamic IP addresses for our cluster nodes.
We created a small blog post to describe it here: https://uw-labs.github.io/blog/kubernetes/2019/05/01/cross-cluster-comms.html
@ffilippopoulos following the guide by @redbaron worked well for us; the only missing piece was FELIX_EXTERNALNODESCIDRLIST, which we wrongly assumed was a list of pod network CIDRs and not node networks :/
@redbaron's guide also worked for me; it would also be good to have documentation on how to set this up using route reflectors! Is it also possible to propagate service IP addresses so those could be routed among clusters as well?
@redbaron is this still working for you with k8s v1.15.0? In my test, using the exact same manifests that I used for v1.14.4, only the nodes that are listed as the primary path in the BGP routing table are able to reach the remote cluster... Seems like something has changed.
I haven't updated to 1.15 yet, thanks for the heads up. If you find what changed, please post it here.
I just ran another test. It seems like with 1.15 it only works when applying the "next hop keep" tweak that you mentioned above. Otherwise nodes that are not the next hop are not able to reach the remote cluster. This, however, was working without any issues in 1.14.4, also without the "next hop keep" tweak. Have you been able to reproduce it? I can send you the manifests I used if you want. They basically follow your notes from above, without the traffic optimization part.
@redbaron how do you advertise service IP addresses when using the
sed -ie "/multihop/a next hop keep;" /etc/calico/confd/templates/bird.cfg.template && exec /sbin/start_runit
tweak?
I don't understand your question. Services with external traffic policy Local work just fine with this setup.
Yeah, it was my fault - I totally forgot that I had to add the service address CIDR to the Calico IPPool...
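For anyone else hitting this, a sketch of such a pool (10.96.0.0/12 is just the common default service CIDR and is an assumption; use the remote cluster's actual service CIDR):
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: remote-services-pool
spec:
  cidr: 10.96.0.0/12     # remote cluster service CIDR (assumed default, adjust)
  ipipMode: Always       # route learned service routes over IPIP
  natOutgoing: false
  disabled: true         # never used for IP assignment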
Is there any better way to do this in 3.17?
@dimm0 I have not seen any other way than to use this hack. It is currently working on v3.15 clusters.
Maybe @caseydavenport or @redbaron know more
As far as I understood, BGP and IPIP encapsulation cannot be used on Microsoft Azure, since their VNet entirely controls packet routing. We used VXLAN for the intra-cluster setup. Is Calico cluster mesh working on Azure? If there are any docs on this, that'd be perfect!
@redbaron, is this guide still valid for v3.19.1? We tried it on k8s 1.19. Connections are established, but pods do not communicate. We noticed that for all pods deployed in cluster 2, routes on cluster 1 nodes show the gateway as master1, although we have 3 master nodes. And a packet sent from pod1->node1->cluster1 reaches cluster2->master1, gets decapsulated, and is dropped. All firewalls and security groups are taken care of. @tushar00jain, were there any tweaks in your setup besides the above guide (in case you are still running the setup)?
Hello @laxmanvallandas
We are running a cross-cluster setup that heavily leverages calico, we have a write up here: https://uw-labs.github.io/blog/kubernetes,/multicluster/2021/07/21/kube-semaphore-intro.html
Hopefully that will be useful to you.
@george-angel, that's exactly where I started ;). But I found that pod-to-pod communication is the prerequisite, as mentioned here.
@laxmanvallandas In this: https://uw-labs.github.io/blog/kubernetes,/multicluster/2021/07/21/kube-semaphore-intro.html pod-to-pod connectivity is achieved via a WireGuard mesh network created with https://github.com/utilitywarehouse/semaphore-wireguard. This will configure local wg peers on all nodes that can route traffic for the respective remote pod subnets (assuming that you have node-to-node connectivity between cluster1 and cluster2). It is an alternative to the BGP route reflector approach described in this issue, with additional encryption provided by WireGuard. The service mirroring and policies can be used regardless of the method you select to implement cross-cluster pod communication.
@ffilippopoulos, thanks a ton. I think I kind of lost track of this while experimenting with multiple ideas across the globe ;). I will give it a try and keep you all posted.
Seems many people are interested in this feature; we should have a document in Calico for how to set it up.
There has been some discussion, and I still have some trouble setting up such clusters, but I can work on the document once cross-cluster communication works.
/cc @caseydavenport @tmjd @karstensi