submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io

Something wrong on rancher 2.2.1 (2.1.7 2.1.8) #14

Closed Negashev closed 5 years ago

Negashev commented 5 years ago

1) Start Rancher HA on RKE and add the catalog https://github.com/rancher/submariner-charts

2) Create the broker cluster:

addon_job_timeout: 30
authentication: 
  strategy: "x509"
bastion_host: 
  ssh_agent_auth: false
ignore_docker_version: true
# 
#   # Currently only nginx ingress provider is supported.
#   # To disable ingress controller, set `provider: none`
#   # To enable ingress on specific nodes, use the node_selector, eg:
#      provider: nginx
#      node_selector:
#        app: ingress
# 
ingress: 
  provider: "nginx"
kubernetes_version: "v1.13.5-rancher1-2"
monitoring: 
  provider: "metrics-server"
# 
#   # If you are using calico on AWS
# 
#      network:
#        plugin: calico
#        calico_network_provider:
#          cloud_provider: aws
# 
#   # To specify flannel interface
# 
#      network:
#        plugin: flannel
#        flannel_network_provider:
#          iface: eth1
# 
#   # To specify flannel interface for canal plugin
# 
#      network:
#        plugin: canal
#        canal_network_provider:
#          iface: eth1
# 
network: 
  options: 
    flannel_backend_type: "vxlan"
  plugin: "canal"
restore: 
  restore: false
# 
#      services:
#        kube-api:
#          service_cluster_ip_range: 10.43.0.0/16
#        kube-controller:
#          cluster_cidr: 10.42.0.0/16
#          service_cluster_ip_range: 10.43.0.0/16
#        kubelet:
#          cluster_domain: cluster.local
#          cluster_dns_server: 10.43.0.10
# 
services: 
  etcd: 
    backup_config: 
      enabled: true
      interval_hours: 12
      retention: 6
    creation: "12h"
    extra_args: 
      election-timeout: "5000"
      heartbeat-interval: "500"
    retention: "72h"
    snapshot: false
  kube-api: 
    always_pull_images: false
    pod_security_policy: false
    service_node_port_range: "30000-32767"
  kubelet: 
    fail_swap_on: false
ssh_agent_auth: false
# 
#   # Rancher Config
# 
docker_root_dir: "/var/lib/docker"
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
local_cluster_auth_endpoint: 
  enabled: false
name: "test-submariner-broker"

3) Create the east cluster

4) Create the west cluster:

addon_job_timeout: 30
authentication: 
  strategy: "x509"
bastion_host: 
  ssh_agent_auth: false
dns: 
  provider: "kube-dns"
ignore_docker_version: true
# 
#   # Currently only nginx ingress provider is supported.
#   # To disable ingress controller, set `provider: none`
#   # To enable ingress on specific nodes, use the node_selector, eg:
#      provider: nginx
#      node_selector:
#        app: ingress
# 
ingress: 
  provider: "nginx"
kubernetes_version: "v1.13.5-rancher1-2"
monitoring: 
  provider: "metrics-server"
# 
#   # If you are using calico on AWS
# 
#      network:
#        plugin: calico
#        calico_network_provider:
#          cloud_provider: aws
# 
#   # To specify flannel interface
# 
#      network:
#        plugin: flannel
#        flannel_network_provider:
#          iface: eth1
# 
#   # To specify flannel interface for canal plugin
# 
#      network:
#        plugin: canal
#        canal_network_provider:
#          iface: eth1
# 
network: 
  options: 
    flannel_backend_type: "vxlan"
  plugin: "canal"
restore: 
  restore: false
# 
#      services:
#        kube-api:
#          service_cluster_ip_range: 10.43.0.0/16
#        kube-controller:
#          cluster_cidr: 10.42.0.0/16
#          service_cluster_ip_range: 10.43.0.0/16
#        kubelet:
#          cluster_domain: cluster.local
#          cluster_dns_server: 10.43.0.10
# 
services: 
  etcd: 
    backup_config: 
      enabled: true
      interval_hours: 12
      retention: 6
    creation: "12h"
    extra_args: 
      election-timeout: "5000"
      heartbeat-interval: "500"
    retention: "72h"
    snapshot: false
  kube-api: 
    always_pull_images: false
    pod_security_policy: false
    service_cluster_ip_range: "10.1.0.0/16"
    service_node_port_range: "30000-32767"
  kube-controller: 
    cluster_cidr: "10.0.0.0/16"
    service_cluster_ip_range: "10.1.0.0/16"
  kubelet: 
    cluster_dns_server: "10.1.0.10"
    cluster_domain: "west.local"
    fail_swap_on: false
ssh_agent_auth: false
# 
#   # Rancher Config
# 
docker_root_dir: "/var/lib/docker"
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
local_cluster_auth_endpoint: 
  enabled: false
name: "test-submariner-west"

5) Test!

On west we have 2 nginx pods (10.0.0.5, 10.0.1.4); on east we have 2 nginx pods (10.98.1.5, 10.98.0.4).
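
For reference, a minimal sketch of the kind of test used here (the pod name "test" and the busybox image are assumptions; the IPs are the ones listed above):

# launch a throwaway client pod on the west cluster
kubectl run -it --rm test --image=busybox --restart=Never -- sh
# inside it, try to reach the east pods directly by IP
ping -c 3 10.98.1.5
wget -qO- http://10.98.0.4/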

Interesting things:

Conclusion:

The cross-cluster network works only from the machine on which the engine is running; as a result, only one node sees all the pods of the other cluster.

Negashev commented 5 years ago

Tested again with flannel on both west and east; nothing changed.

Oats87 commented 5 years ago

Are your nodes within the same subnet?

Negashev commented 5 years ago

@Oats87 Yes, all machines are under 10.6.x.x (OpenStack).

And physically, all 3 clusters are in one data center (10 Gb network).

Negashev commented 5 years ago

Reproduced on rancher/rancher:v2.1.8 (what am I doing wrong?)

Negashev commented 5 years ago

@Oats87 Okay, tried rancher/rancher:v2.1.7 and it reproduced there too :sob: :scream:. The three cluster configs:

First config (default CIDRs, presumably the broker):

addon_job_timeout: 30
authentication: 
  strategy: "x509"
bastion_host: 
  ssh_agent_auth: false
ignore_docker_version: true
ingress: 
  provider: "nginx"
kubernetes_version: "v1.13.4-rancher1-1"
monitoring: 
  provider: "metrics-server"
network: 
  options: 
    flannel_backend_type: "vxlan"
  plugin: "canal"
services: 
  etcd: 
    creation: "12h"
    extra_args: 
      election-timeout: "5000"
      heartbeat-interval: "500"
    retention: "72h"
    snapshot: true
  kube-api: 
    pod_security_policy: false
    service_node_port_range: "30000-32767"
  kubelet: 
    fail_swap_on: false
ssh_agent_auth: false
Second config (cluster1.local):

addon_job_timeout: 30
authentication: 
  strategy: "x509"
bastion_host: 
  ssh_agent_auth: false
ignore_docker_version: true
ingress: 
  provider: "nginx"
kubernetes_version: "v1.13.4-rancher1-1"
monitoring: 
  provider: "metrics-server"
network: 
  options: 
    flannel_backend_type: "vxlan"
  plugin: "canal"
services: 
  etcd: 
    creation: "12h"
    extra_args: 
      election-timeout: "5000"
      heartbeat-interval: "500"
    retention: "72h"
    snapshot: true
  kube-api: 
    pod_security_policy: false
    service_cluster_ip_range: "10.61.0.0/16"
    service_node_port_range: "30000-32767"
  kube-controller: 
    cluster_cidr: "10.51.0.0/16"
    service_cluster_ip_range: "10.61.0.0/16"
  kubelet: 
    cluster_dns_server: "10.61.0.10"
    cluster_domain: "cluster1.local"
    fail_swap_on: false
ssh_agent_auth: false
Third config (cluster2.local):

addon_job_timeout: 30
authentication: 
  strategy: "x509"
bastion_host: 
  ssh_agent_auth: false
ignore_docker_version: true
ingress: 
  provider: "nginx"
kubernetes_version: "v1.13.4-rancher1-1"
monitoring: 
  provider: "metrics-server"
network: 
  options: 
    flannel_backend_type: "vxlan"
  plugin: "canal"
services: 
  etcd: 
    creation: "12h"
    extra_args: 
      election-timeout: "5000"
      heartbeat-interval: "500"
    retention: "72h"
    snapshot: true
  kube-api: 
    pod_security_policy: false
    service_cluster_ip_range: "10.62.0.0/16"
    service_node_port_range: "30000-32767"
  kube-controller: 
    cluster_cidr: "10.52.0.0/16"
    service_cluster_ip_range: "10.62.0.0/16"
  kubelet: 
    cluster_dns_server: "10.62.0.10"
    cluster_domain: "cluster2.local"
    fail_swap_on: false
ssh_agent_auth: false
Oats87 commented 5 years ago

@Negashev can you check the routing table on the other nodes (those that are not the gateway host) to see if the routes were properly installed? You should see routing rules for the other cluster's service/cluster CIDRs.
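
For example, on a west node one would expect something like this (a sketch; 10.98.0.0/16 is inferred from the east pod IPs in this thread, and the via/dev values are placeholders):

# look for routes covering the other cluster's CIDRs
ip route show | grep 10.98
# expected on every node, not only on the gateway:
# 10.98.0.0/16 via <gateway-node-ip> dev eth0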

Negashev commented 5 years ago

@Oats87 You mean ip route on the machines? It shows a route with the CIDR of the other cluster only on one machine: the one where the engine first started.

Negashev commented 5 years ago

pod_on_cluster_1_machine_with_SUBmaster traceroute to pod_on_cluster_2:

/ # traceroute 10.98.0.14
traceroute to 10.98.0.14 (10.98.0.14), 30 hops max, 46 byte packets
 1  10.76.0.1 (10.76.0.1)  0.009 ms  0.007 ms  0.092 ms
 2  machine_with_SUBmaster_CLUSTER_2 (10.6.193.143)  0.554 ms  0.510 ms  0.403 ms
 3  10.98.0.14 (10.98.0.14)  0.273 ms  0.560 ms  0.159 ms

pod_on_cluster_1_machine_WITHOUT_SUBmaster:

/ # traceroute 10.98.0.14
traceroute to 10.98.0.14 (10.98.0.14), 30 hops max, 46 byte packets
 1  10.76.2.1 (10.76.2.1)  0.019 ms  0.010 ms  0.005 ms
 2  machine_with_SUBmaster_CLUSTER_1 (10.6.193.144)  0.520 ms  0.536 ms  0.396 ms
 3  *  *  *
 ...
11  *  *  *

Negashev commented 5 years ago

Reproduced on Rancher 2.2.1 with Hetzner Cloud, Ubuntu 16 and flannel.

Negashev commented 5 years ago

We played with tcpdump and ping and found a problem with the reply packets to ping:

the reply is lost when node 1 (cluster 2) sends it to node 2 (cluster 2); node 2 receives nothing.

[screenshot: Rancher (2)]
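
For reference, a sketch of the kind of capture that narrows down where the reply dies (the "any" pseudo-interface is an assumption; 10.0.0.5 is one of the west pod IPs above):

# run on the east gateway node, the other east node, and the west gateway,
# then ping from the west pod and watch where the echo-reply stops appearing
tcpdump -ni any icmp and host 10.0.0.5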

Negashev commented 5 years ago

Reproduced on Rancher 2.2.2, RKE, flannel, CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64).

Oats87 commented 5 years ago

@Negashev Does the Hetzner Cloud enforce strict IP src/dst checks?

https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck
https://cloud.google.com/vpc/docs/using-routes#canipforward
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-network-interface#enable-or-disable-ip-forwarding

for the three major US clouds.
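
For comparison, on AWS the equivalent fix is disabling the source/destination check per instance (the instance ID below is a placeholder):

# allow the node to forward traffic for addresses it does not own
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check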

Negashev commented 5 years ago

@Oats87 Wow, hard question. I don't have extensive knowledge in this area, but I think not.

Negashev commented 5 years ago

@Oats87 We didn't use NAT on Hetzner Cloud, nor on our local cloud with the 10.6.x.x CIDR for machines.

Negashev commented 5 years ago

We turned off port-security in OpenStack and it helped (after rebooting the nodes).
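
For anyone hitting the same thing, a sketch of the OpenStack CLI steps (server and port IDs are placeholders; the port's security groups have to be cleared before port security can be disabled):

# find the Neutron port attached to the node
openstack port list --server <node-name>
# clear security groups and disable port security on that port
openstack port set --no-security-group --disable-port-security <port-id>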

Negashev commented 2 years ago

Still does not work on Hetzner (with RKE2).