submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

"Error while adding route" when trying to setup Submariner #26

Closed: ishantanu closed this issue 4 years ago

ishantanu commented 5 years ago

Hi,

Here is the setup I am trying Submariner with:

  1. One 3-node cluster on AWS.
  2. Two 3-node bare-metal clusters that are reachable over the internet.

So, the thing is:

  1. I tried to set up Submariner using one bare-metal cluster as the broker. It worked. But later on, I wanted to change the instance types of the worker nodes on AWS.
  2. I changed the instance type of the worker nodes via the Rancher GUI, and new workers were added.

After that, the submariner route pods on both clusters are showing this error:

E0507 12:02:19.649758       1 route.go:385] error while adding route {Ifindex: 2 Dst: 10.45.0.0/16 Src: <nil> Gw: XX.XXX.XXX.XXX Flags: [] Table: 0}: file exists

So, I deleted the complete Submariner installation on all clusters and retried. But I still see the same error.

Where are these things stored? How do I make Submariner work again?

UPDATE: I tried the whole sequence one more time by deleting the Helm releases and recreating everything. It still shows the same error.

I even recreated the clusters with different CIDRs, and they all still show the same error message; only the CIDR values change.

E0507 14:09:19.179172       1 route.go:385] error while adding route {Ifindex: 2 Dst: 10.53.0.0/16 Src: <nil> Gw: 192.168.0.189 Flags: [] Table: 0}: file exists
E0507 14:09:19.179210       1 route.go:385] error while adding route {Ifindex: 2 Dst: 10.54.0.0/16 Src: <nil> Gw: 192.168.0.189 Flags: [] Table: 0}: file exists
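
For context, "file exists" is the kernel's EEXIST error: the route agent is trying to add a route that is already present in the node's kernel routing table (which is where these routes are stored). A minimal way to inspect and clear a stale entry, assuming shell access to the gateway node and using this thread's example CIDR:

# Show what the kernel already has installed for the remote CIDR:
ip route show 10.53.0.0/16

# If a stale route from an earlier install is present, delete it so the
# route agent can re-add it cleanly:
sudo ip route del 10.53.0.0/16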
Oats87 commented 5 years ago

That message is actually OK; is Submariner functional? The next release will include logic so that it won't constantly try to add a route that already exists.

ishantanu commented 5 years ago

Nope, it is not. I tried creating an nginx pod in one cluster and pinging its pod IP from another cluster. It did not work.
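
A minimal version of that connectivity test, with an illustrative pod name and IP:

# Cluster A: start a bare test pod and note its IP.
kubectl run nginx --image=nginx --restart=Never
kubectl get pod nginx -o wide

# Cluster B: ping the pod IP reported above (illustrative value).
ping -c 3 10.53.1.15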

Oats87 commented 5 years ago

Did you disable strict source/destination checking on the AWS nodes?
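
For reference, this can be done per instance with the AWS CLI (the instance ID below is a placeholder):

# Disable EC2 source/destination checking so the node will forward
# traffic to and from the remote pod and service CIDRs:
aws ec2 modify-instance-attribute --no-source-dest-check --instance-id i-0123456789abcdef0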

ishantanu commented 5 years ago

Yes, I did.

ishantanu commented 5 years ago

Just for info, here are the details of the three clusters.

  1. Bare-metal cluster (acting as the broker), created via the RKE CLI:

     domain: cluster.local
     cluster CIDR: 10.40.0.0/16
     service CIDR: 10.41.0.0/16

  2. Bare-metal cluster, created from the Rancher GUI:

     domain: xyz.cluster.local
     cluster CIDR: 10.53.0.0/16
     service CIDR: 10.54.0.0/16

  3. AWS cluster, created from the Rancher GUI:

     domain: aws.cluster.local
     cluster CIDR: 10.51.0.0/16
     service CIDR: 10.52.0.0/16

Networking plugin used on all clusters: Flannel.
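
A common way to double-check the CIDRs a cluster is actually running with is to grep them out of a cluster dump; this is a best-effort sketch, since where the flags appear varies by distribution:

# Look for the controller-manager/apiserver flags in the dump:
kubectl cluster-info dump | grep -m 1 -- --cluster-cidr
kubectl cluster-info dump | grep -m 1 -- --service-cluster-ip-range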

Oats87 commented 5 years ago

OK. I have had to reboot my nodes before after an unsuccessful HA failover, and this feels similar; is it possible for you to reboot your worker nodes? I know that's a bit of a big hammer for the situation, but it should help. There is an issue with updating the iptables/IPsec rules that isn't fully debugged at this point.

ishantanu commented 5 years ago

Well, I already rebooted all the worker nodes of the non-broker bare-metal cluster, and it did not help at the time. But let me try it one more time.

ishantanu commented 5 years ago

The worker nodes have been rebooted as well. It still does not work.

Oats87 commented 5 years ago

On the elected gateway nodes, can you run ip xfrm state to see whether you have established IPsec tunnels?

ishantanu commented 5 years ago

It returns nothing:

ubuntu@gateway:~$ sudo ip xfrm state
ubuntu@gateway:~$
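
That empty output is meaningful: ip xfrm state lists the kernel's IPsec security associations, and with working tunnels there would be an entry per direction showing the two gateway IPs, proto esp, an SPI, and mode tunnel. The kernel policy database can be checked the same way:

# No SAs above means nothing was negotiated; with working tunnels the
# policy database should also hold entries matching the remote CIDRs:
sudo ip xfrm policy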
Oats87 commented 5 years ago

This is indicative of the IPsec tunnels not being established properly; are you able to restart the submariner gateway pod and take a look at the logs to see if there are messages from strongSwan related to establishing the tunnel?

ishantanu commented 5 years ago

Okay. So, I restarted the gateway pods on both clusters, and the logs show strongSwan messages with errors:

00[DMN] Starting IKE charon daemon (strongSwan 5.5.1, Linux 4.15.0-47-generic, x86_64)
00[KNL] unable to create IPv4 routing table rule
00[KNL] unable to create IPv6 routing table rule
00[CFG] loading ca certificates from '/usr/local/etc/ipsec.d/cacerts'
00[LIB] opening directory '/usr/local/etc/ipsec.d/cacerts' failed: No such file or directory
00[CFG]   reading directory failed
00[CFG] loading aa certificates from '/usr/local/etc/ipsec.d/aacerts'
00[LIB] opening directory '/usr/local/etc/ipsec.d/aacerts' failed: No such file or directory
00[CFG]   reading directory failed
00[CFG] loading ocsp signer certificates from '/usr/local/etc/ipsec.d/ocspcerts'
00[LIB] opening directory '/usr/local/etc/ipsec.d/ocspcerts' failed: No such file or directory
00[CFG]   reading directory failed
00[CFG] loading attribute certificates from '/usr/local/etc/ipsec.d/acerts'
00[LIB] opening directory '/usr/local/etc/ipsec.d/acerts' failed: No such file or directory
00[CFG]   reading directory failed
00[CFG] loading crls from '/usr/local/etc/ipsec.d/crls'
00[LIB] opening directory '/usr/local/etc/ipsec.d/crls' failed: No such file or directory
00[CFG]   reading directory failed

@Oats87 Any idea what might be wrong? I tried recreating the clusters with new instances, and somehow the error stays the same.
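
Two hedged observations on that log: the "unable to create IPv4/IPv6 routing table rule" lines typically mean charon could not install its ip rule for strongSwan's own routing table (220), which often points at a privileges (NET_ADMIN) problem, while the missing ipsec.d certificate directories are usually harmless when no certificates are in use. If the gateway image ships the standard strongSwan ipsec wrapper, charon's view of the connections can be queried directly (namespace and pod name below are placeholders):

# Ask charon for connection and SA status from inside the gateway pod:
kubectl -n submariner exec -it submariner-engine-xxxxx -- ipsec statusall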

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity for 120 days. It will be closed if no further activity occurs. Please make a comment if this issue/PR is still valid. Thank you for your contributions.

eremcan commented 11 months ago

Did you ever resolve the issue? I am facing the same one.