submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Epic: Gateway HA with OpenShift #1180

Closed: nyechiel closed this issue 3 years ago

nyechiel commented 3 years ago

Out of the box, ensure that the Submariner gateway is deployed in active/passive HA mode on OpenShift-based clusters, with at least one additional node (other than the active one) labeled and ready to take over.

Previous related work from @sridhargaddam:

  1. https://github.com/submariner-io/submariner-operator/issues/586
  2. https://docs.google.com/spreadsheets/d/1JsXsyRDDXkp6t55Gm-NP5EggWTyYi2yo27pyuDYwlpc/edit#gid=0

Goal:

When deploying with OpenShift, ensure that there are at least two nodes in two different AZs that are labeled properly and can act as gateways. The idea is that in case of a gateway node or AZ failure, data path traffic will be minimally impacted.
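
For illustration, the intended end state could be produced with node labels roughly like this (a sketch assuming the standard `submariner.io/gateway=true` gateway label and the well-known `topology.kubernetes.io/zone` zone label; the placeholder node names are hypothetical):

```
# List worker nodes together with their availability zones.
oc get nodes -L topology.kubernetes.io/zone

# Label one node in each of two different AZs as gateway-capable.
# One gateway becomes active; the other stays passive, ready to take over.
oc label node <node-in-az-a> submariner.io/gateway=true
oc label node <node-in-az-b> submariner.io/gateway=true
```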

Key work items:

  1. Deploy this manually on AWS and test what happens...

    • Is there a way to disconnect an AZ, or shall we remove all instances in that AZ? (not a normal power-off, but a forced stop/shutdown)
    • When an AZ is down, will the GW on the disconnected AZ leave leadership? Will the passive GW take leadership?
    • What happens when the AZ goes back?
  2. Assuming tests are OK, how should we automate the deployment piece?

    • For ACM/cloud-prepare repo (for OpenShift/IPI, we need to deploy an additional node)
    • MVP is AWS, but we will need to figure out the plan for other providers (GCP, Azure, IBM)
  3. What happens in combination with NAT (on-prem)?

    • Probably each gateway node would need its own set of ports (different from the other gateways), which can be mapped from a router
    • For example:
      oc label node $node1 submariner.io/ipsec-natt-port=4500
      oc label node $node1 submariner.io/ipsec-ike-port=500
      oc label node $node2 submariner.io/ipsec-natt-port=4501
      oc label node $node2 submariner.io/ipsec-ike-port=501
  4. Document the expected behavior. We need to set the right expectations and explain how the system works.

aswinsuryan commented 3 years ago

I tried testing HA in a two-cluster setup, with each cluster in a different AWS region and without Globalnet enabled. To create an AZ failure, I added a network ACL that denies all ingress and egress traffic to the subnet of that AZ. This resulted in the nodes in the AZ going to a NotReady state.
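
For reference, this is roughly how the blackout can be done with the AWS CLI (a sketch; the VPC, subnet, and ACL IDs are placeholders):

```
# A newly created network ACL has no allow rules, so it denies all traffic.
aws ec2 create-network-acl --vpc-id <vpc-id>

# Find the current ACL association of the AZ's subnet.
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<subnet-id>

# Swap the subnet over to the deny-all ACL to black out the AZ.
# The call returns a new association ID; use that one for the restore.
aws ec2 replace-network-acl-association --association-id <association-id> --network-acl-id <deny-all-acl-id>

# Restore the original ACL to bring the AZ back.
aws ec2 replace-network-acl-association --association-id <new-association-id> --network-acl-id <original-acl-id>
```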

Cluster1 and cluster2 each had two gateways in active/passive mode. The AZ failure was simulated on the cluster1 node where the active gateway pod was running. Observations:

  1. The passive gateway took over the active role in cluster1 almost immediately.
  2. The new Endpoint object took almost 4 minutes to propagate to cluster2.
  3. Connectivity tests passed after the convergence.
  4. Tests continued to pass even after the AZ came back.

Note: Sometimes one or two tests failed on the first attempt after convergence, but they consistently passed after that.

mangelajo commented 3 years ago

OK, this is good news, thank you @aswinsuryan.

I think we need to investigate (2): why did it take 4 minutes to propagate? I suspect that since cluster-1 was the broker (we were discussing that in Slack), the cluster-2 gateway may have been connected to the broker API in the failed AZ and took time to detect the change.

mangelajo commented 3 years ago

Can we re-test by making the AZ failure on the non-broker cluster?

aswinsuryan commented 3 years ago

I tried it in the non-broker cluster and the changes got propagated almost immediately:

```
[asuryana@localhost openshift-aws]$ kubectl --kubeconfig asuryana-cluster-a/auth/kubeconfig get endpoints.submariner.io -n submariner-k8s-broker
NAME                                               AGE
cluster-a-submariner-cable-cluster-a-10-0-51-51    103m
cluster-b-submariner-cable-cluster-b-10-0-179-49   8s
[asuryana@localhost openshift-aws]$ kubectl --kubeconfig asuryana-cluster-a/auth/kubeconfig get endpoints.submariner.io -n submariner-operator
NAME                                               AGE
cluster-a-submariner-cable-cluster-a-10-0-51-51    103m
cluster-b-submariner-cable-cluster-b-10-0-179-49   24s
[asuryana@localhost openshift-aws]$ kubectl --kubeconfig asuryana-cluster-/auth/kubeconfig get endpoints.submariner.io -n submariner-operator
error: stat asuryana-cluster-/auth/kubeconfig: no such file or directory
[asuryana@localhost openshift-aws]$ kubectl --kubeconfig asuryana-cluster-b/auth/kubeconfig get endpoints.submariner.io -n submariner-operator
NAME                                               AGE
cluster-a-submariner-cable-cluster-a-10-0-51-51    103m
cluster-b-submariner-cable-cluster-b-10-0-179-49   47s
```

qiujian16 commented 3 years ago

That looks great. I think we need to test the impact on service access with gateway HA. Could you also point me to the gateway HA setup documentation?

aswinsuryan commented 3 years ago

@qiujian16 How do you set up AWS for deploying Submariner? Do you use the cloud prepare script? If you mark more than one node as gateway capable, Submariner will deploy a gateway on each of those nodes. I could prepare a document once I know how this is done.

qiujian16 commented 3 years ago

Yes, we use something similar to the cloud prepare scripts, but a Golang implementation: https://github.com/open-cluster-management/submariner-addon/blob/main/pkg/cloud/aws/aws.go. So in this case, we just need to provision multiple nodes and label them all as gateway nodes, is that correct?

aswinsuryan commented 3 years ago

> Yes, we use something similar to the cloud prepare scripts, but a Golang implementation: https://github.com/open-cluster-management/submariner-addon/blob/main/pkg/cloud/aws/aws.go. So in this case, we just need to provision multiple nodes and label them all as gateway nodes, is that correct?

Yes, that is right. You can start with two nodes, I think.

mangelajo commented 3 years ago

And those nodes need to be on separate availability zones.
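
A quick way to sanity-check that (assuming the standard gateway and zone labels) is something like:

```
# Show all gateway-labeled nodes together with their availability zones.
kubectl get nodes -l submariner.io/gateway=true -L topology.kubernetes.io/zone
```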


aswinsuryan commented 3 years ago

@mangelajo Since cloud prepare is not yet ready, shall we modify the ocp-ipi-aws-prep scripts to deploy gateway HA for IPI deployments? If so, should it be the default option, and should it be configurable?

sridhargaddam commented 3 years ago

> @mangelajo Since cloud prepare is not yet ready, shall we modify the ocp-ipi-aws-prep scripts to deploy gateway HA for IPI deployments? If so, should it be the default option, and should it be configurable?

If we decide to proceed with updating the prep_for_subm.sh (ocp-ipi-aws-prep) script, IMHO it's better to make it configurable and NOT the default. Most of the time we use AWS clusters for testing regular use cases (and NOT HA scenarios). Deploying an additional node (with a public IP) in a different AZ would, as you know, increase the cost as well.

aswinsuryan commented 3 years ago

Reopening to track the ACM part and documentation.

aswinsuryan commented 3 years ago

Tested an AZ failure on AWS with the other cluster on-prem, and it works. Did not use any explicit port configuration for the on-prem cluster.

It also works vice versa, when the gateway label is removed from the on-prem cluster and a failure is simulated there.

In either case, there is a delay before the passive gateway changes to connected: around 7 minutes when I checked the log once.

aswinsuryan commented 3 years ago

The 7-minute delay is observed when connecting the clusters via subctl join too. The issue is related to how Libreswan reports active connections when one cluster is on-prem. This log is seen in the active gateway pod of the AWS cluster:

I0409 09:15:33.753183 1 tunnel.go:63] Tunnel controller successfully installed Endpoint cable submariner-cable-asuryana-cluster-b-172-18-0-5 in the engine
I0409 09:15:33.753820 1 libreswan.go:181] Connection "submariner-cable-asuryana-cluster-b-172-18-0-5-0-0" not found in active connections obtained from whack: map[], map[]
I0409 09:15:33.753838 1 libreswan.go:181] Connection "submariner-cable-asuryana-cluster-b-172-18-0-5-0-1" not found in active connections obtained from whack: map[], map[]
I0409 09:15:33.753847 1 libreswan.go:181] Connection "submariner-cable-asuryana-cluster-b-172-18-0-5-0-2" not found in active connections obtained from whack: map[], map[]
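
As an aside, the set of active connections that the cable driver compares against can also be listed manually from inside the gateway pod; a sketch, assuming the gateway pods carry the `app=submariner-gateway` label (the exact label may differ by version):

```
# Find the active gateway pod and ask Libreswan for its established tunnels.
GW_POD=$(kubectl -n submariner-operator get pods -l app=submariner-gateway -o jsonpath='{.items[0].metadata.name}')
kubectl -n submariner-operator exec "$GW_POD" -- ipsec whack --trafficstatus
```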

The on-prem cluster does not seem to have this issue in its logs.

It would be worth re-testing this with the 0.9.0 release, as there are changes in the way we detect NAT.