vmware-tanzu / tanzu-framework

Tanzu Framework provides a set of building blocks to build atop of the Tanzu platform and leverages Carvel packaging and plugins to provide users with a much stronger, more integrated experience than the loose coupling and stand-alone commands of the previous generation of tools.

vSphere connectivity check before creating management cluster #88

Open iancoffey opened 3 years ago

iancoffey commented 3 years ago

tl;dr: We’d like to add logic in the CLI to synchronously verify HTTPS connectivity from the bootstrap cluster to vSphere Server, before proceeding with management-cluster creation.
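For illustration only, a minimal sketch of the kind of probe we have in mind is below. The `/sdk` path, 30-second timeout, TLS handling, and use of the `VSPHERE_SERVER` variable are assumptions, and to be useful for this issue the probe would have to run from inside the bootstrap cluster (for example in a short-lived pod or Job), not on the machine running the CLI:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

// probeVSphere makes a single HTTPS request against the vSphere SDK endpoint
// and reports whether it is reachable within the given timeout. To matter for
// this issue, it would need to run from inside the bootstrap cluster.
func probeVSphere(server string, timeout time.Duration) error {
	client := &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			// Honor HTTP_PROXY/HTTPS_PROXY/NO_PROXY, like the environment
			// the capv-controller runs in.
			Proxy: http.ProxyFromEnvironment,
			// vSphere endpoints often use self-signed certificates; this
			// check is about connectivity, not trust.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s/sdk", server))
	if err != nil {
		return fmt.Errorf("vSphere server %q is not reachable: %w", server, err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	server := os.Getenv("VSPHERE_SERVER")
	if err := probeVSphere(server, 30*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("vSphere server is reachable")
}
```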

Describe the feature request

While working on deploying IPv6-only clusters on vSphere, we noticed that when the bootstrap or cleanup cluster cannot reach the vSphere server, it is very hard for the user to understand why the deployment failed, because the only feedback they receive is a deployment timeout.

So we (TKG networking) are interested in providing faster feedback to the user in these two cases, where we think a misconfiguration is likely to result in a failed deployment of the management cluster:

There may be other similar cases in which management cluster deployment fails because the Cluster API provider is, for some reason, unable to reach the infrastructure. In some of these cases it is possible to check this from the CLI itself. We think this is already done to check that a client on the bootstrap machine is able to authenticate with the vSphere server. That covers a large class of misconfiguration issues and narrows the class of issues we need to be concerned about, namely those in which connectivity from clients within the bootstrap cluster fails even though it succeeds from a client running directly on the bootstrap machine.
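For reference, a bootstrap-machine authentication check of that kind could look roughly like the following govmomi sketch. This is an assumption about the shape of such a check, not a description of what the CLI actually does; the `VSPHERE_*` environment variable names are only stand-ins for the cluster configuration variables:

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"os"
	"time"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/vim25/soap"
)

// checkVSphereLogin verifies that the given credentials can authenticate with
// the vSphere server from the machine running the CLI.
func checkVSphereLogin(ctx context.Context, server, username, password string) error {
	u, err := soap.ParseURL(server)
	if err != nil {
		return fmt.Errorf("parsing vSphere server URL: %w", err)
	}
	u.User = url.UserPassword(username, password)

	// NewClient dials https://<server>/sdk and performs a session login.
	client, err := govmomi.NewClient(ctx, u, true /* insecure */)
	if err != nil {
		return fmt.Errorf("authenticating with vSphere server %q: %w", server, err)
	}
	return client.Logout(ctx)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	err := checkVSphereLogin(ctx,
		os.Getenv("VSPHERE_SERVER"),
		os.Getenv("VSPHERE_USERNAME"),
		os.Getenv("VSPHERE_PASSWORD"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("authenticated with vSphere server")
}
```

A check like this succeeding on the bootstrap machine is exactly what leaves the remaining gap described above: connectivity from inside the bootstrap cluster can still fail.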

Describe alternatives you've considered

For the deployment of IPv6-only clusters, we considered adding ip6tables rules to mitigate Docker NAT issues; however, that may be too invasive and would require privileges we cannot expect to have, so we cannot install such a rule automatically.

Affected product area (please put an X in all that apply)

Additional context

We tested different scenarios to observe the current CLI behavior.

HTTP_PROXY configured but the proxy is not running or not available

The deployment fails after reaching the default 30-minute timeout. The following error message in the capv-controller indicates the failure:

E0625 16:34:43.312469       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="Post \"https://sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com/sdk\": proxyconnect tcp: dial tcp [fd01:0:106:7:0:82ff:feb6:354f]:3128: i/o timeout" "controller"="vspherevm" "name"="tkg-mgmt-vsphere-20210624153838-control-plane-56whx" "namespace"="tkg-system"

Besides that, the VCenterAvailable condition in the status of VSphereCluster is set to False:

status:
  conditions:
  - lastTransitionTime: "2021-06-24T22:43:53Z"
    message: Secret "tkg-mgmt-vsphere-20210624153838" not found
    reason: VCenterUnreachable
    severity: Error
    status: "False"
    type: VCenterAvailable

iptables rules do not include the masquerade rule

The installation fails as in the previous case with the same outcomes (logs and status).

VSPHERE_SERVER in NO_PROXY, DNS not configured

The installation fails due to the timeout, but this time the message in the capv-controller is different:

E0625 22:17:57.957052       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="Post \"https://sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com/sdk\": dial tcp: lookup sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com on [fd00:10:96::a]:53: server misbehaving" "controller"="vspherevm" "name"="tkg-mgmt-vsphere-20210625151046-control-plane-mtrbl" "namespace"="tkg-system"

The status of VSphereCluster is the same as in the two previous cases.

Collated Context

Context from 2021-06-29 13:42:33 User: mcwumbly

I think this is notable:

Besides that, the VCenterAvailable condition in the status of VSphereCluster is set to False:

status:
  conditions:
  - lastTransitionTime: "2021-06-24T22:43:53Z"
    message: Secret "tkg-mgmt-vsphere-20210624153838" not found
    reason: VCenterUnreachable
    severity: Error
    status: "False"
    type: VCenterAvailable

...

The status of VSphereCluster is the same as in the two previous cases

Given that, I can imagine three possible approaches for this issue:

  1. Don't rely on this status; just add a separate synchronous call somewhere (e.g. what is proposed above).
  2. Introduce a check on this status into the CLI itself; bubble it up to the user and error out if it is present for more than some amount of time (e.g. ~30s); see the sketch after this list.
  3. Introduce some timeout into Cluster API itself for that status, so that the error is propagated up to the Cluster (which I think the CLI already checks).
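As a rough sketch of option (2), the CLI could poll the VSphereCluster's VCenterAvailable condition through the bootstrap cluster's API server and give up after ~30s. The API version, namespace, object name, and kubeconfig handling below are assumptions, not the CLI's actual wiring:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// vCenterAvailable reports whether the named VSphereCluster has its
// VCenterAvailable condition set to "True".
func vCenterAvailable(ctx context.Context, dyn dynamic.Interface, namespace, name string) (bool, string, error) {
	// Group/resource for CAPV's VSphereCluster; the API version varies
	// between releases and is an assumption here.
	gvr := schema.GroupVersionResource{
		Group:    "infrastructure.cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "vsphereclusters",
	}
	obj, err := dyn.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, "", err
	}
	conds, _, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil {
		return false, "", err
	}
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "VCenterAvailable" {
			msg, _ := cond["message"].(string)
			return cond["status"] == "True", msg, nil
		}
	}
	return false, "VCenterAvailable condition not reported yet", nil
}

func main() {
	// Assumes the bootstrap cluster kubeconfig is the current context; the
	// namespace comes from the logs above, the object name is hypothetical.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	deadline := time.Now().Add(30 * time.Second)
	for {
		ok, msg, err := vCenterAvailable(context.TODO(), dyn, "tkg-system", "tkg-mgmt-vsphere")
		if err == nil && ok {
			fmt.Println("vCenter reachable from the bootstrap cluster")
			return
		}
		if time.Now().After(deadline) {
			fmt.Printf("vCenter still not reachable after 30s: %s (err: %v)\n", msg, err)
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```

This consumes the same condition shown in the YAML status above, just programmatically, and would let the CLI fail fast instead of waiting for the full deployment timeout.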

These may not be mutually exclusive, and we could opt for (1) initially, with (3) as a follow-up if it seems like a valuable upstream contribution, for instance.

@andrewsykim @yastij @Anuj2512 - any initial thoughts on these ideas?

Context from 2021-06-29 17:57:15 User: mike1808

I really like the third approach, because it would make the CLI work automatically: the CLI already checks that the status of the Cluster resource is not an error, so if we bubble up the statuses and conditions from VSphereCluster to Cluster, the CLI will stop and tell the user that something is broken.
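To make that bubbling-up a bit more concrete: Cluster API's condition utilities already provide a mirroring primitive. The following is a minimal, self-contained sketch, using a second Cluster object as a stand-in for the VSphereCluster and a hypothetical VSphereClusterReady condition name; it illustrates the primitive, not how CAPV/CAPI currently wire things together:

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

func main() {
	// Stand-in for the infrastructure object; in practice this would be the
	// VSphereCluster, whose VCenterAvailable=False condition would drive its
	// Ready condition to False. Here we mark Ready=False directly for brevity.
	infra := &clusterv1.Cluster{}
	conditions.MarkFalse(infra, clusterv1.ReadyCondition,
		"VCenterUnreachable", clusterv1.ConditionSeverityError,
		"cannot reach vCenter endpoint")

	// Target Cluster object that the CLI already watches.
	cluster := &clusterv1.Cluster{}

	// Mirror the source object's Ready condition onto the Cluster under a
	// named condition type, so the failure is visible where the CLI looks.
	conditions.SetMirror(cluster, clusterv1.ConditionType("VSphereClusterReady"), infra)

	for _, c := range cluster.GetConditions() {
		fmt.Printf("%s=%s reason=%s severity=%s message=%q\n",
			c.Type, c.Status, c.Reason, c.Severity, c.Message)
	}
}
```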

mcwumbly commented 3 years ago

I really like the third approach, because it would make the CLI work automatically: the CLI already checks that the status of the Cluster resource is not an error, so if we bubble up the statuses and conditions from VSphereCluster to Cluster, the CLI will stop and tell the user that something is broken.

@gwang550 and I spent some time looking into different failure cases (and success cases!) and noticed that currently, the VCenterAvailable status/condition is not reliable. It is often False for long periods of time, even when machines are being successfully created. We will capture more details about this and open a separate issue upstream, but until that is resolved it may be tricky to take our desired approach here.

See also issues #105 and #106.