vmware-tanzu / tanzu-framework

Tanzu Framework provides a set of building blocks to build atop of the Tanzu platform and leverages Carvel packaging and plugins to provide users with a much stronger, more integrated experience than the loose coupling and stand-alone commands of the previous generation of tools.

vSphere connectivity check before creating management cluster #88

Open iancoffey opened 3 years ago

iancoffey commented 3 years ago

tl;dr: We’d like to add logic in the CLI to synchronously verify HTTPS connectivity from the bootstrap cluster to vSphere Server, before proceeding with management-cluster creation.
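For illustration only, a minimal sketch of the kind of probe we have in mind is below. The `/sdk` path, 30-second timeout, TLS handling, and use of the `VSPHERE_SERVER` variable are assumptions, and to be useful for this issue the probe would have to run from inside the bootstrap cluster (for example in a short-lived pod or Job), not on the machine running the CLI:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

// probeVSphere makes a single HTTPS request against the vSphere SDK endpoint
// and reports whether it is reachable within the given timeout. To matter for
// this issue, it would need to run from inside the bootstrap cluster.
func probeVSphere(server string, timeout time.Duration) error {
	client := &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			// Honor HTTP_PROXY/HTTPS_PROXY/NO_PROXY, like the environment
			// the capv-controller runs in.
			Proxy: http.ProxyFromEnvironment,
			// vSphere endpoints often use self-signed certificates; this
			// check is about connectivity, not trust.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s/sdk", server))
	if err != nil {
		return fmt.Errorf("vSphere server %q is not reachable: %w", server, err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	server := os.Getenv("VSPHERE_SERVER")
	if err := probeVSphere(server, 30*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("vSphere server is reachable")
}
```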

Describe the feature request

While working on deploying IPv6-only clusters on vSphere, we noticed that when the bootstrap or cleanup cluster cannot reach the vSphere server, it is very hard for the user to understand why the deployment failed, because the only feedback they receive is a deployment timeout.

So we (TKG networking) are interested in providing faster feedback to the user in these two cases, where we think a misconfiguration is likely to result in a failed deployment of the management cluster:

There may be other similar cases in which management cluster deployment fails because the Cluster API provider is, for some reason, unable to reach the infrastructure. In some of these cases it is possible to check this from the CLI itself. We think this is already done to check that a client on the bootstrap machine is able to authenticate with the vSphere server. That covers a large class of misconfiguration issues and narrows the class of issues we need to be concerned about, namely those in which connectivity from clients within the bootstrap cluster fails even though it succeeds from a client running directly on the bootstrap machine.
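For reference, a bootstrap-machine authentication check of that kind could look roughly like the following govmomi sketch. This is an assumption about the shape of such a check, not a description of what the CLI actually does; the `VSPHERE_*` environment variable names are only stand-ins for the cluster configuration variables:

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"os"
	"time"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/vim25/soap"
)

// checkVSphereLogin verifies that the given credentials can authenticate with
// the vSphere server from the machine running the CLI.
func checkVSphereLogin(ctx context.Context, server, username, password string) error {
	u, err := soap.ParseURL(server)
	if err != nil {
		return fmt.Errorf("parsing vSphere server URL: %w", err)
	}
	u.User = url.UserPassword(username, password)

	// NewClient dials https://<server>/sdk and performs a session login.
	client, err := govmomi.NewClient(ctx, u, true /* insecure */)
	if err != nil {
		return fmt.Errorf("authenticating with vSphere server %q: %w", server, err)
	}
	return client.Logout(ctx)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	err := checkVSphereLogin(ctx,
		os.Getenv("VSPHERE_SERVER"),
		os.Getenv("VSPHERE_USERNAME"),
		os.Getenv("VSPHERE_PASSWORD"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("authenticated with vSphere server")
}
```

A check like this succeeding on the bootstrap machine is exactly what leaves the remaining gap described above: connectivity from inside the bootstrap cluster can still fail.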

Describe alternatives you've considered

For the deployment of IPv6-only clusters, we considered adding ip6tables rules to mitigate Docker NAT issues; however, that may be too invasive and would require privileges we cannot expect to have, so we cannot install such a rule automatically.

Affected product area (please put an X in all that apply)

Additional context

We tested different scenarios to observe the current CLI behavior.

HTTP_PROXY configured but the proxy is not running or not available

The deployment fails after reaching the default 30-minute timeout. The following error message in the capv-controller indicates the failure:

E0625 16:34:43.312469       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="Post \"https://sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com/sdk\": proxyconnect tcp: dial tcp [fd01:0:106:7:0:82ff:feb6:354f]:3128: i/o timeout" "controller"="vspherevm" "name"="tkg-mgmt-vsphere-20210624153838-control-plane-56whx" "namespace"="tkg-system"

Besides that, the VCenterAvailable condition in the status of VSphereCluster is set to False:

status:
  conditions:
  - lastTransitionTime: "2021-06-24T22:43:53Z"
    message: Secret "tkg-mgmt-vsphere-20210624153838" not found
    reason: VCenterUnreachable
    severity: Error
    status: "False"
    type: VCenterAvailable

iptables rules do not include the masquerade rule

The installation fails as in the previous case with the same outcomes (logs and status).

VSPHERE_SERVER in NO_PROXY, DNS not configured

The installation fails due to the timeout, but this time the message in the capv-controller is different:

E0625 22:17:57.957052       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="Post \"https://sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com/sdk\": dial tcp: lookup sc1-01-dhcpv6-v2618-18bf.ipv6.eng.vmware.com on [fd00:10:96::a]:53: server misbehaving" "controller"="vspherevm" "name"="tkg-mgmt-vsphere-20210625151046-control-plane-mtrbl" "namespace"="tkg-system"

The status of VSphereCluster is the same as in the two previous cases.

Collated Context

Context from 2021-06-29 13:42:33 User: mcwumbly

I think this is notable:

Besides that, the VCenterAvailable condition in the status of VSphereCluster is set to False:

status:
  conditions:
  - lastTransitionTime: "2021-06-24T22:43:53Z"
    message: Secret "tkg-mgmt-vsphere-20210624153838" not found
    reason: VCenterUnreachable
    severity: Error
    status: "False"
    type: VCenterAvailable

...

The status of VSphereCluster is the same as in the two previous cases

Given that, I can imagine three possible approaches for this issue:

  1. Don't rely on this status; just add a separate synchronous call somewhere (e.g. what is proposed above).
  2. Introduce a check on this status into the CLI itself; bubble it up to the user and error out if it is present for more than some amount of time (e.g. ~30s); see the sketch after this list.
  3. Introduce some timeout into Cluster API itself for that status, so that the error is propagated up to the Cluster (which I think the CLI already checks).
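As a rough sketch of option (2), the CLI could poll the VSphereCluster's VCenterAvailable condition through the bootstrap cluster's API server and give up after ~30s. The API version, namespace, object name, and kubeconfig handling below are assumptions, not the CLI's actual wiring:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// vCenterAvailable reports whether the named VSphereCluster has its
// VCenterAvailable condition set to "True".
func vCenterAvailable(ctx context.Context, dyn dynamic.Interface, namespace, name string) (bool, string, error) {
	// Group/resource for CAPV's VSphereCluster; the API version varies
	// between releases and is an assumption here.
	gvr := schema.GroupVersionResource{
		Group:    "infrastructure.cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "vsphereclusters",
	}
	obj, err := dyn.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, "", err
	}
	conds, _, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil {
		return false, "", err
	}
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "VCenterAvailable" {
			msg, _ := cond["message"].(string)
			return cond["status"] == "True", msg, nil
		}
	}
	return false, "VCenterAvailable condition not reported yet", nil
}

func main() {
	// Assumes the bootstrap cluster kubeconfig is the current context; the
	// namespace comes from the logs above, the object name is hypothetical.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	deadline := time.Now().Add(30 * time.Second)
	for {
		ok, msg, err := vCenterAvailable(context.TODO(), dyn, "tkg-system", "tkg-mgmt-vsphere")
		if err == nil && ok {
			fmt.Println("vCenter reachable from the bootstrap cluster")
			return
		}
		if time.Now().After(deadline) {
			fmt.Printf("vCenter still not reachable after 30s: %s (err: %v)\n", msg, err)
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```

This consumes the same condition shown in the YAML status above, just programmatically, and would let the CLI fail fast instead of waiting for the full deployment timeout.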

These may not be mutually exclusive, and we could opt for (1) initially, with (3) as a follow-up if it seems like a valuable upstream contribution, for instance.

@andrewsykim @yastij @Anuj2512 - any initial thoughts on these ideas?

Context from 2021-06-29 17:57:15 User: mike1808

I really like the third approach, because it would make the CLI work automatically: the CLI already checks that the status of the Cluster resource is not an error, so if we bubble up the statuses and conditions from VSphereCluster to Cluster, the CLI will stop and tell the user that something is broken.
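To make that bubbling-up a bit more concrete: Cluster API's condition utilities already provide a mirroring primitive. The following is a minimal, self-contained sketch, using a second Cluster object as a stand-in for the VSphereCluster and a hypothetical VSphereClusterReady condition name; it illustrates the primitive, not how CAPV/CAPI currently wire things together:

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

func main() {
	// Stand-in for the infrastructure object; in practice this would be the
	// VSphereCluster, whose VCenterAvailable=False condition would drive its
	// Ready condition to False. Here we mark Ready=False directly for brevity.
	infra := &clusterv1.Cluster{}
	conditions.MarkFalse(infra, clusterv1.ReadyCondition,
		"VCenterUnreachable", clusterv1.ConditionSeverityError,
		"cannot reach vCenter endpoint")

	// Target Cluster object that the CLI already watches.
	cluster := &clusterv1.Cluster{}

	// Mirror the source object's Ready condition onto the Cluster under a
	// named condition type, so the failure is visible where the CLI looks.
	conditions.SetMirror(cluster, clusterv1.ConditionType("VSphereClusterReady"), infra)

	for _, c := range cluster.GetConditions() {
		fmt.Printf("%s=%s reason=%s severity=%s message=%q\n",
			c.Type, c.Status, c.Reason, c.Severity, c.Message)
	}
}
```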

mcwumbly commented 3 years ago

I really like the third approach, because it would make the CLI work automatically: the CLI already checks that the status of the Cluster resource is not an error, so if we bubble up the statuses and conditions from VSphereCluster to Cluster, the CLI will stop and tell the user that something is broken.

@gwang550 and I spent some time looking into different failure cases (and success cases!) and noticed that currently, the VCenterAvailable status/condition is not reliable. It is often False for long periods of time, even when machines are being successfully created. We will capture more details about this and open a separate issue upstream, but until that is resolved it may be tricky to take our desired approach here.

See also issues #105 and #106.