Open: iancoffey opened this issue 3 years ago
@gwang550 and I spent some time looking into different failure cases (and success cases!) and noticed that currently the `VCenterAvailable` status/condition is not reliable: it is often `False` for long periods of time, even when machines are being successfully created. We will capture more details about this and open a separate issue upstream, but until that is resolved it may be tricky to take our desired approach here.
See also issues #105 and #106.
tl;dr: We’d like to add logic in the CLI to synchronously verify HTTPS connectivity from the bootstrap cluster to the vSphere server before proceeding with management-cluster creation.
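For illustration, here is a minimal sketch (not the actual tanzu CLI code) of the simplest form of such a check, run directly on the bootstrap machine; the hostname source and timeout are assumptions. The in-cluster variant, which is what we ultimately want, is sketched under the feature request below.

```go
// Hypothetical pre-flight helper: fail fast if the vSphere server is not
// reachable over HTTPS, instead of waiting for the deployment timeout.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
	"time"
)

func checkVSphereHTTPS(host string, timeout time.Duration) error {
	dialer := &net.Dialer{Timeout: timeout}
	// InsecureSkipVerify: this probe only tests reachability; certificate
	// validation happens later in the normal deployment flow.
	conn, err := tls.DialWithDialer(dialer, "tcp", net.JoinHostPort(host, "443"),
		&tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return fmt.Errorf("cannot reach vSphere server %q over HTTPS: %w", host, err)
	}
	return conn.Close()
}

func main() {
	if err := checkVSphereHTTPS(os.Getenv("VSPHERE_SERVER"), 10*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```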
Describe the feature request
During our work deploying IPv6-only clusters on vSphere, we noticed that when the bootstrap or cleanup cluster cannot reach the vSphere server, it is very non-obvious to the user what the particular deployment failure reason is, as the only feedback they receive is a deployment timeout.
So we (TKG networking) are interested in providing faster feedback to the user in these two cases, where we think it is likely that misconfiguration will result in a failed deployment of the management cluster:

- the `VSPHERE_SERVER` domain name cannot be resolved from within the bootstrap cluster
- the `VSPHERE_SERVER` domain name resolves, but connections to it from within the bootstrap cluster fail

There may be other similar cases that cause the management cluster deploy to fail when, for some reason, the Cluster API provider is unable to reach the infrastructure. In some cases, it is possible to check this from the CLI itself. We think this is already done to check that a client on the bootstrap machine is able to authenticate with the vSphere server. That covers a large class of misconfiguration issues and narrows the class of issues we need to be concerned about, namely those in which connectivity from clients within the bootstrap cluster fails even though it succeeds from a client running directly on the bootstrap machine.
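One way to probe from within the bootstrap cluster, rather than from the bootstrap machine, is a short-lived Job that attempts an HTTPS request to the vSphere server, exercising the cluster's own DNS, proxy, and NAT path. A minimal sketch assuming client-go; the namespace, Job name, probe image, and timeouts are all illustrative, not what the CLI does today:

```go
package probe

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func int32Ptr(i int32) *int32 { return &i }

// ProbeVSphereFromBootstrapCluster runs a one-shot Job inside the
// bootstrap cluster that curls https://<vsphereServer>, then polls the
// Job status so the CLI can report success or failure synchronously.
func ProbeVSphereFromBootstrapCluster(ctx context.Context, cs kubernetes.Interface, vsphereServer string) error {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "vsphere-connectivity-probe", Namespace: "default"},
		Spec: batchv1.JobSpec{
			BackoffLimit: int32Ptr(0), // one attempt; we only want a quick signal
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "probe",
						Image: "curlimages/curl", // illustrative probe image
						Args:  []string{"--insecure", "--max-time", "10", "https://" + vsphereServer},
					}},
				},
			},
		},
	}
	if _, err := cs.BatchV1().Jobs(job.Namespace).Create(ctx, job, metav1.CreateOptions{}); err != nil {
		return err
	}
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		j, err := cs.BatchV1().Jobs(job.Namespace).Get(ctx, job.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if j.Status.Succeeded > 0 {
			return nil
		}
		if j.Status.Failed > 0 {
			return fmt.Errorf("bootstrap cluster cannot reach vSphere server %q over HTTPS", vsphereServer)
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for vSphere connectivity probe")
}
```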
Describe alternatives you've considered
For deployment of IPv6-only clusters, to mitigate Docker NAT issues we considered adding `ip6tables` rules; however, that may be too invasive and would require privileges we can't assume, so we cannot install such a rule automatically.
Additional context
We tested different scenarios to observe the current CLI behavior.
HTTP_PROXY configured but not running or not available

The deployment fails after reaching the default 30-minute timeout. The following error message in `capv-controller` is an indication of the error:

Besides that, the `VCenterAvailable` condition in the status of `VSphereCluster` is set to `False`:

IP tables do not include the masquerade rule
The installation fails as in the previous case with the same outcomes (logs and status).
VSPHERE_SERVER in NO_PROXY, DNS not configured
The installation fails due to timeout, but this time the message in `capv-controller` is different:

The status of `VSphereCluster` is the same as in the two previous cases.
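For completeness, a minimal sketch, assuming the client-go dynamic client, of reading the `VCenterAvailable` condition observed in all three scenarios off the `VSphereCluster` status without importing the CAPV API types; the group/version below is what CAPV served around the time of this issue and may differ:

```go
package inspect

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var vsphereClusterGVR = schema.GroupVersionResource{
	Group:    "infrastructure.cluster.x-k8s.io",
	Version:  "v1alpha3", // assumed version; adjust to what the cluster serves
	Resource: "vsphereclusters",
}

// PrintVCenterAvailable fetches a VSphereCluster and reports the
// VCenterAvailable condition, if present.
func PrintVCenterAvailable(ctx context.Context, dc dynamic.Interface, ns, name string) error {
	obj, err := dc.Resource(vsphereClusterGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	conds, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return fmt.Errorf("no status.conditions on VSphereCluster %s/%s", ns, name)
	}
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "VCenterAvailable" {
			fmt.Printf("VCenterAvailable=%v reason=%v message=%v\n",
				cond["status"], cond["reason"], cond["message"])
		}
	}
	return nil
}
```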
Collated Context

Context from 2021-06-29 13:42:33 User: mcwumbly

I think this is notable:
Given that, I can imagine three possible approaches for this issue:
These may not be mutually exclusive, and we could opt for (1) initially, with (3) as a follow-up if it seems like a valuable upstream contribution, for instance.
@andrewsykim @yastij @Anuj2512 - any initial thoughts on these ideas?
Context from 2021-06-29 17:57:15 User: mike1808

I really like the third approach, because doing that will make the CLI work automatically: the CLI checks that the status of the `Cluster` resource is not an error, so if we bubble up the statuses and conditions from `VSphereCluster` to `Cluster`, the CLI will stop telling the user that something is broken.
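For concreteness, a rough sketch of that bubbling, assuming cluster-api's `util/conditions` helpers; this is illustrative, not an actual CAPV patch (in practice the provider-to-Cluster propagation would go through the Cluster API contract):

```go
package reconcile

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// bubbleUpVCenterAvailable copies the VCenterAvailable condition from the
// VSphereCluster (a conditions.Getter) onto another object (a
// conditions.Setter), e.g. one the CLI already watches, so a False
// condition surfaces to the user instead of an opaque timeout.
func bubbleUpVCenterAvailable(to conditions.Setter, vsphereCluster conditions.Getter) {
	const vcenterAvailable clusterv1.ConditionType = "VCenterAvailable"
	if c := conditions.Get(vsphereCluster, vcenterAvailable); c != nil {
		conditions.Set(to, c.DeepCopy())
	}
}
```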