Open lorenz opened 1 month ago
I think this is a larger problem than just this particular scenario.
More generally, the problem is detecting and bubbling up persistent errors in the RPC resolver library. Currently we treat every error as transient and fall back to our retry loop. That is the preferred behavior when the resolver library is used by machines talking to other machines, where we assume there are no client misconfigurations that would result in permanent errors. But it doesn't work the same way for metroctl, which may well have been misconfigured.
I think an immediate enhancement would be to bubble up resolver errors (slightly cleaned up) to the user, so that they can at least tell that something is going wrong.
Alternatively, there could be a way to configure the resolver (or even the client channel?) in a fail-early mode for interactive tools to use, similar to some of the (deprecated) options in gRPC dialing/calling.
Since we now verify the cluster CA (with the TOFU work), metroctl operations hang silently if the CA doesn't match that of the cluster being connected to.