metroctl: hangs if CA doesn't match

Since we now verify the cluster CA (with the TOFU work), metroctl operations hang silently if the configured CA doesn't match the CA of the cluster being connected to. The gRPC logs show the actual failure:

2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel created
2024/05/27 23:59:29 INFO: [core] [Channel #3]Channel Connectivity change to CONNECTING
2024/05/27 23:59:29 INFO: [core] [Channel #3]Channel exiting idle mode
2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel Connectivity change to CONNECTING
2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel picks a new address "172.17.188.173:7835" to connect
2024/05/27 23:59:29 INFO: [core] [pick-first-lb 0xc000191260] Received SubConn state update: 0xc0001912f0, {ConnectivityState:CONNECTING ConnectionError:<nil>}
2024/05/27 23:59:29 INFO: [core] Creating new client transport to "{Addr: \"172.17.188.173:7835\", ServerName: \"172.17.188.173:7835\", }": connection error: desc = "transport: authentication handshake failed: node certificate verification failed: signature veritifcation failed: x509: Ed25519 verification failure"
2024/05/27 23:59:29 WARNING: [core] [Channel #3 SubChannel #4]grpc: addrConn.createTransport failed to connect to {Addr: "172.17.188.173:7835", ServerName: "172.17.188.173:7835", }. Err: connection error: desc = "transport: authentication handshake failed: node certificate verification failed: signature veritifcation failed: x509: Ed25519 verification failure"
2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: authentication handshake failed: node certificate verification failed: signature veritifcation failed: x509: Ed25519 verification failure"
2024/05/27 23:59:29 INFO: [core] [pick-first-lb 0xc000191260] Received SubConn state update: 0xc0001912f0, {ConnectivityState:TRANSIENT_FAILURE ConnectionError:connection error: desc = "transport: authentication handshake failed: node certificate verification failed: signature veritifcation failed: x509: Ed25519 verification failure"}
2024/05/27 23:59:29 INFO: [core] [Channel #3]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/27 23:59:29 INFO: [core] [Channel #3]Channel Connectivity change to SHUTDOWN
2024/05/27 23:59:29 INFO: [core] [Channel #3]Closing the name resolver
2024/05/27 23:59:29 INFO: [core] [Channel #3]ccBalancerWrapper: closing
2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel Connectivity change to SHUTDOWN
2024/05/27 23:59:29 INFO: [core] [Channel #3 SubChannel #4]Subchannel deleted
2024/05/27 23:59:29 INFO: [core] [Channel #3]Channel deleted

I think this is a larger problem than just this particular scenario.

Generally, it's a problem of detecting and bubbling up persistent errors in the RPC resolver library. Currently we treat every error as transient and just fall back to our retry loop. This is the preferred behavior when the RPC resolver library is used by machines to talk to other machines, where we assume there are no client misconfigurations that should result in permanent errors. But it doesn't work when a user is running a metroctl that may have been misconfigured: the retry loop never ends and the user never sees why.

I think an immediate enhancement would be to bubble up resolver errors to the user (slightly cleaned up) so that they can at least tell that something is going wrong.

Alternatively, there could be a way to configure the resolver (or even the client channel?) in a fail-early mode which interactive tools would use, similar to some of the (deprecated) options in gRPC dialing/calling.

monogon-dev / monogon

metroctl: hangs if CA doesn't match #302