Open fspmarshall opened 1 year ago
A related naming issue of note is that throughout lib/web
(as well as potentially other places) we're using an interface reversetunnel.RemoteSite
which is neither guaranteed to be a reverse tunnel nor a remote site (leaf cluster). It's really being used as an abstraction for a Teleport cluster, which can be either the local site (root cluster) in which case connections are not typically made through a reverse tunnel (afaik, I may be wrong about this), or else it's a remote site (leaf cluster) in which case connections are all made through a reverse tunnel.
The
lib/reversetunnel
package contains a lot of outdated and/or ambiguous terminology, and some fairly confusing log/error messages. This often causes confusion and wasted time during debugging/support. Addressing some of the below issues would be very helpful when reversetunnel issues come up:A lot of log/error messages can be improved, but one in particular has caused trouble recently: when dialing a leaf cluster via a root proxy, if no tunnels to the leaf cluster are held, the proxy ends up emitting a log line like this:
this proxy clusterPeer(TunnelConnection(name=<proxy-id>-<leaf-cluster-name>, type=, cluster=<leaf-cluster-name>, proxy=<proxy-id>)) has not been discovered yet
. Theproxy-id
listed in this message is a random proxy ID from the root cluster, not the ID of the proxy actually emitting the error log. In fact, that is typically a proxy ID of a proxy that is healthy. This often causes people to start looking for the problem on the wrong proxy. A message likeproxy <id-of-self> has no tunnels to leaf cluster <cluster-name>
would be much more useful.The
clusterPeer
type's purpose is very non-obvious from its name and comments. Especially, since the term "peer" is generally used in the context of one proxy dialing another proxy within the same cluster. This type's actual purpose is to be a placeholder representing a connection to a leaf cluster that has yet to register any tunnels, but looking at the code, and at the logs it produces, this can be non-obvious. Furthermore, multiple cluster peers are created based on the number of other proxies within the root cluster that do have connections to that leaf cluster, making the cluster peer type appear more complex/important than it really is when reading the code. Renaming it to something likeexpectedLeafCluster
, and reworking its errors and how it appears in log lines would be helpful.The root/leaf terminology used in describing cluster relationships is absent from much of
lib/reversetunnel
. A lot of types/comments/messages could benefit from naming reworks to better explain exactly what they are using the current terminology (sites
->clusters
,remote site
->leaf cluster
,cluster peer
->expected leaf cluster
,remote access point
->leaf cluster cache
, etc...).Reversetunnel resources leave their tunnel type as empty string if they represent a tunnel from a leaf proxy to a root proxy, which causes confusion when attempting to debug issues by listing active tunnel resources.
In addition to the above fixes, I think a periodic INFO-level log summarizing the current cluster connectivity status would be helpful. Something like:
It may also be useful to have a WARN-level log if a given leaf cluster is observed to have active tunnels to other proxies within the current cluster but not have any tunnels locally for some period of time. E.g.:
Care should be taken to avoid over-using this log, but if no tunnels for a given leaf are observed for more than the expiry time of the active peer tunnels, it is likely worth emitting it.