gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Machine ID: Trusted Cluster Support #13792

Closed · strideynet closed this issue 1 year ago

strideynet commented 2 years ago

As it stands, tbot only supports generating configuration and fetching certificates for the cluster it is directly connected to. Some customers will want to allow Machine ID to access resources in leaf clusters of that cluster.

We should seek to identify customers who have a need for leaf cluster access to validate some questions before embarking on this work.

Questions we need to answer:

It is entirely possible that using tsh with a tbot-provided identity gives all the functionality needed, but we should validate this and document it as the supported method for accessing leaf clusters.

Current state of affairs

I've tested Machine ID with SSH against a Teleport Node in a leaf cluster, with two variations:

In my examples:

- the root cluster's proxy is teleport.local.ottr.sh
- the leaf cluster is leaf.local.ottr.sh
- the target Node in the leaf cluster is leaf.leaf.local.ottr.sh
- the login principal is noahstride

Using tsh ssh

This works out of the box. Assuming your bot's destination directory is /dest and your bot is configured to output an identity file, you can do the following:

tsh -i /dest/identity --proxy teleport.local.ottr.sh ssh noahstride@leaf.leaf.local.ottr.sh --cluster=leaf.local.ottr.sh
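For reference, here is a minimal sketch of the tbot configuration assumed above (not taken verbatim from this thread; exact fields and template names vary between tbot versions, and the join token is a placeholder):

# tbot.yaml - illustrative sketch only
auth_server: "teleport.local.ottr.sh:443"
onboarding:
  join_method: "token"
  token: "<bot-join-token>"          # placeholder
storage:
  directory: /var/lib/teleport/bot
destinations:
  - directory: /dest
    # request an identity file and ssh client config for this destination
    configs:
      - identity
      - ssh_client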

I also noticed that the role your tbot assumes in the root cluster must include, within its logins, the principal you are using to log into the node in the leaf cluster. That principal must also be present in the leaf cluster role's logins, or the role must use {{internal.logins}}.
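For illustration, the role pair might look like the following sketch (role names and labels are placeholders, not taken from this thread). The root cluster role assumed by the bot:

kind: role
version: v5
metadata:
  name: bot-ssh-access   # placeholder name
spec:
  allow:
    # must include the principal used to log into the leaf node
    logins: ["noahstride"]
    node_labels:
      "*": "*"

And the leaf cluster role it is mapped to via the trusted cluster's role_map:

kind: role
version: v5
metadata:
  name: leaf-ssh-access   # placeholder name
spec:
  allow:
    # repeat the principal, or inherit the root cluster's logins
    logins: ["{{internal.logins}}"]
    node_labels:
      "*": "*"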

Using openssh

This is a little more difficult, as the ssh_config generated by tbot doesn't include rules that will match hosts in leaf clusters. I was able to manually craft an ssh_config that did work, though:

Host *.leaf.local.ottr.sh leaf.local.ottr.sh
    UserKnownHostsFile "/Users/noahstride/tbot/known_hosts"
    IdentityFile "/Users/noahstride/tbot/key"
    CertificateFile "/Users/noahstride/tbot/sshcert"
    HostKeyAlgorithms ssh-rsa-cert-v01@openssh.com
    PubkeyAcceptedAlgorithms +ssh-rsa-cert-v01@openssh.com

Host *.leaf.local.ottr.sh !leaf.local.ottr.sh
    Port 3022
    ProxyCommand "tbot" proxy --destination-dir=/Users/noahstride/tbot --proxy=teleport.local.ottr.sh ssh  %r@%h:%p

The known_hosts includes the host CA for the root cluster, but not for leaf clusters. This means that the user will be prompted to trust the host key on first connection.
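One possible workaround (a sketch, not verified in this thread): export the leaf cluster's host CA with tctl, running against the leaf cluster, and append it to the generated known_hosts so leaf host certs can be verified:

# run with credentials for the leaf cluster; output is in known_hosts
# @cert-authority format
tctl auth export --type=host >> /Users/noahstride/tbot/known_hosts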

You may notice that the --cluster directive is omitted from the ProxyCommand. When it is set to the leaf cluster, tsh emits the following error:

ERROR: key for {ProxyHost:teleport.local.ottr.sh Username:bot-robot1 ClusterName:leaf.local.ottr.sh} not found

ERROR: unable to execute tsh
executing `tsh proxy`
exit status 1

It seems odd to me that this is emitted by tsh proxy ssh but not tsh ssh. I need to look into what's causing this; however, as long as the --cluster directive is dropped, the rest of this works as expected.

Summary

This essentially gives us the following tasks for improving tbot for use with OpenSSH and leaf clusters:

strideynet commented 2 years ago

Some thoughts I had whilst writing this

Users will need a way to configure that they want certificates/config generated for a leaf cluster. My current suggestion is that we add a Cluster field to the DestinationConfig, allowing the user to specify that they want configuration generated for a specific cluster; where this field is not specified, we should fall back to the cluster that tbot is directly connected to. My concern is that this may be quite verbose for users who want to configure access to a large number of clusters. We may need to identify customers who need Machine ID leaf cluster support and see if this will serve their needs.
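A purely hypothetical sketch of that suggestion (the Cluster field does not exist today; paths and names are placeholders):

destinations:
  - directory: /opt/machine-id-leaf
    cluster: leaf.example.com   # hypothetical: generate certs/config for this leaf
  - directory: /opt/machine-id
    # no cluster field: fall back to the directly connected cluster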

If we are going to generate configuration for leaf clusters, we will need to monitor host CA rotations in those leaf clusters and ensure that we keep the configured destinations up to date. Scalability will be a concern here: some customers have thousands of leaf clusters, and we want to ensure that we do not renew unnecessarily when something changes in a single leaf cluster.
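As a rough sketch of the kind of monitoring involved, the Teleport API client can watch cert authority events in the cluster it is connected to (assumptions: the api/client package of this era and an identity file produced by tbot; this is illustrative, not the implementation proposed here):

package main

import (
    "context"
    "log"

    "github.com/gravitational/teleport/api/client"
    "github.com/gravitational/teleport/api/types"
)

func main() {
    ctx := context.Background()
    clt, err := client.New(ctx, client.Config{
        Addrs: []string{"teleport.local.ottr.sh:443"},
        Credentials: []client.Credentials{
            client.LoadIdentityFile("/dest/identity"),
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    defer clt.Close()

    // Subscribe to cert authority events; a rotation surfaces as an OpPut.
    watcher, err := clt.NewWatcher(ctx, types.Watch{
        Kinds: []types.WatchKind{{Kind: types.KindCertAuthority}},
    })
    if err != nil {
        log.Fatal(err)
    }
    defer watcher.Close()

    for event := range watcher.Events() {
        if event.Type == types.OpPut {
            // Here tbot would re-render known_hosts in affected destinations.
            log.Printf("cert authority updated: %v", event.Resource.GetName())
        }
    }
}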

anurag-work commented 1 year ago

@strideynet, we have a root cluster at the region level (US/EU), and all the data centers within that region are leaf clusters. The users log into the root proxy and generate credentials. They then use those credentials to access leaf VMs via the leaf proxy. This allows users to log in to only 2 clusters and get access to all data centers (<25).

To use Machine ID, we need a similar flow that works with OpenSSH, since it's going to be used by various CI/CD tools and it's not possible to use tsh for everything or to log into several clusters.

strideynet commented 1 year ago

Hey @anurag-work, we've had a quick look at this and we believe it should be possible to connect directly to OpenSSH servers set up against leaf clusters using credentials from a Machine ID instance configured against the root cluster. Please let us know if you experience any issues with this, and we can take a look.

anurag-work commented 1 year ago

@strideynet I am unable to connect to a leaf node using the certs generated by Machine ID, which is configured against the root cluster. I've tried a few combinations of the tbot proxy command to match what happens when I connect using certs generated by tsh. Even though the tsh proxy command matches, the session using the tbot cert gets a permission denied error.

tbot proxy commands that I tried

    ProxyCommand "/usr/local/bin/tbot" -d proxy --destination-dir=/Users/anurag/tbot-user --proxy=ctf01-teleport-proxy.company.com --cluster=leaf-cluster ssh %r@%h:%p
    ProxyCommand "/usr/local/bin/tbot" -d proxy --destination-dir=/Users/anurag/tbot-user --proxy=ctf01-teleport-proxy.company.com ssh --cluster=leaf-cluster %r@%h:%p

tsh proxy ssh command, for comparison

  ProxyCommand "/Volumes/Data/usr/local/bin/tsh" -d proxy ssh --cluster=leaf-cluster --proxy=ctf01-teleport-proxy.company.com %r@%h:%p
strideynet commented 1 year ago

tsh proxy ssh doesn't work with identity files when TLS routing is disabled: https://github.com/gravitational/teleport/issues/15190
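For reference, TLS routing is enabled on the cluster side roughly like this (a sketch assuming the v2 teleport.yaml format):

version: v2
auth_service:
  proxy_listener_mode: multiplex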

strideynet commented 1 year ago

Extremely rough notes

The cluster may be set from the Issuer of the certificate, and the key lookup searches by --cluster, which is also reused by the SSH proxy subsystem to indicate the target cluster:

// GetKey returns the user's key including the specified certs.
func (s *MemLocalKeyStore) GetKey(idx KeyIndex, opts ...CertOption) (*Key, error) {
    var key *Key
    if idx.ClusterName == "" {
        // If clusterName is not specified then the cluster-dependent fields
        // are not considered relevant and we may simply return any key
        // associated with any cluster name whatsoever.
        for _, found := range s.inMem[idx.ProxyHost][idx.Username] {
            key = found
            break
        }
    } else {
        key = s.inMem[idx.ProxyHost][idx.Username][idx.ClusterName]
    }
    // ...
}

And, from the proxy subsystem request, where clusterName names the target cluster:

    sshUserHost := fmt.Sprintf("%s:%s", sp.targetHost, sp.targetPort)
    if err = sess.RequestSubsystem(ctx, proxySubsystemName(sshUserHost, sp.clusterName)); err != nil {
        return trace.Wrap(err)
    }

Hence, an identity file generated in a root cluster cannot be used by tsh proxy ssh to connect to a leaf cluster.

func makeClientForProxy(cf *CLIConf, proxy string, useProfileLogin bool) (*client.TeleportClient, error) {

Potential work-around for Proxies with ssh proxy port exposed

# Begin generated Teleport configuration for root.tele.ottr.sh by tbot

# Common flags for all root.tele.ottr.sh hosts
Host *.root.tele.ottr.sh root.tele.ottr.sh
    UserKnownHostsFile "/Users/noahstride/tbot-openssh-test/known_hosts"
    IdentityFile "/Users/noahstride/tbot-openssh-test/key"
    CertificateFile "/Users/noahstride/tbot-openssh-test/key-cert.pub"
    HostKeyAlgorithms ssh-rsa-cert-v01@openssh.com
    PubkeyAcceptedAlgorithms +ssh-rsa-cert-v01@openssh.com

Host *.leaf.tele.ottr.sh
    UserKnownHostsFile "/Users/noahstride/tbot-openssh-test/known_hosts"
    IdentityFile "/Users/noahstride/tbot-openssh-test/key"
    CertificateFile "/Users/noahstride/tbot-openssh-test/key-cert.pub"
    HostKeyAlgorithms ssh-rsa-cert-v01@openssh.com
    PubkeyAcceptedAlgorithms +ssh-rsa-cert-v01@openssh.com
    Port 3022
    ProxyCommand ssh -F ~/path/back/to/this/config/file -p 3023 %r@root.tele.ottr.sh -s proxy:%h:%p@leaf.tele.ottr.sh
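Assuming the file above is saved at the path its own ProxyCommand references (the node name below is a placeholder), usage is then plain ssh:

ssh -F ~/path/back/to/this/config/file noahstride@node1.leaf.tele.ottr.sh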

Essentially: the ProxyCommand dials the root proxy's SSH port (3023) directly with plain ssh and requests the proxy:%h:%p@leaf.tele.ottr.sh subsystem, which asks the Teleport proxy to dial the target host in the named leaf cluster. No tsh or tbot wrapper is needed on the connection path.

Limitation: the TOFU (trust-on-first-use) prompt still shows. It would not show if we provided the leaf cluster host certs.

Skipping tbot wrapper

Host *.leaf.tele.ottr.sh !leaf.tele.ottr.sh
    Port 3022
    ProxyCommand "/Users/noahstride/code/gravitational/teleport/build/tsh" proxy -d --identity=/Users/noahstride/code/gravitational/teleport/tbot-user/identity --proxy=root.tele.ottr.sh ssh --cluster=leaf.tele.ottr.sh  %r@%h:%p

ERROR REPORT:
Original Error: *trace.NotFoundError key for {ProxyHost:root.tele.ottr.sh Username:bot-ssh-test ClusterName:leaf.tele.ottr.sh} not found
Stack Trace:
        github.com/gravitational/teleport/lib/client/keystore.go:974 github.com/gravitational/teleport/lib/client.(*MemLocalKeyStore).GetKey
        github.com/gravitational/teleport/lib/client/keyagent.go:337 github.com/gravitational/teleport/lib/client.(*LocalKeyAgent).GetKey
        github.com/gravitational/teleport/lib/client/keyagent.go:677 github.com/gravitational/teleport/lib/client.(*LocalKeyAgent).ClientCertPool
        github.com/gravitational/teleport/tool/tsh/proxy.go:261 main.dialSSHProxy
        github.com/gravitational/teleport/tool/tsh/proxy.go:184 main.sshProxy
        github.com/gravitational/teleport/tool/tsh/proxy.go:79 main.onProxyCommandSSH.func1
        github.com/gravitational/teleport/lib/client/api.go:731 github.com/gravitational/teleport/lib/client.RetryWithRelogin
        github.com/gravitational/teleport/tool/tsh/proxy.go:67 main.onProxyCommandSSH
        github.com/gravitational/teleport/tool/tsh/tsh.go:1036 main.Run
        github.com/gravitational/teleport/tool/tsh/tsh.go:448 main.main
        runtime/proc.go:250 runtime.main
        runtime/asm_arm64.s:1172 runtime.goexit
User Message: key for {ProxyHost:root.tele.ottr.sh Username:bot-ssh-test ClusterName:leaf.tele.ottr.sh} not found
eric-belhomme commented 1 year ago

> Some thoughts I had whilst writing this
>
> Users will need a way to configure that they want certificates/config generated for a leaf cluster. My current suggestion is that we add a Cluster field to the DestinationConfig, allowing the user to specify that they want configuration generated for a specific cluster [...]

That sounds good to me, as long as we can specify the list of leaf clusters as a regex pattern or any similar wildcard! In my use case, the leaf clusters don't necessarily exist yet, but will in the future. Their naming convention ensures they'll be clearly identified, and I don't expect to have to touch the Machine ID config each time a new cluster is created ;)

strideynet commented 1 year ago

It has occurred to me that the behaviour of tsh config is to output configuration for all leaf clusters.

I think it might be nice to match this behaviour as it meets the needs of most users, and reduces the amount of configuration overhead. If this proves to be problematic from a performance point of view, we can consider introducing the ability to filter which of them should be included in the configuration - and users with a larger number of leaf clusters can opt into this filtering.

strideynet commented 1 year ago

@Joerger's recent modifications to the tsh store seem to have fixed the following error

ERROR: key for {ProxyHost:teleport.local.ottr.sh Username:bot-robot1 ClusterName:leaf.local.ottr.sh} not found

ERROR: unable to execute tsh
executing `tsh proxy`
exit status 1

I can confirm this fix is present from v12.0.0 onwards - older tsh and tbot clients will not be supported.

This means that the remaining work in #23368 resolves this issue entirely.