
Teleport 17 Test Plan #48003

Closed r0mant closed 1 day ago

r0mant commented 4 weeks ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run both on a fresh installation of the version to be released and on an upgrade from the previous version of Teleport.

User accounting @atburke

Combinations @Joerger

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @tigrato

Teleport with multiple Kubernetes clusters @tigrato

Note: you can use GKE, EKS, or minikube to run Kubernetes clusters. The only caveat is minikube: it isn't publicly reachable, so don't run a proxy there.

Kubernetes exec via WebSockets/SPDY @tigrato

Usage of WebSockets on the kubectl side can be controlled with the KUBECTL_REMOTE_COMMAND_WEBSOCKETS environment variable: KUBECTL_REMOTE_COMMAND_WEBSOCKETS=true kubectl -v 8 exec -n namespace podName -- /bin/bash --version. At the -v 8 logging level you should see X-Stream-Protocol-Version: v5.channel.k8s.io when kubectl is connected to Teleport over WebSockets. For these tests you'll need kubectl version at least 1.29, one Kubernetes cluster at v1.29 or earlier (doesn't support WebSockets stream protocol v5) and one at v1.30 (supports it by default), and you should access each cluster both through a kube agent and via kubeconfig.
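For convenience, the verification command from the paragraph above as a copy-pasteable sketch (namespace and pod names are placeholders):

# Requires kubectl >= 1.29; forces the WebSocket executor on the kubectl side
KUBECTL_REMOTE_COMMAND_WEBSOCKETS=true kubectl -v 8 exec -n <namespace> <podName> -- /bin/bash --version
# At the -v 8 logging level, a WebSocket connection through Teleport shows:
#   X-Stream-Protocol-Version: v5.channel.k8s.io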

Kubernetes auto-discovery @tigrato

Kubernetes Secret Storage @hugoShaka

Kubernetes Pod RBAC @tigrato

Teleport with FIPS mode @eriktate

ACME @timothyb89

Migrations @timothyb89

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers

GitHub External SSO @greedy52

tctl sso family of commands @Tener

For help with setting up SSO connectors, check out the [Quick GitHub/SAML/OIDC Setup Tips].

tctl sso configure helps construct a valid connector definition.

tctl sso test tests a provided connector definition, which can be loaded from a file or piped in from tctl sso configure or tctl get --with-secrets. Valid connectors are accepted; invalid ones are rejected with sensible error messages.
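A minimal sketch of the flow, assuming a GitHub connector (client ID/secret values, file names, and resource names are illustrative; check tctl sso configure github --help for the exact flags):

# Construct a connector definition and save it to a file
tctl sso configure github --id <client-id> --secret <client-secret> > github.yaml

# Test the definition from the file...
tctl sso test github.yaml

# ...or pipe in an existing connector
tctl get github --with-secrets | tctl sso test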

SSO login on remote host @atburke

tsh should be running on a remote host (e.g. over an SSH session) and use the local browser to complete an SSO login. Run tsh login --callback <remote.host>:<port> --bind-addr localhost:<port> --auth <auth> on the remote host. Note that the --callback URL must be able to resolve to the --bind-addr over HTTPS.
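A concrete sketch with placeholder values; the exact networking setup (public DNS, port forwarding, or a tunnel) doesn't matter as long as the callback URL reaches the bind address over HTTPS:

# On the remote host, over an SSH session (host, port, and connector are placeholders)
tsh login --proxy=example.teleport.sh --auth=github \
  --callback remote.example.com:8443 --bind-addr localhost:8443
# The browser on your local machine completes the SSO flow; requests to
# https://remote.example.com:8443 must end up at the tsh listener bound to
# localhost:8443 on the remote host.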

Teleport Plugins @EdwardDowling @bernardjkim

Teleport Operator @hugoShaka

AWS Node Joining @hugoShaka

Docs

Kubernetes Node Joining @bernardjkim

Azure Node Joining @marcoandredinis

Docs

GCP Node Joining @marcoandredinis

Docs

Cloud Labels @marcoandredinis

Passwordless @codingllama

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/tsh-v16.0.0-alpha.2.pkg).

This section complements "Users -> Managing MFA devices". tsh binaries for each operating system (Linux, macOS, and Windows) must be tested separately for FIDO2 items.

Device Trust @codingllama

Device Trust requires Teleport Enterprise.

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/teleport-ent-v16.0.0-alpha.2-linux-amd64-bin.tar.gz).

Client-side enrollment requires a signed tsh for macOS; make sure to use the tsh binary from tsh.app.

Additionally, Device Trust Web requires Teleport Connect to be installed (device authentication for the Web is handled by Connect).

A simple formula for testing device authorization is:

# Before enrollment.
# Replace with other kinds of access, as appropriate (db, kube, etc)
tsh ssh node-that-requires-device-trust
> ERROR: ssh: rejected: administratively prohibited (unauthorized device)

# Register/enroll the device.
tsh device enroll --current-device
tsh logout; tsh login

# After enrollment
tsh ssh node-that-requires-device-trust
> $

Hardware Key Support @Joerger

Hardware Key Support is an Enterprise feature and is not available for OSS.

You will need a YubiKey 4.3+ to test this feature.

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/teleport-ent-v16.0.0-alpha.2-linux-amd64-bin.tar.gz).

Server Access

This test should be carried out on Linux, macOS, and Windows.

Set auth_service.authentication.require_session_mfa: hardware_key_touch in your cluster auth settings and log in.
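One way to flip that setting is via the dynamic cluster auth preference; a sketch (the static teleport.yaml equivalent under auth_service.authentication also works; verify the exact field against the docs):

# Edit the cluster auth preference and set the requirement
tctl edit cluster_auth_preference
#   spec:
#     require_session_mfa: hardware_key_touch

# Then log out and back in
tsh logout && tsh login --proxy=<proxy> --user=<username>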

HSM Support @nklaassen

Docs

Run the full test suite with each HSM/KMS:

$ make run-etcd # in background shell
$
$ # test YubiHSM
$ yubihsm-connector -d # in a background shell
$ cat /etc/yubihsm_pkcs11.conf
# /etc/yubihsm_pkcs11.conf
connector = http://127.0.0.1:12345
debug
$ TELEPORT_TEST_YUBIHSM_PKCS11_PATH=/usr/local/lib/pkcs11/yubihsm_pkcs11.dylib TELEPORT_TEST_YUBIHSM_PIN=0001password YUBIHSM_PKCS11_CONF=/etc/yubihsm_pkcs11.conf go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_YUBIHSM_PKCS11_PATH=/usr/local/lib/pkcs11/yubihsm_pkcs11.dylib TELEPORT_TEST_YUBIHSM_PIN=0001password YUBIHSM_PKCS11_CONF=/etc/yubihsm_pkcs11.conf TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1 --timeout 20m # this takes ~12 minutes
$
$ # test AWS KMS
$ # log in to AWS locally
$ AWS_ACCOUNT="$(aws sts get-caller-identity | jq -r '.Account')"
$ TELEPORT_TEST_AWS_KMS_ACCOUNT="${AWS_ACCOUNT}" TELEPORT_TEST_AWS_KMS_REGION=us-west-2 go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_AWS_KMS_ACCOUNT="${AWS_ACCOUNT}" TELEPORT_TEST_AWS_KMS_REGION=us-west-2 TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1
$
$ # test AWS CloudHSM
$ # set up the CloudHSM cluster and run this on an EC2 that can reach it
$ TELEPORT_TEST_CLOUDHSM_PIN="<CU_username>:<CU_password>" go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_CLOUDHSM_PIN="<CU_username>:<CU_password>" TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1
$
$ # test GCP KMS
$ # log in to GCP locally
$ TELEPORT_TEST_GCP_KMS_KEYRING=projects/<account>/locations/us-west3/keyRings/<keyring> go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_GCP_KMS_KEYRING=projects/<account>/locations/us-west3/keyRings/<keyring> TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1

Moderated session @eriktate

Create two Teleport users, a moderator and a user. Configure Teleport roles to require that the moderator moderate the user's sessions. Use TELEPORT_HOME to tsh login as the user in one terminal, and the moderator in another.

Ensure the default terminationPolicy of terminate has not been changed.

For each of the following cases, create a moderated session with the user using tsh ssh and join this session with the moderator using tsh join --role moderator:
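A minimal sketch of the two terminals (proxy, usernames, node name, and session ID are placeholders):

# Terminal 1: the user, with an isolated TELEPORT_HOME
export TELEPORT_HOME=~/.tsh-user
tsh login --proxy=proxy.example.com --user=alice
tsh ssh alice@node-requiring-moderation   # waits for a moderator to join

# Terminal 2: the moderator, with a separate TELEPORT_HOME
export TELEPORT_HOME=~/.tsh-moderator
tsh login --proxy=proxy.example.com --user=bob
tsh join --role moderator <session-id>    # session ID is printed in the user's terminal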

Performance @rosstimothy @fspmarshall @espadolini

Scaling Test

Scale up the number of nodes/clusters a few times for each configuration below.

1. Verify that there are no memory/goroutine/file descriptor leaks.
2. Compare the baseline metrics with the previous release to determine if resource usage has increased.
3. Restart all Auth instances and verify that all nodes/clusters reconnect.
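One way to spot-check item 1 and capture baselines for item 2 is to sample the Prometheus metrics, assuming the instances expose a diagnostics address (the flag and port below are illustrative):

# Start (or configure) the instance with a diagnostics address, e.g.
#   teleport start --diag-addr=127.0.0.1:3000
# then sample the relevant metrics before/after scaling and across releases:
curl -s http://127.0.0.1:3000/metrics | \
  grep -E '^(go_goroutines|process_open_fds|go_memstats_heap_inuse_bytes)'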

Perform simulated load testing on non-cloud backends

Perform ansible-like load testing on cloud backends

Perform the following additional scaling tests on a single backend:

Soak Test

Run a 30-minute soak test directly against direct and tunnel nodes, and via label-based matching. Tests should be run against a Cloud tenant.

tsh bench ssh --duration=30m user@direct-dial-node ls
tsh bench ssh --duration=30m user@reverse-tunnel-node ls
tsh bench ssh --duration=30m user@foo=bar ls
tsh bench ssh --duration=30m --random user@foo ls

Concurrent Session Test

Run a concurrent session test that will spawn 5 interactive sessions per node in the cluster:

tsh bench web sessions --max=5000 user ls
tsh bench web sessions --max=5000 --web user ls

Robustness

Teleport with Cloud Providers

AWS @hugoShaka

GCP @marcoandredinis

IBM @hugoShaka

Application Access @gabrielcorado

Database Access @greedy52

Some tests are marked with "covered by E2E test" and are automatically completed by default. In case the E2E test is flaky or disabled, deselect the task for manual testing.

IMPORTANT: for this round of testing, please pick a signature algorithm suite other than the default legacy suite. See RFD 136. @greedy52 @Tener @GavinFrazar
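A sketch of switching the suite before testing (the value names follow RFD 136; confirm the exact field and accepted values against the docs):

tctl edit cluster_auth_preference
#   spec:
#     signature_algorithm_suite: balanced-v1   # instead of the default legacy suite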

TLS Routing @greedy52

r0mant commented 4 weeks ago

Desktop Access @probakowski

Binaries / OS compatibility

Verify that our software runs on the minimum supported OS versions as per https://goteleport.com/docs/installation/#operating-system-support

Windows @ravicious

Azure offers virtual machines with the Windows 10 2016 LTSB image. This image runs on Windows 10 rev. 1607, which is the exact minimum Windows version that we support.

macOS @camscale

Linux @camscale

Machine ID @strideynet @timothyb89

With an SSH node registered to the Teleport cluster:

With a Postgres DB registered to the Teleport cluster:

With a Kubernetes cluster registered to the Teleport cluster:

With a HTTP application registered to the Teleport cluster:

Host users creation @eriktate

Host users creation docs Host users creation RFD

Host users are considered "managed" when they belong to one of the teleport system groups: teleport-system, teleport-keep. Users outside of these groups are considered "unmanaged". Any users in the teleport-static group are also managed, but not considered for role-based host user creation.
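A quick way to check which category an existing host user falls into (the login name is a placeholder):

# Groups the user belongs to: look for teleport-system / teleport-keep / teleport-static
id -nG <login> | tr ' ' '\n' | grep -E '^teleport-(system|keep|static)$'

# Or list the members of a specific Teleport group
getent group teleport-keep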

CA rotations @espadolini

Proxy Peering

Proxy Peering docs

SSH Connection Resumption @fspmarshall

Verify that SSH works, and that resumable SSH is not interrupted across a Teleport Cloud tenant upgrade.

|  | Standard node | Non-resuming node | Peered node | Agentless node |
| --- | --- | --- | --- | --- |
| tsh ssh | [x] | [x] | [x] | [x] |
| tsh ssh --no-resume | [x] | [x] | [x] | [x] |
| Teleport Connect | [x] | [x] | [x] | [x] |
| Web UI (not resuming) | [x] | [x] | [x] | [x] |
| OpenSSH (standard tsh config) | [x] | [x] | [x] | [x] |
| OpenSSH (changing ProxyCommand to tsh proxy ssh --no-resume) | [x] | [x] | [x] | [x] |
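For the two OpenSSH rows above, the standard configuration can be generated with tsh config; the --no-resume row only tweaks the generated ProxyCommand (a sketch, and the exact generated config depends on your cluster):

# Generate the standard OpenSSH configuration used by the "standard tsh config" row
tsh config >> ~/.ssh/config

# For the "--no-resume" row, change the generated ProxyCommand so that it runs:
#   tsh proxy ssh --no-resume %r@%h:%p
ssh user@node.example.com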

Verify that SSH works, and that resumable SSH is not interrupted across a control plane restart (of either the root or the leaf cluster).

|  | Tunnel node | Direct dial node |
| --- | --- | --- |
| tsh ssh | [x] | [x] |
| tsh ssh --no-resume | [x] | [x] |
| tsh ssh (from a root cluster) | [x] | [x] |
| tsh ssh --no-resume (from a root cluster) | [x] | [x] |
| OpenSSH (without ProxyCommand) | n/a | [x] |
| OpenSSH's ssh-keyscan | n/a | [x] |

EC2 Discovery @marcoandredinis

EC2 Discovery docs

Azure Discovery @marcoandredinis

Azure Discovery docs

GCP Discovery @atburke

GCP Discovery docs

IP Pinning @strideynet

Add a role with pin_source_ip: true (requires Enterprise) to test IP pinning. Testing will require changing your IP (the one the Teleport Proxy sees). Docs: IP Pinning
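A minimal sketch of such a role (the role name and allow rules are placeholders):

cat > pinned-role.yaml <<'EOF'
kind: role
version: v7
metadata:
  name: pinned-access
spec:
  options:
    pin_source_ip: true
  allow:
    logins: ['{{internal.logins}}']
EOF
tctl create pinned-role.yaml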

Identity @smallinsky

Teleport SAML Identity Provider @flyinghermit

Verify SAML IdP service provider resource management.

Docs:

Manage Service Provider (SP)

SAML service provider catalog

Resources

Quick GitHub/SAML/OIDC Setup Tips

greedy52 commented 3 weeks ago
eriktate commented 3 weeks ago
Joerger commented 3 weeks ago
Joerger commented 3 weeks ago
GavinFrazar commented 3 weeks ago
GavinFrazar commented 3 weeks ago
bernardjkim commented 3 weeks ago
bernardjkim commented 3 weeks ago

- https://github.com/gravitational/teleport/issues/48331

gabrielcorado commented 3 weeks ago
atburke commented 3 weeks ago
nklaassen commented 3 weeks ago
Tener commented 2 weeks ago

Database Access load test (PostgreSQL and MySQL)

Setup

Region: eu-central-1

EKS with a single node group:

Teleport cluster (all deployed on the EKS cluster):

Databases:

Note: Databases were configured using discovery running inside the database agent.

tsh bench commands were executed inside the cluster.

MySQL

10 connections/second (90th Percentile 70ms)

![Image](https://github.com/user-attachments/assets/6e8e6093-bdb3-4ed9-933a-5cfce6a2b6a8)
![Image](https://github.com/user-attachments/assets/cd84d061-5442-4b4f-bd2a-c5bfe26f36cc)
![Image](https://github.com/user-attachments/assets/b2bee606-0c50-4b11-ae88-a9657f6058c5)
![Image](https://github.com/user-attachments/assets/1c623b74-c4e6-449d-80a8-7ad577aca71a)

```
# tsh bench mysql mysql-proxy-rdsproxy --db-user=teleport --db-name=mysql --rate=10 --duration=30m
* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         58 ms
50         61 ms
75         64 ms
90         70 ms
95         75 ms
99         90 ms
100        1854 ms
```

50 connections/second (90th Percentile 70ms)

![Image](https://github.com/user-attachments/assets/5aa89f83-b0a6-490a-9379-933de7b5c0e7)
![Image](https://github.com/user-attachments/assets/5e54645d-2553-44bf-82ac-649c1dd8c923)
![Image](https://github.com/user-attachments/assets/8909a9d5-e6dc-4a84-965f-6d55311f7d20)
![Image](https://github.com/user-attachments/assets/f1cd63dc-dac4-4db9-bca9-d868dfb3af22)

```
# tsh bench mysql mysql-proxy-rdsproxy --db-user=teleport --db-name=mysql --rate=50 --duration=30m
* Requests originated: 89998
* Requests failed: 2637
* Last error: handleAuthResult: readAuthResult: ERROR 1105 (HY000): failed to connect to any of the database servers

Histogram

Percentile Response Duration
---------- -----------------
25         56 ms
50         58 ms
75         62 ms
90         70 ms
95         76 ms
99         103 ms
100        3087 ms
```

PostgreSQL

10 connections/second (90th Percentile 87ms)

![Image](https://github.com/user-attachments/assets/7b4e3569-b732-4de4-8309-b536f5d21d16)
![Image](https://github.com/user-attachments/assets/ae7c8c7f-912a-409f-b24b-76766bf3eefe)
![Image](https://github.com/user-attachments/assets/5b29d68f-2f68-480b-aba9-3376981915ec)
![Image](https://github.com/user-attachments/assets/8cc88974-2f24-405f-9d6d-1271a13ef49e)

```
# tsh bench postgres postgres-proxy-rdsproxy --db-user=teleport --db-name=postgres --rate=10 --duration=30m
* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         70 ms
50         74 ms
75         80 ms
90         87 ms
95         93 ms
99         127 ms
100        3661 ms
```

50 connections/second (90th Percentile 90ms)

![Image](https://github.com/user-attachments/assets/388fcd91-a295-42d2-91a1-c52f673e93dc)
![Image](https://github.com/user-attachments/assets/375942e0-6021-4300-948f-83b1c73cf86a)
![Image](https://github.com/user-attachments/assets/d36da080-8f46-45fc-9c32-91d8ee60f858)
![Image](https://github.com/user-attachments/assets/e5f305ee-4616-4a79-9521-5dbca0545abe)

```
# tsh bench postgres postgres-proxy-rdsproxy --db-user=teleport --db-name=postgres --rate=50 --duration=30m
* Requests originated: 89997
* Requests failed: 2518
* Last error: failed to connect to `host=127.0.0.1 user=teleport database=postgres`: failed to receive message (unexpected EOF)

Histogram

Percentile Response Duration
---------- -----------------
25         71 ms
50         74 ms
75         81 ms
90         90 ms
95         100 ms
99         152 ms
100        3133 ms
```
greedy52 commented 2 weeks ago

(Doesn't seem to be a regression; likely broken in the last few versions.)

gabrielcorado commented 2 weeks ago
espadolini commented 2 weeks ago

Performance: ansible-like load on dynamodb on cloud

Setup

Three proxies on tenant4x nodes in each of usw2, use1, and euc1; two auths on tenant8x nodes in euc1; the cluster configured for a higher number of incoming connections; no SNS for audit logs.

Three EKS clusters in usw2, use1, euc1, each with 64 m5.8xlarge nodes running 20k agents each, joining with a static token.

Two m7a.48xlarge runners in usw2 and euc1 running tbot and ssh (the ansible-like load), connecting to all 60k nodes and running 4 commands every 360 seconds on average (see assets/loadtest/ansible-like), so about 1300 new sessions per second. Nodes were referenced by hostname but tbot was configured to use proxy templates and nodes were searched by predicate.

The agents were spun up and left alone for about 15 minutes, then the sessions were started and ran for about 30 minutes.

Problems

From a cold start (an unused Cloud staging tenant), the DynamoDB table in use took a while to scale internally, with throttling that lasted for 5 or 10 minutes; no problems after that.

The ansible-like setup isn't capable of handling new or dropped agents, and in a first attempt (with clusters running 40 nodes instead of 64) some agents went missing because the Kubernetes node they were on died. A new set of agents was then spun up with GOMAXPROCS set to 2 and the memory request and limit set to 350Mi, which fit with a bit of extra headroom on 64 nodes and resulted in no more errors.

Results

![Image](https://github.com/user-attachments/assets/eb93b015-eec8-4d3d-9ba9-56971cab3fd5) ![Image](https://github.com/user-attachments/assets/35f6d742-e09a-4c84-9de6-620d8c23b8fa) ![Image](https://github.com/user-attachments/assets/a0af1f50-c935-44ea-b2a6-c2dcd95696c9) ![Image](https://github.com/user-attachments/assets/5b4c2c35-7567-41a6-8371-0879b4c0fb95) ![Image](https://github.com/user-attachments/assets/302a4e9d-6a72-46fa-8e71-e14ec39be1e5)
rosstimothy commented 2 weeks ago

Performance Test Results

etcd[^1]

30k Resources ![backend](https://github.com/user-attachments/assets/a55c526b-3cdc-4d80-b421-5b278ba0f3f0) ![instance](https://github.com/user-attachments/assets/266c6f2b-0a2b-49c8-bb44-a7ff3cecdff4)
500 Trusted Clusters ![500 TC](https://github.com/user-attachments/assets/c3164599-ea09-4d9e-9b6e-2e013b1ab7b8)

Postgres[^1]

30k Resources ![backend](https://github.com/user-attachments/assets/8c4e3c4c-b67d-4b5f-93c8-886cc53489a5) ![instance](https://github.com/user-attachments/assets/009db5c5-7e06-462e-b18a-8e464d8e9cdf)

Firestore[^1]

30k Resources ![backend](https://github.com/user-attachments/assets/6a05406c-c297-4cb5-83c8-96ba7169b2d5) ![instance](https://github.com/user-attachments/assets/7c808d92-fa7d-42d8-944c-77dde8522377)

[^1]: 30k tests were performed using the simulated method described in the v14 Test Plan

ravicious commented 2 weeks ago
hugoShaka commented 2 weeks ago

SSO MFA ceremony breaks tctl on auth-only clusters: https://github.com/gravitational/teleport/issues/48633

greedy52 commented 2 weeks ago
hugoShaka commented 2 weeks ago
fspmarshall commented 1 week ago

Performance: ansible-like load on multi-region crdb cloud

Setup

Three proxies on tenant4x nodes in each of usw2, use1, euc1. Four auths on tenant8x nodes, two in usw2 and two in euc1.

Three EKS clusters in usw2, use1, euc1, each with 50 m5.8xlarge nodes running 20k agents each, joining with a static token.

Two m7a.48xlarge runners in usw2 and euc1 running tbot and ssh (the ansible-like load), connecting to all 60k nodes and running 4 commands every 360 seconds on average (see assets/loadtest/ansible-like), so about 1300 new sessions per second. Nodes were referenced by hostname but tbot was configured to use proxy templates and nodes were searched by predicate.

The agents were spun up and left alone for a few minutes. The first set of 60k sessions was started, the cluster was left to stabilize for a bit, then the second set was started.

Problems

The initial attempt appeared to overload some element of the cloud-staging networking stack, resulting in a large number of failed connections with no apparent Teleport-originating cause. Subsequent attempts succeeded.

Results

![Image](https://github.com/user-attachments/assets/7e9904b9-395e-41ec-9943-7b58d6a591dd) ![Image](https://github.com/user-attachments/assets/3ff8c2da-a953-4666-8e46-c3131bff2d65) ![Image](https://github.com/user-attachments/assets/69c4445e-a192-453a-8f34-44a1666287ef) ![Image](https://github.com/user-attachments/assets/50236469-0185-472d-8854-e70245ae5cba) ![Image](https://github.com/user-attachments/assets/c51717eb-5f20-4eec-8ea9-d7c2c4d01462)
hugoShaka commented 1 week ago

non-blocking: IBM docs are out of date