gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 16 Test Plan #42118

Closed r0mant closed 3 months ago

r0mant commented 4 months ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh installation of the version to be released as well as an upgrade of the previous version of Teleport.

User accounting @atburke

Combinations @Joerger

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @AntonAM

Teleport with multiple Kubernetes clusters @tigrato

Note: you can use GKE or EKS or minikube to run Kubernetes clusters. Minikube is the only caveat - it's not reachable publicly so don't run a proxy there.

Kubernetes exec via WebSockets/SPDY @AntonAM

To control whether kubectl uses WebSockets, set the KUBECTL_REMOTE_COMMAND_WEBSOCKETS environment variable: KUBECTL_REMOTE_COMMAND_WEBSOCKETS=true kubectl -v 8 exec -n namespace podName -- /bin/bash --version. At the -v 8 logging level, X-Stream-Protocol-Version: v5.channel.k8s.io appears in the output when kubectl is connected to Teleport over WebSockets. The tests require kubectl 1.29 or later, a Kubernetes cluster at v1.29 or earlier (which does not support WebSocket stream protocol v5), and a cluster at v1.30 (which supports it by default). Access each cluster both through a kube agent and through kubeconfig.
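The header check described above can be scripted. The following is a minimal sketch, not part of the test plan: `detect_stream_protocol` is a hypothetical helper name, and the `other` label simply means the v5 header was not seen in the log.

```bash
# Hypothetical helper: reads `kubectl -v 8` log output on stdin and reports
# whether the v5 (WebSocket) stream protocol header was seen.
detect_stream_protocol() {
  if grep -q 'X-Stream-Protocol-Version: v5\.channel\.k8s\.io'; then
    echo "websockets"
  else
    echo "other"
  fi
}

# Example usage (namespace and podName are placeholders). kubectl writes its
# verbose logs to stderr, hence the 2>&1:
#   KUBECTL_REMOTE_COMMAND_WEBSOCKETS=true kubectl -v 8 exec -n namespace podName \
#     -- /bin/bash --version 2>&1 | detect_stream_protocol
```

Running it once per cluster version and access path gives a quick record of which combinations negotiated WebSockets.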

Kubernetes auto-discovery @AntonAM

Kubernetes Secret Storage @AntonAM

Kubernetes Pod RBAC @AntonAM

Teleport with FIPS mode @bl-nero

ACME @bl-nero

Migrations @tigrato

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers

GitHub External SSO @capnspacehook

tctl sso family of commands @Tener

For help with setting up sso connectors, check out the [Quick GitHub/SAML/OIDC Setup Tips]

tctl sso configure helps to construct a valid connector definition:

tctl sso test tests a provided connector definition, which can be loaded from a file or piped in from tctl sso configure or tctl get --with-secrets. Valid connectors are accepted; invalid ones are rejected with sensible error messages.

SSO login on remote host @atburke

tsh should be running on a remote host (e.g. over an SSH session) and use the local browser to complete an SSO login. Run tsh login --callback <remote.host>:<port> --bind-addr localhost:<port> --auth <auth> on the remote host. Note that the --callback URL must be able to resolve to the --bind-addr over HTTPS.

Teleport Plugins @EdwardDowling

Teleport Operator @hugoShaka

AWS Node Joining @hugoShaka

Docs

Kubernetes Node Joining @hugoShaka

Azure Node Joining @marcoandredinis

Docs

GCP Node Joining @marcoandredinis

Docs

Cloud Labels @atburke

Passwordless @codingllama

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/tsh-v16.0.0-alpha.2.pkg).

This section complements "Users -> Managing MFA devices". tsh binaries for each operating system (Linux, macOS and Windows) must be tested separately for FIDO2 items.

Device Trust @codingllama

Device Trust requires Teleport Enterprise.

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/teleport-ent-v16.0.0-alpha.2-linux-amd64-bin.tar.gz).

Client-side enrollment requires a signed tsh for macOS; make sure to use the tsh binary from tsh.app.

Additionally, Device Trust Web requires Teleport Connect to be installed (device authentication for the Web is handled by Connect).

A simple formula for testing device authorization is:

# Before enrollment.
# Replace with other kinds of access, as appropriate (db, kube, etc)
tsh ssh node-that-requires-device-trust
> ERROR: ssh: rejected: administratively prohibited (unauthorized device)

# Register/enroll the device.
tsh device enroll --current-device
tsh logout; tsh login

# After enrollment
tsh ssh node-that-requires-device-trust
> $

Hardware Key Support @Joerger

Hardware Key Support is an Enterprise feature and is not available for OSS.

You will need a YubiKey 4.3+ to test this feature.

This feature has additional build requirements, so it should be tested with a pre-release build (eg: https://cdn.teleport.dev/teleport-ent-v16.0.0-alpha.2-linux-amd64-bin.tar.gz).

Server Access

This test should be carried out on Linux, macOS, and Windows.

Set auth_service.authentication.require_session_mfa: hardware_key_touch in your cluster auth settings and log in.

HSM Support @nklaassen

Docs

Run the full test suite with each HSM/KMS:

$ make run-etcd # in background shell
$
$ # test YubiHSM
$ yubihsm-connector -d # in a background shell
$ cat /etc/yubihsm_pkcs11.conf
# /etc/yubihsm_pkcs11.conf
connector = http://127.0.0.1:12345
debug
$ TELEPORT_TEST_YUBIHSM_PKCS11_PATH=/usr/local/lib/pkcs11/yubihsm_pkcs11.dylib TELEPORT_TEST_YUBIHSM_PIN=0001password YUBIHSM_PKCS11_CONF=/etc/yubihsm_pkcs11.conf go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_YUBIHSM_PKCS11_PATH=/usr/local/lib/pkcs11/yubihsm_pkcs11.dylib TELEPORT_TEST_YUBIHSM_PIN=0001password YUBIHSM_PKCS11_CONF=/etc/yubihsm_pkcs11.conf TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1 --timeout 20m # this takes ~12 minutes
$
$ # test AWS KMS
$ # log in to AWS locally
$ AWS_ACCOUNT="$(aws sts get-caller-identity | jq -r '.Account')"
$ TELEPORT_TEST_AWS_KMS_ACCOUNT="${AWS_ACCOUNT}" TELEPORT_TEST_AWS_REGION=us-west-2 go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_AWS_KMS_ACCOUNT="${AWS_ACCOUNT}" TELEPORT_TEST_AWS_REGION=us-west-2 TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1
$
$ # test AWS CloudHSM
$ # set up the CloudHSM cluster and run this on an EC2 that can reach it
$ TELEPORT_TEST_CLOUDHSM_PIN="<CU_username>:<CU_password>" go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_CLOUDHSM_PIN="<CU_username>:<CU_password>" TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1
$
$ # test GCP KMS
$ # log in to GCP locally
$ TELEPORT_TEST_GCP_KMS_KEYRING=projects/<account>/locations/us-west3/keyRings/<keyring> go test ./lib/auth/keystore -v --count 1
$ TELEPORT_TEST_GCP_KMS_KEYRING=projects/<account>/locations/us-west3/keyRings/<keyring> TELEPORT_ETCD_TEST=1 go test ./integration/hsm -v --count 1

Moderated session @rosstimothy

Create two Teleport users, a moderator and a user. Configure Teleport roles to require that the moderator moderate the user's sessions. Use TELEPORT_HOME to tsh login as the user in one terminal, and the moderator in another.
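Keeping the two identities apart with TELEPORT_HOME can be wrapped in a small function. This is a sketch under assumptions: `tsh_as`, `TSH_PROFILES_DIR`, the directory layout, and the user names are all hypothetical and not part of the test plan.

```bash
# Hypothetical wrapper: runs tsh with a per-identity TELEPORT_HOME so that two
# logins can coexist on one machine. The subshell keeps TELEPORT_HOME from
# leaking into the calling shell.
tsh_as() {
  (
    export TELEPORT_HOME="${TSH_PROFILES_DIR:-$HOME/.tsh-profiles}/$1"
    shift
    tsh "$@"
  )
}

# Terminal 1 (session owner):
#   tsh_as user login --proxy=proxy.example.com --user=user
#   tsh_as user ssh node.example.com
# Terminal 2 (moderator):
#   tsh_as moderator login --proxy=proxy.example.com --user=moderator
```

Each identity gets its own profile directory, so logging in as the moderator does not overwrite the user's certificates.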

Ensure the default terminationPolicy of terminate has not been changed.

For each of the following cases, create a moderated session with the user using tsh ssh and join this session with the moderator using tsh join --mode moderator:

Performance @rosstimothy @fspmarshall @espadolini

Scaling Test

Scale up the number of nodes/clusters a few times for each configuration below.

1) Verify that there are no memory/goroutine/file descriptor leaks
2) Compare the baseline metrics with the previous release to determine if resource usage has increased
3) Restart all Auth instances and verify that all nodes/clusters reconnect
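One way to spot leaks between scale-up rounds is to snapshot a few process metrics from the Prometheus-format diagnostics endpoint. The sketch below assumes the instance was started with a diagnostics address (127.0.0.1:3000 is only an example) and relies on the standard Go client metric names; `extract_leak_metrics` is a hypothetical helper.

```bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# the goroutine count, open file descriptors, and resident memory as name=value.
extract_leak_metrics() {
  awk '$1 == "go_goroutines" ||
       $1 == "process_open_fds" ||
       $1 == "process_resident_memory_bytes" { print $1 "=" $2 }'
}

# Example usage (the address is an assumption, adjust to your --diag-addr):
#   curl -s http://127.0.0.1:3000/metrics | extract_leak_metrics
```

Taking a snapshot before scaling up, after scaling up, and after scaling back down makes it easier to see values that never return to the baseline.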

Perform reverse tunnel node scaling tests for all backend configurations:

Soak Test

Run a 30-minute soak test against direct-dial and tunnel nodes, and via label-based matching. Tests should be run against a Cloud tenant.

tsh bench ssh --duration=30m user@direct-dial-node ls
tsh bench ssh --duration=30m user@reverse-tunnel-node ls
tsh bench ssh --duration=30m user@foo=bar ls
tsh bench ssh --duration=30m --random user@foo ls

Concurrent Session Test

Run a concurrent session test that will spawn 5 interactive sessions per node in the cluster:

tsh bench web sessions --max=5000 user ls
tsh bench web sessions --max=5000 --web user ls

Robustness

Teleport with Cloud Providers

AWS @camscale

GCP @marcoandredinis

IBM @hugoShaka

Application Access @gabrielcorado

Database Access @greedy52

TLS Routing @greedy52

r0mant commented 4 months ago

Desktop Access @probakowski @ibeckermayer

Binaries / OS compatibility

Verify that our software runs on the minimum supported OS versions as per https://goteleport.com/docs/installation/#operating-system-support

Windows @ravicious

Azure offers virtual machines with the Windows 10 2016 LTSB image. This image runs on Windows 10 rev. 1607, which is the exact minimum Windows version that we support.

macOS @camscale

Linux @camscale

Machine ID @timothyb89

With an SSH node registered to the Teleport cluster:

With a Postgres DB registered to the Teleport cluster:

With a Kubernetes cluster registered to the Teleport cluster:

With an HTTP application registered to the Teleport cluster:

Host users creation @atburke

Host users creation docs

Host users creation RFD

CA rotations @fspmarshall

Proxy Peering

Proxy Peering docs

SSH Connection Resumption @fspmarshall

Verify that SSH works, and that resumable SSH is not interrupted across a Teleport Cloud tenant upgrade.

| | Standard node | Non-resuming node | Peered node | Agentless node |
|---|---|---|---|---|
| tsh ssh | [x] | [x] | [x] | [x] |
| tsh ssh --no-resume | [x] | [x] | [x] | [x] |
| Teleport Connect | [x] | [x] | [x] | [x] |
| Web UI (not resuming) | [x] | [x] | [x] | [x] |
| OpenSSH (standard tsh config) | [x] | [x] | [x] | [x] |
| OpenSSH (changing ProxyCommand to tsh proxy ssh --no-resume) | [x] | [x] | [x] | [x] |

Verify that SSH works, and that resumable SSH is not interrupted across a control plane restart (of either the root or the leaf cluster).

| | Tunnel node | Direct dial node |
|---|---|---|
| tsh ssh | [x] | [x] |
| tsh ssh --no-resume | [x] | [x] |
| tsh ssh (from a root cluster) | [x] | [x] |
| tsh ssh --no-resume (from a root cluster) | [x] | [x] |
| OpenSSH (without ProxyCommand) | n/a | [x] |
| OpenSSH's ssh-keyscan | n/a | [x] |

EC2 Discovery @marcoandredinis

EC2 Discovery docs

Azure Discovery @marcoandredinis

Azure Discovery docs

GCP Discovery @lxea

GCP Discovery docs

IP Pinning @AntonAM

Add a role with pin_source_ip: true (requires Enterprise) to test IP pinning. Testing will require changing your IP (that Teleport Proxy sees). Docs: IP Pinning
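A role for this test might look like the sketch below. Only the pin_source_ip option comes from the test plan; the role name, logins, and labels are placeholders, and `ip_pinning_role_yaml` is a hypothetical helper that prints the spec so it can be saved and created with tctl.

```bash
# Hypothetical helper: prints a minimal role spec with IP pinning enabled.
# Everything except `pin_source_ip: true` is a placeholder.
ip_pinning_role_yaml() {
  cat <<'EOF'
kind: role
version: v7
metadata:
  name: ip-pinned-access
spec:
  options:
    pin_source_ip: true
  allow:
    logins: ['{{internal.logins}}']
    node_labels:
      '*': '*'
EOF
}

# Example usage:
#   ip_pinning_role_yaml > ip-pinned-access.yaml
#   tctl create -f ip-pinned-access.yaml
```

After assigning the role and logging in, changing the client IP seen by the Proxy should cause access with the existing certificate to be rejected.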

Assist @jakule

Assist is not supported by tsh; the Web UI is the only way to use it. The Assist test plan is in the core section rather than the Web UI section because most of the functionality is implemented in the core.

IGS @smallinsky

Teleport SAML Identity Provider @flyinghermit

Verify SAML IdP service provider resource management.

Docs:

Manage Service Provider (SP)

SAML service provider catalog

Resources

Quick GitHub/SAML/OIDC Setup Tips

ravicious commented 4 months ago
nklaassen commented 4 months ago
atburke commented 4 months ago
timothyb89 commented 4 months ago
ravicious commented 4 months ago
bl-nero commented 4 months ago
capnspacehook commented 4 months ago
greedy52 commented 4 months ago

Not a regression though.

codingllama commented 4 months ago

Arguable if this is something we need to address, but it seemed better to "document" it anyway.

hugoShaka commented 4 months ago

Nitpick: the theme picker text is bugged

ibeckermayer commented 3 months ago
bl-nero commented 3 months ago
hugoShaka commented 3 months ago
ibeckermayer commented 3 months ago
ibeckermayer commented 3 months ago
rosstimothy commented 3 months ago

Performance Test Results

Cloud

Load Tests

10k Resources ![Screenshot 2024-06-10 at 9 06 47 AM](https://github.com/gravitational/teleport/assets/39066650/bf022dce-cf04-498b-900b-a7587c35ec86) ![Screenshot 2024-06-10 at 9 07 08 AM](https://github.com/gravitational/teleport/assets/39066650/1808ca72-84c5-4deb-a706-0cc3eba8b780)

Soak Tests

Origin: us-east-1 Target: us-east-1

```bash
tsh bench ssh --duration=30m root@node-agents-5b8c8bb49-zzh6r-09 /busybox/ls -lah /

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         241 ms
50         250 ms
75         262 ms
90         305 ms
95         393 ms
99         1286 ms
100        4959 ms
```

Origin: us-west-2 Target: us-east-1

```bash
tsh bench ssh --duration=30m root@node-agents-5b8c8bb49-zzh6r-09 /busybox/ls -lah /

* Requests originated: 17992
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         879 ms
50         890 ms
75         905 ms
90         952 ms
95         1196 ms
99         1795 ms
100        2997 ms
```

etcd[^1]

30k Resources ![backend1](https://github.com/gravitational/teleport/assets/39066650/dcc47835-a73c-41f8-a34b-62c3938b3c40) ![backend2](https://github.com/gravitational/teleport/assets/39066650/2753d05d-8590-4412-9d64-4ea039305c34) ![instance](https://github.com/gravitational/teleport/assets/39066650/ddbb291c-138a-4db5-a234-3802b62a2522)
500 Trusted Clusters ![500 TC](https://github.com/gravitational/teleport/assets/39066650/04d7b376-f8a7-4a85-a0c7-fecea5bb6997)

Postgres[^1]

30k Resources ![backend-1](https://github.com/gravitational/teleport/assets/39066650/fc9b95d8-88d2-421f-959e-431d1cb0dcf2) ![backend-2](https://github.com/gravitational/teleport/assets/39066650/3db218d9-6e02-4cd3-ab91-620d96741d7f) ![instance](https://github.com/gravitational/teleport/assets/39066650/374c541c-5899-4ca0-b4ee-ff493bd78c44)

Firestore[^1]

30k Resources ![backend1](https://github.com/gravitational/teleport/assets/39066650/e0bafad8-e606-489f-8bb4-9b34e3120442) ![backend2](https://github.com/gravitational/teleport/assets/39066650/0968503c-44a9-4de5-85f5-ca2139c46f03) ![instance](https://github.com/gravitational/teleport/assets/39066650/649fe5cb-f8b2-47ee-af88-7b4b3892e51a)

[^1]: 30k tests were performed using the simulated method described in the v14 Test Plan

greedy52 commented 3 months ago

Database Access load test (PostgreSQL and MySQL)

Setup

Same as the previous test, but in ca-central-1.

EKS with a single node group:

Teleport cluster (all deployed on the EKS cluster):

Databases:

Note: Databases were configured using discovery running inside the database agent.

tsh bench commands were executed inside the cluster.

MySQL

10 connections/second (90 Percentile 80ms)

```
# tsh bench mysql mysql-proxy-rdsproxy --db-user=mysql --db-name=mysql --rate=10 --duration=30m

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         62 ms
50         67 ms
75         74 ms
90         80 ms
95         85 ms
99         117 ms
100        703 ms
```

50 connections/second (90 Percentile 467ms)

```
# tsh bench mysql mysql-proxy-rdsproxy --db-user=mysql --db-name=mysql --rate=50 --duration=30m

* Requests originated: 89985
* Requests failed: 9
* Last error: io.ReadFull(header) failed. err EOF: connection was bad

Histogram

Percentile Response Duration
---------- -----------------
25         164 ms
50         246 ms
75         349 ms
90         467 ms
95         552 ms
99         736 ms
100        1424 ms
```

PostgreSQL

10 connections/second (90 Percentile 93ms)

```
# tsh bench postgres postgres-proxy-rdsproxy --db-user=postgres --db-name=postgres --rate=10 --duration=30m

* Requests originated: 18000
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         74 ms
50         80 ms
75         87 ms
90         93 ms
95         99 ms
99         201 ms
100        1077 ms
```

50 connections/second (90 Percentile 499ms)

```
# tsh bench postgres postgres-proxy-rdsproxy --db-user=postgres --db-name=postgres --rate=50 --duration=30m

* Requests originated: 89986
* Requests failed: 27
* Last error: failed to connect to `host=127.0.0.1 user=teleport database=teleport`: failed to receive message (unexpected EOF)

Histogram

Percentile Response Duration
---------- -----------------
25         183 ms
50         269 ms
75         375 ms
90         499 ms
95         586 ms
99         791 ms
100        2217 ms
```
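To put the failed requests in proportion, the failure rate can be computed from the two summary lines of the tsh bench output. This is a minimal sketch; `bench_failure_rate` is a hypothetical helper and assumes the summary lines keep the `* Requests originated:` / `* Requests failed:` form shown in the results above.

```bash
# Hypothetical helper: reads tsh bench output on stdin and prints the share of
# failed requests as a percentage with four decimal places.
bench_failure_rate() {
  awk -F': ' '
    /Requests originated:/ { total = $2 + 0 }
    /Requests failed:/     { failed = $2 + 0 }
    END {
      if (total > 0) printf "%.4f%%\n", failed / total * 100
      else print "n/a"
    }'
}

# Example usage (bench.log is a placeholder for saved tsh bench output):
#   bench_failure_rate < bench.log
```

For the 50 connections/second runs this gives 0.0100% for MySQL (9 of 89985) and 0.0300% for PostgreSQL (27 of 89986).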

Database Access resources count test

Setup

This is a one-time manual setup:

500 databases per agent, 50k keepalives

5k unique db resources in total.

Cloud Watch Dashboard

| Timestamp | Agent Count | db_server Count | Auth CPU% | Auth Mem% | Proxy CPU% | Proxy Mem% |
|---|---|---|---|---|---|---|
| 17:30 | 20 | 10,000 | 3% | 5% | 3% | 5% |
| 18:20 | 50 | 25,000 | 7% | 11% | 6% | 6% |
| 18:50 | 100 | 50,000 | 15% | 22% | 12% | 8% |
20 databases per agent, 10k keepalives

1k unique db resources in total.

Cloud Watch Dashboard

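| Timestamp | Agent Count | db_server Count | Auth CPU% | Auth Mem% | Proxy CPU% | Proxy Mem% |
|---|---|---|---|---|---|---|
| 16:00 | 100 | 2,000 | 1% | 3% | 1% | 5% |
| 16:20 | 250 | 5,000 | 2% | 4.5% | 3% | 6% |
| 16:40 | 500 | 10,000 | 3% | 6% | 3% | 8% |
| 17:00 | 0 | 0 | <1% | 3% | <1% | 3% |
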
GavinFrazar commented 3 months ago
Tener commented 3 months ago
greedy52 commented 3 months ago

Found by @Tener