gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 12 Test Plan #20132

Closed russjones closed 1 year ago

russjones commented 1 year ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh installation of the version to be released as well as an upgrade of the previous version of Teleport.

User accounting @tigrato

Combinations @capnspacehook

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @AntonAM

Teleport with multiple Kubernetes clusters @AntonAM

Note: you can use GKE, EKS, or minikube to run Kubernetes clusters. The one caveat is minikube: it is not publicly reachable, so don't run a proxy there.

Kubernetes auto-discovery @tigrato

Kubernetes Secret Storage @tigrato

Kubernetes Pod RBAC @tigrato

Teleport with FIPS mode @r0mant

ACME @mdwn

Migrations @r0mant @zmb3

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com
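The three OpenSSH templates differ only in whether the outer `ssh` also forwards the agent and in the optional `@cluster` suffix on the subsystem request. As a minimal sketch, the ProxyCommand string can be assembled by a small shell helper. `proxy_command` is a hypothetical name used for illustration, not part of Teleport or OpenSSH; port 3023 and the hostnames are taken from the templates above.

```shell
# Hypothetical helper: print the ProxyCommand string used in the templates.
# $1: proxy host (for example proxy.example.com)
# $2: optional remote cluster name (for example foo.com)
proxy_command() {
  proxy_host=$1
  cluster=${2:-}
  subsystem='proxy:%h:%p'
  # For a remote cluster, the subsystem request carries the cluster name.
  if [ -n "$cluster" ]; then
    subsystem="${subsystem}@${cluster}"
  fi
  # %%r prints a literal %r, which ssh later expands to the remote user.
  printf "ssh -o 'ForwardAgent yes' -p 3023 %%r@%s -s %s\n" "$proxy_host" "$subsystem"
}
```

It could then be used as `ssh -o "ProxyCommand $(proxy_command proxy.example.com foo.com)" node.foo.com`.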

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers @camscale

GitHub External SSO @Tener

tctl sso family of commands @Tener

For help with setting up SSO connectors, check out the Quick GitHub/SAML/OIDC Setup Tips.

tctl sso configure helps to construct a valid connector definition:

tctl sso test tests a provided connector definition, which can be loaded from a file or piped in from tctl sso configure or tctl get --with-secrets. Valid connectors are accepted; invalid ones are rejected with sensible error messages.

Teleport Plugins @greedy52

AWS Node Joining @gabrielcorado

Docs

Kubernetes Node Joining @gabrielcorado

Cloud Labels @GavinFrazar

Passwordless @codingllama

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-v10.0.0-alpha.2-linux-amd64-bin.tar.gz).

This section complements "Users -> Managing MFA devices". tsh binaries for each operating system (Linux, macOS and Windows) must be tested separately for FIDO2 items.

Device Trust @sfreiberg

Device Trust requires Teleport Enterprise.

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-v10.0.0-alpha.2-linux-amd64-bin.tar.gz).

Client-side enrollment requires a signed tsh for macOS; make sure to use the tsh binary from tsh.app.

A simple formula for testing device authorization is:

# Before enrollment.
# Replace with other kinds of access, as appropriate (db, kube, etc)
tsh ssh node-that-requires-device-trust
> ERROR: ssh: rejected: administratively prohibited (unauthorized device)
# Register the device.
# Get the serial number from "Apple -> About This Mac".
tctl devices add --os=macos --asset-tag=<SERIAL_NUMBER> --enroll
# Enroll the device.
tsh device enroll --token=<TOKEN_FROM_COMMAND_ABOVE>
tsh logout; tsh login
# After enrollment
tsh ssh node-that-requires-device-trust
> $

Hardware Key Support @Joerger

Hardware Key Support is an Enterprise feature and is not available for OSS.

You will need a YubiKey 4.3+ to test this feature.

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-ent-v11.0.0-alpha.2-linux-amd64-bin.tar.gz).

Server Access @Joerger

These tests should be carried out sequentially. tsh tests should be carried out on Linux, macOS, and Windows.

  1. [x] tsh login as user with Webauthn login and no hardware key requirement.
  2. [x] Request a role with role.role_options.require_session_mfa: hardware_key - tsh login --request-roles=hardware_key_required
    • [x] Assuming the role should force automatic re-login with yubikey
    • [x] tsh ssh
    • [x] Requires yubikey to be connected for re-login
    • [x] Prompts for per-session MFA
  3. [x] Request a role with role.role_options.require_session_mfa: hardware_key_touch - tsh login --request-roles=hardware_key_touch_required
    • [x] Assuming the role should force automatic re-login with yubikey
    • [x] Prompts for touch if not cached (last touch within 15 seconds)
    • [x] tsh ssh
    • [x] Requires yubikey to be connected for re-login
    • [x] Prompts for touch if not cached
  4. [x] tsh logout and tsh login as the user with no hardware key requirement.
  5. [x] Upgrade auth settings to auth_service.authentication.require_session_mfa: hardware_key
    • [x] Using the existing login session (tsh ls) should force automatic re-login with yubikey
    • [x] tsh ssh
    • [x] Requires yubikey to be connected for re-login
    • [x] Prompts for per-session MFA
  6. [x] Upgrade auth settings to auth_service.authentication.require_session_mfa: hardware_key_touch
    • [x] Using the existing login session (tsh ls) should force automatic re-login with yubikey
    • [x] Prompts for touch if not cached
    • [x] tsh ssh
    • [x] Requires yubikey to be connected for re-login
    • [x] Prompts for touch if not cached

Other @GavinFrazar

Set auth_service.authentication.require_session_mfa: hardware_key_touch in your cluster auth settings.

Performance @rosstimothy @fspmarshall

Perform all tests on the following configurations:

Soak Test @rosstimothy @fspmarshall

Run 30 minute soak test with a mix of interactive/non-interactive sessions for both direct and reverse tunnel nodes:

tsh bench --duration=30m user@direct-dial-node ls
tsh bench -i --duration=30m user@direct-dial-node ps uax

tsh bench --duration=30m user@reverse-tunnel-node ls
tsh bench -i --duration=30m user@reverse-tunnel-node ps uax

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.
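The four soak invocations follow one pattern per node: a non-interactive `ls` run and an interactive `ps uax` run. A minimal sketch that generates them for any list of nodes is shown here. `soak_commands` is a hypothetical helper name; it only prints the commands and uses no flags beyond those already shown above.

```shell
# Hypothetical helper: print the soak-test commands for each node.
# $1: login user; remaining arguments: node names.
soak_commands() {
  user=$1
  shift
  for node in "$@"; do
    # Non-interactive session benchmark.
    printf 'tsh bench --duration=30m %s@%s ls\n' "$user" "$node"
    # Interactive session benchmark.
    printf 'tsh bench -i --duration=30m %s@%s ps uax\n' "$user" "$node"
  done
}
```

Running `soak_commands user direct-dial-node reverse-tunnel-node` prints the four commands listed above. Piping that output to `sh` would run them one after another; whether the plan intends sequential or concurrent runs is not stated, so treat that as an assumption.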

Concurrent Session Test

Run a concurrent session test that will spawn 5 interactive sessions per node in the cluster:

tsh bench sessions --max=5000 user ls
tsh bench sessions --max=5000 --web user ls 

Robustness @rosstimothy @fspmarshall

Teleport with Cloud Providers @hugoShaka

AWS @hugoShaka

GCP @hugoShaka

IBM @hugoShaka

Application Access @mdwn

Database Access @smallinsky

TLS Routing @smallinsky

Desktop Access @ibeckermayer

Binaries compatibility @tobiaszheller

Machine ID @timothyb89

SSH

With a default Teleport instance configured with a SSH node:

Ensure the above tests are completed for both:

DB Access

With a default Postgres DB instance, a Teleport instance configured with DB access and a bot user configured:

Host users creation @lxea

Host users creation docs, Host users creation RFD

CA rotations @espadolini

EC2 Discovery @lxea

EC2 Discovery docs

Documentation @ptgott @alexfornuto

Checks should be performed on the version of documentation corresponding to the major release we're testing for. For example, for the Teleport 12 release, use the branch/v12 branch and make sure to select "Version 12.0" in the documentation version switcher.

Resources

Quick GitHub/SAML/OIDC Setup Tips

Tener commented 1 year ago

@mdwn I've added these points to the test plan, seemingly under your remit:

- [ ] Verify [Azure CLI access](https://goteleport.com/docs/ver/12.x/application-access/guides/azure/) with `tsh app login`.
  - [ ] Can interact with Azure using `tsh az` commands.
  - [ ] Can interact with Azure using a combination of `tsh proxy az` and `az` commands.
- [ ] Verify [GCP CLI access](https://github.com/gravitational/teleport/pull/19905) with `tsh app login`.
  - [ ] Can interact with GCP using `tsh gcloud` commands.
  - [ ] Can interact with Google Cloud Storage using `tsh gsutil` commands.
  - [ ] Can interact with GCP/GCS using a combination of `tsh proxy gcloud` and `gcloud`/`gsutil` commands.

Both Azure and GCP integrations are on master and will be part of the cut tomorrow. The PR for the GCP docs is in review https://github.com/gravitational/teleport/pull/19905, but slightly out of sync with the implementation; this will be corrected early next week. Please don't hesitate to ask me for any clarifications or tips.

I'll send a PR updating the test plan template too.

GavinFrazar commented 1 year ago

I've added/edited these points to the test plan (for discovery and connect via local/remote cluster):

  - [ ] Azure single-server MySQL and Postgres
  - [ ] Azure flexible-server MySQL and Postgres

Added them to the connect test because the flexible-server integration required an update to the way we modify the db username in the engine.

Forgot to update the test plan template in #19759. I'll open a PR to update that template now as well.

espadolini commented 1 year ago

Added "Changing role map of existing Trusted Cluster" here and in #20325

tigrato commented 1 year ago

Added in #20274

hugoShaka commented 1 year ago

tctl does not default to local auth: https://github.com/gravitational/teleport/issues/20346

timothyb89 commented 1 year ago

Regression in tsh breaks identity file loading (affects most tbot proxying features, including ssh/db access): https://github.com/gravitational/teleport/issues/20373

and another small issue where tbot's ssh_config forgets nonstandard ports: https://github.com/gravitational/teleport/issues/20378

GavinFrazar commented 1 year ago

#20384: issue with PIV YubiKey integration

mdwn commented 1 year ago

AWS console is inaccessible via the Teleport UI: https://github.com/gravitational/teleport/issues/20385

tigrato commented 1 year ago

The default access role misses permissions to list pods #20401

codingllama commented 1 year ago

tsh login --auth=local uses platform passwordless if it can (#20429). Not a huge deal, as it does respect other settings/flags, but I'll take a look.

mdwn commented 1 year ago

Setting Azure identities doesn't work for all valid characters in an identity string: https://github.com/gravitational/teleport/issues/20434

hugoShaka commented 1 year ago

Helm chart deadlock: https://github.com/gravitational/teleport/pull/20488

GavinFrazar commented 1 year ago

teleport db configure create --azure-sqlserver-discovery=$region generates invalid config YAML. Fix here: https://github.com/gravitational/teleport/pull/20496

codingllama commented 1 year ago

I've found some issues with device trust, unusual verbs (create_enroll_token and enroll) and RoleAdmin. Pushing patches soon. (FYI @sfreiberg.)

codingllama commented 1 year ago

> I've found some issues with device trust, unusual verbs (create_enroll_token and enroll) and RoleAdmin. Pushing patches soon. (FYI @sfreiberg.)

Promised patches: #20505 and https://github.com/gravitational/teleport.e/pull/724. We'll need an e/ bump on branch/v12 after all is done.

codingllama commented 1 year ago

tctl devices rm issue: #20506

camscale commented 1 year ago

Okta SSO documentation setup issue: https://github.com/gravitational/teleport/issues/20538

rosstimothy commented 1 year ago

etcd Load Testing

Agent Mesh

10k Tunnel Nodes

https://teleportcoreteam.grafana.net/goto/9JtLQdTVz?orgId=1

10k Direct Dial Nodes

https://teleportcoreteam.grafana.net/goto/ss-CjOoVk?orgId=1

500 Trusted Cluster

https://teleportcoreteam.grafana.net/goto/yNS_PHo4z?orgId=1

Soak Test

----Direct Dial Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-6f44d86564-lqmrg ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         150 ms
50         152 ms
75         156 ms
90         160 ms
95         165 ms
99         191 ms
100        455 ms

----Reverse Tunnel Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-86b9c86bff-fgrxs ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         146 ms
50         150 ms
75         154 ms
90         162 ms
95         170 ms
99         191 ms
100        411 ms
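As a quick sanity check on these numbers, the request count and run duration imply the sustained request rate. The sketch below does that arithmetic with `awk`; `bench_rate` is a hypothetical helper name, and the inputs are the figures reported above (17999 requests over 30 minutes).

```shell
# Hypothetical helper: requests per second from a request count and a
# duration in minutes.
bench_rate() {
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.2f\n", n / (m * 60) }'
}

bench_rate 17999 30   # prints 10.00
```

Both runs therefore sustained roughly 10 requests per second with no failures.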

Proxy Peering

10k Tunnel Nodes

https://teleportcoreteam.grafana.net/goto/5i-Am-T4z?orgId=1

10k Direct Dial Nodes

https://teleportcoreteam.grafana.net/goto/cv55cFT4z?orgId=1

500 Trusted Cluster

https://teleportcoreteam.grafana.net/goto/Oog2Abo4z?orgId=1

Soak Test

----Direct Dial Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-6f44d86564-frcnx ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         147 ms
50         149 ms
75         153 ms
90         156 ms
95         160 ms
99         188 ms
100        352 ms

----Reverse Tunnel Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-86b9c86bff-x4pbr ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         164 ms
50         169 ms
75         176 ms
90         186 ms
95         191 ms
99         209 ms
100        439 ms

ravicious commented 1 year ago

~teleport-ent macOS binaries are not signed.~ https://github.com/gravitational/teleport.e/issues/741

Edit: It turns out it's a known issue.

codingllama commented 1 year ago

App Access and require_session_mfa issue: #20634.

alexfornuto commented 1 year ago

~Docker tctl command failure: https://github.com/gravitational/teleport/issues/20637.~ Resolved

capnspacehook commented 1 year ago

Error connecting to leaf OpenSSH node: https://github.com/gravitational/teleport/issues/20703

GavinFrazar commented 1 year ago

> App Access and require_session_mfa issue: #20634.

Just want to note that my only remaining unchecked task is to use require_session_mfa: hardware_key_touch with app access, which also does not work (obviously)

codingllama commented 1 year ago

> > App Access and require_session_mfa issue: #20634.
>
> Just want to note that my only remaining unchecked task is to use require_session_mfa: hardware_key_touch with app access, which also does not work (obviously)

Isn't hardware_key_touch functionally equivalent to no session MFA, due to the checks being client-side? Maybe chat with @Joerger to figure out if it's actually supposed to work.

Joerger commented 1 year ago

> > Just want to note that my only remaining unchecked task is to use require_session_mfa: hardware_key_touch with app access, which also does not work (obviously)
>
> Isn't hardware_key_touch functionally equivalent to no session MFA, due to the checks being client-side? Maybe chat with @Joerger to figure out if it's actually supposed to work.

Yes, hardware_key_touch does actually work in v12, but hardware_key does not as it is functionally equivalent to require_session_mfa: yes. I've tested it so I'll check it off on the test plan.

Edit: I was wrong on this and the test I performed was inadequate (tsh proxy app does prompt for tap, but using the connection fails due to the app session using a different key on the server). This test should have been removed before as this lack of support was already discovered and documented.

GavinFrazar commented 1 year ago

tsh proxy aws --endpoint-url not working in alpha.1 & alpha.2: https://github.com/gravitational/teleport/issues/20798

I think this broke when I did some refactoring work in app access.

GavinFrazar commented 1 year ago

tsh db connect fails when PIV is enabled (not in all configurations): https://github.com/gravitational/teleport/issues/20799

I didn't notice this in my first pass through the test plan because I didn't trip on one of the configurations that has the issue.

fspmarshall commented 1 year ago

DynamoDB

Direct Dial Scaling

10k-direct-scaling

Direct Dial Soak

$ tsh bench --duration=30m root@ip-172-31-8-224-us-west-2-compute-internal ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         156 ms
50         165 ms
75         178 ms
90         188 ms
95         194 ms
99         219 ms
100        2767 ms

$ tsh bench --interactive --duration=30m root@ip-172-31-8-224-us-west-2-compute-internal ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         164 ms
50         174 ms
75         186 ms
90         194 ms
95         201 ms
99         223 ms
100        1677 ms
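To compare runs, it can help to pull a single percentile out of the pasted histogram, for example the 99th percentile (219 ms) against the maximum (2767 ms) in the first run above. A minimal sketch follows. `bench_percentile` is a hypothetical helper name, and it assumes the output layout matches the tables shown here (percentile, value, then the unit `ms`).

```shell
# Hypothetical helper: read `tsh bench` histogram output on stdin and print
# the response duration in ms for the requested percentile row.
# $1: percentile to look up (25, 50, 75, 90, 95, 99, or 100).
bench_percentile() {
  awk -v p="$1" '$1 == p && $3 == "ms" { print $2; exit }'
}
```

For example, `tsh bench --duration=30m user@node ls | bench_percentile 99` would print only the p99 value, assuming the format above.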

Tunnel Scaling

10k-tunnel-scaling

Tunnel Soak

$ tsh bench --duration=30m root@ip-172-31-8-224-us-west-2-compute-internal ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         176 ms
50         183 ms
75         192 ms
90         204 ms
95         212 ms
99         250 ms
100        2229 ms

$ tsh bench --interactive --duration=30m root@ip-172-31-8-224-us-west-2-compute-internal ps aux

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         184 ms
50         191 ms
75         201 ms
90         213 ms
95         220 ms
99         260 ms
100        2805 ms

fspmarshall commented 1 year ago

Note about ssh agent forwarding and ssh file copying RBAC tests: the RBAC section of the test plan says we expect access denied to show up in the audit log for all items, but ssh agent forwarding and ssh file copying do not generate access denied events. I checked the code for these checks, and it appears that they aren't currently intended to emit events, so this doesn't seem to be a regression.

I've marked these sections as complete in the test plan because Teleport seems to be working as intended, but it may be worth considering adding access denied events for these items.

alexfornuto commented 1 year ago

Verify Teleport versions throughout documentation are correct and reflect upcoming release:

https://github.com/gravitational/teleport/pull/20643

hugoShaka commented 1 year ago

teleport-cluster chart v12 breaks when used with etcd backend: https://github.com/gravitational/teleport/issues/20960

tigrato commented 1 year ago

Pod RBAC fails if the Kubernetes Vendor runs with compression enabled: #20980 - PR: #20981

@hugoShaka detected this issue when running a test in the IBM cloud

ibeckermayer commented 1 year ago

Issue and fix here: https://github.com/gravitational/teleport/pull/21009

alexfornuto commented 1 year ago

Verify upcoming releases page is accurate:

This is under Documentation, but we don't have insight into what's being slated for the next release. I've prepped a page in #21283 for data to be added to, but maybe this should be considered part of #21317