gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 11 Test Plan #16951

Closed r0mant closed 2 years ago

r0mant commented 2 years ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh install of the version to be released and an upgrade from the previous version of Teleport.

User accounting @jakule

Combinations @nklaassen

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @tigrato

Teleport with multiple Kubernetes clusters @AntonAM

Note: you can use GKE or EKS or minikube to run Kubernetes clusters. Minikube is the only caveat - it's not reachable publicly so don't run a proxy there.

Kubernetes auto-discovery @tigrato

Kubernetes Secret Storage @tigrato

Teleport with FIPS mode @alistanis

ACME @alistanis

Migrations @jakule

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com
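
The three templates above differ only in the optional `@<cluster>` suffix on the subsystem request. As a rough sketch, the ProxyCommand string could be assembled by a small helper; `proxy_command` is a hypothetical name, and the port and hostnames are simply the ones used in the templates:

```shell
# Build the OpenSSH ProxyCommand used in the templates above.
# Usage: proxy_command PROXY_HOST [REMOTE_CLUSTER]
# %r, %h and %p are expanded later by ssh itself, so they stay literal here.
proxy_command() {
  local proxy_host="$1" cluster="${2:-}"
  local subsystem='proxy:%h:%p'
  # For a remote cluster, the subsystem request carries "@<cluster>".
  if [ -n "$cluster" ]; then
    subsystem="${subsystem}@${cluster}"
  fi
  printf "ssh -o 'ForwardAgent yes' -p 3023 %%r@%s -s %s\n" "$proxy_host" "$subsystem"
}

# Dry run: print the full ssh invocation instead of executing it.
printf 'ssh -o "ProxyCommand %s" node.foo.com\n' \
  "$(proxy_command proxy.example.com foo.com)"
```

Printing the command first makes it easy to compare against the templates before running it against a real cluster.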

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers

tctl sso family of commands @Tener

For help with setting up SSO connectors, check out the Quick GitHub/SAML/OIDC Setup Tips.

tctl sso configure helps to construct a valid connector definition:

tctl sso test tests a provided connector definition, which can be loaded from a file or piped in from tctl sso configure or tctl get --with-secrets. Valid connectors are accepted; invalid ones are rejected with sensible error messages.

Teleport Plugins @hugoShaka

AWS Node Joining @nklaassen

Docs

Passwordless @codingllama

Passwordless requires tsh compiled with libfido2 for most operations (apart from Touch ID). Ask for a statically-built tsh binary for realistic tests.

Touch ID requires a properly built and signed tsh binary. Ask for a pre-release binary, so you may run the tests.

This section complements "Users -> Managing MFA devices". tsh binaries for each operating system (Linux, macOS and Windows) must be tested separately for FIDO2 items.

Hardware Key Support @Joerger

Hardware Key Support is an Enterprise feature and is not available for OSS.

You will need a YubiKey 4.3+ to test this feature.

This feature has additional build requirements, so it should be tested with a pre-release build from Drone (e.g. https://get.gravitational.com/teleport-ent-v11.0.0-alpha.2-linux-amd64-bin.tar.gz).

These tests should be carried out sequentially. tsh tests should be carried out on Linux, macOS, and Windows.

Performance @rosstimothy @fspmarshall

Perform all tests on the following configurations:

Soak Test

Run a 30-minute soak test with a mix of interactive and non-interactive sessions for both direct and reverse tunnel nodes:

tsh bench --duration=30m user@direct-dial-node ls
tsh bench -i --duration=30m user@direct-dial-node ps uax

tsh bench --duration=30m user@reverse-tunnel-node ls
tsh bench -i --duration=30m user@reverse-tunnel-node ps uax
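
The four commands above follow one pattern per node, so they can be driven from a loop. The following is only a sketch: `run`, `soak` and the `DRY_RUN` variable are hypothetical names, and by default the commands are printed rather than executed.

```shell
# Print the command when DRY_RUN is 1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Usage: soak USER NODE...
# Runs one non-interactive and one interactive benchmark per node, in order.
soak() {
  local user="$1" node
  shift
  for node in "$@"; do
    run tsh bench --duration=30m "${user}@${node}" ls
    run tsh bench -i --duration=30m "${user}@${node}" ps uax
  done
}

soak user direct-dial-node reverse-tunnel-node
```

Setting `DRY_RUN=0` would execute the benchmarks sequentially, which takes about two hours for the two nodes shown.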

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.
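
One way to spot a leak is to snapshot a metric before and after the soak test and compare the two values. The sketch below assumes the metrics are available in the Prometheus text format and that the standard Go client metrics `go_goroutines` and `process_open_fds` are exposed; the function names, the growth threshold, and the endpoint in the comment are assumptions, not part of the test plan.

```shell
# Read Prometheus text-format metrics on stdin and print the value of one
# unlabelled metric, e.g. go_goroutines or process_open_fds.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Succeed if the metric grew by no more than MAX_GROWTH between two snapshots.
# Usage: no_leak BEFORE AFTER MAX_GROWTH
no_leak() {
  awk -v b="$1" -v a="$2" -v m="$3" 'BEGIN { exit !((a - b) <= m) }'
}

# Against a live process this might be fed from something like
#   curl -s http://127.0.0.1:3000/metrics | metric_value go_goroutines
# (assumed diagnostics address). Here a canned sample is used instead.
sample='# TYPE go_goroutines gauge
go_goroutines 412
process_open_fds 87'
printf '%s\n' "$sample" | metric_value go_goroutines
```

A dashboard is still the better tool for judging trends; a check like this only catches values that fail to return to their baseline.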

Concurrent Session Test

Run a concurrent session test that will spawn 5 interactive sessions per node in the cluster:

tsh bench sessions --max=5000 user ls
tsh bench sessions --max=5000 --web user ls 

Teleport with Cloud Providers

AWS @hugoShaka

GCP @AntonAM

IBM @atburke

Application Access @mdwn

Database Access @smallinsky + db access team

TLS Routing @smallinsky

Desktop Access @ibeckermayer @probakowski @LKozlowski

Binaries compatibility @fheinecke

Machine ID @timothyb89

SSH

With a default Teleport instance configured with a SSH node:

Ensure the above tests are completed for both:

DB Access

With a default Postgres DB instance, a Teleport instance configured with DB access and a bot user configured:

Host users creation @lxea

Host users creation docs

Host users creation RFD

CA rotations @espadolini

EC2 Discovery @lxea

EC2 Discovery docs

Resources

Quick GitHub/SAML/OIDC Setup Tips

rosstimothy commented 2 years ago

I ran into some issues with the new config v3 changes: https://github.com/gravitational/teleport/issues/17118

nklaassen commented 2 years ago

So far I am unable to ssh to an OpenSSH node using tsh or the Web UI.

On the ssh node, I see userauth_pubkey: unsupported public key algorithm: rsa-sha2-256-cert-v01@openssh.com [preauth]

tsh and the Web UI show ERROR: access denied to ec2-user connecting to <ip> on cluster <my cluster>.

The connection works using the OpenSSH client connecting through the teleport proxy

The SSH node is an EC2 instance running the latest Amazon Linux 2; the sshd version is OpenSSH_7.4p1.

Edit: I get the same error running Teleport v10.0.0.

Edit 2: With a newer sshd, tsh begins to work but the OpenSSH client stops working. Filed an issue with details: https://github.com/gravitational/teleport/issues/17197
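
When reproducing this, it may help to check whether the OpenSSH build on the node knows the certificate algorithm named in the sshd error. A minimal sketch, assuming the list of key types is supplied one per line on stdin; `supports_algorithm` is a hypothetical helper, and the sample list is illustrative rather than real output from OpenSSH_7.4p1:

```shell
# Succeed if the algorithm named in $1 appears as a whole line on stdin.
supports_algorithm() {
  grep -qxF -- "$1"
}

# On a real host the list could come from the local client build, e.g.
#   ssh -Q key-cert | supports_algorithm rsa-sha2-256-cert-v01@openssh.com
# Illustrative list for an older build that lacks the SHA-2 RSA cert types:
old_types='ssh-ed25519-cert-v01@openssh.com
ssh-rsa-cert-v01@openssh.com'

if printf '%s\n' "$old_types" | supports_algorithm rsa-sha2-256-cert-v01@openssh.com; then
  echo "supported"
else
  echo "not supported"
fi
```

Note that `ssh -Q` reports what the client binary supports, so it is only a proxy for the sshd installed from the same package.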

mdwn commented 2 years ago

tsh ssh tests

tsh ssh host command spams an auditd error for regular or remote nodes running in docker: https://github.com/gravitational/teleport/issues/17185

tsh play seems to have a default API domain of teleport.cluster.local when attempting to play a remote recording: https://github.com/gravitational/teleport/issues/17192

application access tests

teleport app start outputs the wrong flags during a misconfiguration: https://github.com/gravitational/teleport/issues/17264

teleport configure for app_servers produces invalid/deprecated YAML: https://github.com/gravitational/teleport/issues/17268

general

tctl create with no arguments blocks forever: https://github.com/gravitational/teleport/issues/17271

Joerger commented 2 years ago

tsh proxy ssh -J <leaf-proxy> doesn't work with root shut down - https://github.com/gravitational/teleport/issues/17184

ibeckermayer commented 2 years ago

Desktop Access clipboard sharing is broken -- https://github.com/gravitational/teleport/issues/17195

jakule commented 2 years ago

Enhanced recording, aka BPF, seems to be broken on v11.

https://github.com/gravitational/teleport/issues/17203

espadolini commented 2 years ago

v10 leaf clusters are mostly unusable from v11 roots: #17211

rosstimothy commented 2 years ago

etcd Load Testing

Agent Mesh

10k Tunnel Nodes

https://teleportcoreteam.grafana.net/goto/c6BFvMI4z?orgId=1

10k Direct Dial Nodes

https://teleportcoreteam.grafana.net/goto/SX6JDGI4z?orgId=1

500 Trusted Cluster

https://teleportcoreteam.grafana.net/goto/tuTUDGIVz?orgId=1

Soak Test

----Direct Dial Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-77d968c88-d8mlt ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         162 ms
50         167 ms
75         173 ms
90         181 ms
95         189 ms
99         211 ms
100        484 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@node-77d968c88-d8mlt ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         163 ms
50         168 ms
75         174 ms
90         181 ms
95         189 ms
99         208 ms
100        434 ms

----Reverse Tunnel Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-785fb8fc99-999nx ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         164 ms
50         169 ms
75         174 ms
90         181 ms
95         186 ms
99         203 ms
100        404 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@iot-node-785fb8fc99-999nx ps aux

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         164 ms
50         170 ms
75         175 ms
90         181 ms
95         187 ms
99         208 ms
100        456 ms
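
Results in this format can be checked mechanically rather than by eye. The sketch below parses the output layout shown above and fails if any request failed or a chosen percentile exceeds a limit; `check_bench` and the 250 ms limit are hypothetical and not an agreed acceptance threshold.

```shell
# Read `tsh bench` output on stdin.
# Usage: check_bench PERCENTILE LIMIT_MS
# Exit status: 0 = ok, 1 = failures or latency over the limit, 2 = percentile row missing.
check_bench() {
  awk -v p="$1" -v limit="$2" '
    /Requests failed:/      { failed = $NF }
    $1 == p && $3 == "ms"   { latency = $2 }
    END {
      if (latency == "") { print "percentile not found"; exit 2 }
      printf "failed=%d p%s=%dms\n", failed, p, latency
      exit (failed > 0 || latency > limit)
    }'
}

# Sample taken from the direct dial result above.
check_bench 99 250 <<'EOF'
* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
95         189 ms
99         211 ms
100        484 ms
EOF
```

For the sample this prints `failed=0 p99=211ms` and exits with status 0.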

Proxy Peering

10k Tunnel Nodes

https://teleportcoreteam.grafana.net/goto/XXiMOGIVk?orgId=1

10k Direct Dial Nodes

https://teleportcoreteam.grafana.net/goto/CKcndGI4z?orgId=1

500 Trusted Cluster

https://teleportcoreteam.grafana.net/goto/34V4OGSVk?orgId=1

Soak Test

----Direct Dial Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-77d968c88-vtkdv ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         157 ms
50         162 ms
75         167 ms
90         173 ms
95         178 ms
99         200 ms
100        427 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@node-77d968c88-vtkdv ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         158 ms
50         162 ms
75         167 ms
90         172 ms
95         176 ms
99         198 ms
100        425 ms

----Reverse Tunnel Node Test----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-785fb8fc99-tgdc8 ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         162 ms
50         167 ms
75         173 ms
90         179 ms
95         185 ms
99         204 ms
100        438 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@iot-node-785fb8fc99-tgdc8 ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         162 ms
50         167 ms
75         174 ms
90         181 ms
95         188 ms
99         208 ms
100        336 ms

fspmarshall commented 2 years ago

DynamoDB

10k Direct Dial Scaling

loadtest-v11-10k-non-iot

Direct Dial Soak

$ tsh bench --duration=30m <user>@<host> ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         171 ms            
50         179 ms            
75         188 ms            
90         197 ms            
95         205 ms            
99         259 ms            
100        1845 ms

$ tsh bench --duration=30m --interactive <user>@<host> ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         177 ms            
50         186 ms            
75         194 ms            
90         205 ms            
95         215 ms            
99         306 ms            
100        2251 ms

10k Tunnel Scaling

loadtest-v11-10k-iot

Tunnel Soak

$ tsh bench --duration=30m <user>@<host> ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         155 ms            
50         161 ms            
75         179 ms            
90         184 ms            
95         188 ms            
99         214 ms            
100        1186 ms

$ tsh bench --duration=30m --interactive <user>@<host> ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         155 ms            
50         161 ms            
75         179 ms            
90         184 ms            
95         188 ms            
99         211 ms            
100        469 ms
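
Every soak run in this thread reports roughly 17999 requests over 30 minutes. As a quick worked check of what that means per second (`requests_per_second` is a hypothetical helper):

```shell
# Usage: requests_per_second REQUESTS MINUTES
requests_per_second() {
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.2f\n", n / (m * 60) }'
}

requests_per_second 17999 30   # prints 10.00
```

A steady rate of about 10 requests per second in every run suggests the benchmark was pacing its requests rather than saturating the cluster. That reading is an inference from the numbers, not something stated in the results.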

500 Trusted Clusters

500-tc

Upgrade At Scale

In addition to normal scaling tests, I did a step-by-step upgrade of a 10K node DynamoDB cluster in order to assess the DynamoDB usage differences between v10.2 and v11.0.0-alpha.2. This was done to gauge the effects of https://github.com/gravitational/teleport/pull/16911 on DynamoDB read capacity.

Below are two DynamoDB stats page images. The first shows a v10.2 cluster being restarted, and the second shows the same restart procedure being used to apply an upgrade from v10.2 to v11.0.0-alpha.2 (we use a non-upgrading restart as the comparison point since it helps us control for the load created by cache resets and disruption of heartbeats):

loadtest-v10-restart-2 loadtest-v11-upgrade

Note the difference in the "read usage" sections between the restart and upgrade cases. Both show a similar large spike immediately after restart due to cache resets, but the upgrade case stabilizes at a much higher average read usage (~29 vs ~1.5). In absolute terms, a read usage of 29 for a 10k-node cluster is practically nothing, but the proportional difference between the resting rate before and after https://github.com/gravitational/teleport/pull/16911 does make me nervous. Such a jump might negatively impact users with very high numbers of peak concurrent sessions if they have fine-tuned their DynamoDB read capacity to just barely accommodate their existing load. We don't recommend doing that, and we generally encourage people to use on-demand capacity, but it still gives me pause. I haven't made up my mind yet, but I think I might revert the compare-and-swap semantics introduced in https://github.com/gravitational/teleport/pull/16911 in favor of an approach that has a lower impact.
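
To put the proportional difference in concrete terms, the two approximate resting rates quoted above can be divided directly. This is plain arithmetic on the rounded figures (`ratio` is a hypothetical helper), so the result is only as precise as the ~29 and ~1.5 estimates:

```shell
# Usage: ratio AFTER BEFORE
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 29 1.5   # prints 19.3
```

In other words, the resting read usage after the upgrade is roughly 19 times the pre-upgrade figure, even though both values are small in absolute terms.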

smallinsky commented 2 years ago

Small issue with Snowflake DB Access, affecting the tctl auth sign call on a leaf cluster in a setup with multiple trusted clusters: https://github.com/gravitational/teleport/issues/17262

PR with a fix https://github.com/gravitational/teleport/pull/17263

fspmarshall commented 2 years ago

Opted to revert compare-and-swap node heartbeats based on dynamo stats in https://github.com/gravitational/teleport/issues/16951#issuecomment-1273939513.

PR with fix: https://github.com/gravitational/teleport/pull/17308

jdconti commented 2 years ago

Can we please add X11 tests as a non-root user to this (and future) test plans? Thanks!

ibeckermayer commented 2 years ago

Desktop Access clipboard sharing is broken -- #17195

Webapps PR with the fix is here https://github.com/gravitational/webapps/pull/1250 ~Ideally https://github.com/gravitational/webapps/pull/1251 gets merged and backported as well~

Update: resolved

Joerger commented 2 years ago

Hardware key support broke between v11.0.0-alpha.2 and v11.0.0-beta.1 - https://github.com/gravitational/teleport/issues/17415

Edit: False alarm. It fails only in proxy recording mode, which is expected. I've added the Hardware Key Support tests to the test plan to double-check everything with v11.0.0-beta.1.

jakule commented 2 years ago

/var/log/wtmp is not being updated correctly https://github.com/gravitational/teleport/pull/17416

tigrato commented 2 years ago

Teleport Kube Agent Chart hook is failing due to a wrong find & replace #17437

espadolini commented 2 years ago

hugoShaka commented 2 years ago

The OneLogin SSO integration guide still works, but a couple of screenshots and concepts need an update: https://github.com/gravitational/teleport/issues/17485

codingllama commented 2 years ago

tsh / Windows: tsh mfa add for OTPs doesn't show me the QR code. (Typing the key still works.)

FYI @tobiaszheller

codingllama commented 2 years ago

Raised https://github.com/gravitational/teleport/issues/17563 and https://github.com/gravitational/teleport/issues/17564, neither is blocking for the release.

jakule commented 2 years ago

Created https://github.com/gravitational/teleport/issues/17572