gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 9.0 Test Plan #10446

Closed r0mant closed 2 years ago

r0mant commented 2 years ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh install of the version to be released as well as an upgrade of the previous version of Teleport.

User accounting @xacrimon

Combinations @atburke

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @jimbishopp

Teleport with multiple Kubernetes clusters @jimbishopp

Note: you can use GKE or EKS or minikube to run Kubernetes clusters. Minikube is the only caveat - it's not reachable publicly so don't run a proxy there.

Teleport with FIPS mode @r0mant

ACME @lxea

Migrations @r0mant @zmb3

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com
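
The three OpenSSH templates above differ only in the proxy address and in an optional `@<cluster>` suffix on the subsystem request. As a minimal sketch, assuming the default SSH proxy port 3023 used above, the ProxyCommand string could be built by a small helper. The function name `teleport_proxy_command` is hypothetical and is not part of Teleport or OpenSSH.

```shell
#!/bin/sh
# Hypothetical helper: prints the ProxyCommand string used in the templates above.
#   $1 = proxy host
#   $2 = optional remote (trusted) cluster name
teleport_proxy_command() {
  proxy_host="$1"
  cluster="${2:-}"
  subsystem='proxy:%h:%p'
  if [ -n "$cluster" ]; then
    # Remote clusters are selected by appending @<cluster> to the subsystem request.
    subsystem="${subsystem}@${cluster}"
  fi
  # %%r prints a literal %r, which ssh later expands to the remote user name.
  printf "ssh -o 'ForwardAgent yes' -p 3023 %%r@%s -s %s\n" "$proxy_host" "$subsystem"
}

# Local cluster, then remote cluster foo.com.
teleport_proxy_command proxy.example.com
teleport_proxy_command proxy.example.com foo.com
```

It could then be used as `ssh -o "ProxyCommand $(teleport_proxy_command proxy.example.com foo.com)" node.foo.com`.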

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers @benarent @ptgott @xinding33

Teleport Plugins @Joerger

WEB UI @kimlisa @ravicious @hatched @gzdunek

Main

For main, test with a role that has access to all resources.

Top Nav

Side Nav

Servers aka Nodes

Applications

Databases

Audit log

Users

Auth Connectors

Managed Clusters

Help & Support

Access Requests @Joerger

Creating Access Requests

  1. Create a role with limited permissions (defined below as allow-roles). This role allows you to see the Role screen and ssh into all nodes.
  2. Create another role with limited permissions (defined below as allow-users). This role session expires in 4 minutes, allows you to see Users screen, and denies access to all nodes.
  3. Create another role with no permissions other than being able to create requests (defined below as default)
  4. Create a user with role default assigned
  5. Create a few requests under this user to test pending/approved/denied state.
    kind: role
    metadata:
      name: allow-roles
    spec:
      allow:
        logins:
        - root
        node_labels:
          '*': '*'
        rules:
        - resources:
          - role
          verbs:
          - list
          - read
      options:
        max_session_ttl: 8h0m0s
    version: v3

    kind: role
    metadata:
      name: allow-users-short-ttl
    spec:
      allow:
        rules:
        - resources:
          - user
          verbs:
          - list
          - read
      deny:
        node_labels:
          '*': '*'
      options:
        max_session_ttl: 4m0s
    version: v3

    kind: role
    metadata:
      name: default
    spec:
      allow:
        request:
          roles:
          - allow-roles
          - allow-users
          suggested_reviewers:
          - random-user-1
          - random-user-2
      options:
        max_session_ttl: 8h0m0s
    version: v3
    • [x] #10642
    • [x] Verify input validation requires at least one role to be selected
    • [x] Verify you can select/input/modify reviewers
    • [x] Verify after creating a request, requests are listed in pending states
    • [x] Verify you can't review own requests

Viewing & Approving/Denying Requests

Create a user with the role reviewer that allows you to review all requests, and delete them.

kind: role
version: v3
metadata:
  name: reviewer
spec:
  allow:
    review_requests:
      roles: ['*']

Assuming Approved Requests

Access Request Waiting Room @kimlisa

Strategy Reason

Create the following role:

kind: role
metadata:
  name: waiting-room
spec:
  allow:
    request:
      roles:
      - <some other role to assign user after approval>
  options:
    max_session_ttl: 8h0m0s
    request_access: reason
    request_prompt: <some custom prompt to show in reason dialogue>
version: v3

Strategy Always

With the previous role you created from Strategy Reason, change request_access to always:

Strategy Optional

With the previous role you created from Strategy Reason, change request_access to optional:

Terminal

Node List Tab

Session Tab

Session Player

Invite and Reset Form

Login Form and Change Password

Multi-factor Authentication (mfa)

Create/modify teleport.yaml and set the following authentication settings under auth_service:

authentication:
  type: local
  second_factor: optional
  require_session_mfa: yes
  webauthn:
    rp_id: example.com

MFA invite, login, password reset, change password

MFA require auth

Go to Account Settings > Two-Factor Devices and register a new device

Using the same user as above:

MFA Management

Cloud

From your cloud staging account, change the field teleportVersion to the test version.

$ kubectl -n <namespace> edit tenant

Recovery Code Management

Invite/Reset

Recovery Flow: Add new mfa device

Recovery Flow: Change password

Recovery Email

RBAC

Create a role, with no allow.rules defined:

kind: role
metadata:
  name: rbac
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 8h0m0s
version: v3

Note: User has read/create access_request access to their own requests, despite resource settings

Add the following under spec.allow.rules to enable read access to the audit log:

- resources:
  - event
  verbs:
  - list

Add the following to enable read access to recorded sessions

- resources:
  - session
  verbs:
  - read

Add the following to enable read access to the roles

- resources:
  - role
  verbs:
  - list
  - read

Add the following to enable read access to the auth connectors

- resources:
  - auth_connector
  verbs:
  - list
  - read

Add the following to enable read access to users

- resources:
  - user
  verbs:
  - list
  - read

Add the following to enable read access to trusted clusters

- resources:
  - trusted_cluster
  verbs:
  - list
  - read
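
For reference, the end state of the RBAC steps above is the rbac role with every rule merged under spec.allow.rules. The sketch below only writes that role to a file; the file location is an arbitrary choice, and the plan itself still expects the rules to be added one at a time so each screen can be checked in isolation.

```shell
#!/bin/sh
# Write the rbac role with all rules from this section merged under spec.allow.rules.
# The file path is an arbitrary choice for this sketch.
role_file="${TMPDIR:-/tmp}/rbac.yaml"

cat > "$role_file" <<'EOF'
kind: role
metadata:
  name: rbac
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
    rules:
    - resources:
      - event
      verbs:
      - list
    - resources:
      - session
      verbs:
      - read
    - resources:
      - role
      verbs:
      - list
      - read
    - resources:
      - auth_connector
      verbs:
      - list
      - read
    - resources:
      - user
      verbs:
      - list
      - read
    - resources:
      - trusted_cluster
      verbs:
      - list
      - read
  options:
    max_session_ttl: 8h0m0s
version: v3
EOF

echo "wrote $role_file"
# To load it into a test cluster (not run here): tctl create -f "$role_file"
```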

Performance/Soak Test @fspmarshall @espadolini @rosstimothy

Using the tsh bench tool, run the soak tests and benchmark tests on the following configurations:

Soak Tests

Run a 4-hour soak test with a mix of interactive and non-interactive sessions:

tsh bench --duration=4h user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --duration=4h user@teleport-monster-6757d7b487-x226b ps uax

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.
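
One way to make the leak check less subjective is to compare two metric snapshots taken before and after the soak run. The sketch below is an illustration only: the helper names, the 20% threshold, and the inline sample values are hypothetical, and it assumes the standard Go client metric names `go_goroutines` and `process_open_fds` are exposed on the diagnostic endpoint.

```shell
#!/bin/sh
# Print the value of an unlabelled metric from a Prometheus text-format snapshot.
#   $1 = snapshot file, $2 = metric name
metric_value() {
  awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Succeed (exit 0) if the metric grew by more than $4 percent between snapshots.
#   $1 = before file, $2 = after file, $3 = metric name, $4 = allowed growth in percent
grew_more_than() {
  before="$(metric_value "$1" "$3")"
  after="$(metric_value "$2" "$3")"
  awk -v b="$before" -v a="$after" -v p="$4" 'BEGIN { exit !(a > b * (1 + p / 100)) }'
}

# Inline sample snapshots (made-up values). In a real run they would be scraped, e.g.
#   curl -s http://127.0.0.1:3000/metrics > before.prom   # assumed --diag-addr
snap_dir="$(mktemp -d)"
cat > "$snap_dir/before.prom" <<'EOF'
# TYPE go_goroutines gauge
go_goroutines 120
process_open_fds 64
EOF
cat > "$snap_dir/after.prom" <<'EOF'
# TYPE go_goroutines gauge
go_goroutines 480
process_open_fds 66
EOF

for m in go_goroutines process_open_fds; do
  if grew_more_than "$snap_dir/before.prom" "$snap_dir/after.prom" "$m" 20; then
    echo "possible leak: $m"
  else
    echo "ok: $m"
  fi
done
```

With the sample values, goroutines are flagged and open file descriptors are not. A single pair of snapshots cannot distinguish a leak from load that has not yet drained, so the second snapshot is best taken some time after the bench run has finished.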

Breaking load tests

Load the system with tsh bench up to capacity and publish the maximum number of concurrent sessions for interactive and non-interactive tsh bench loads.

Teleport with Cloud Providers

AWS @xacrimon

GCP @xacrimon

IBM @r0mant

Application Access @gabrielcorado @Tener

Database Access

TLS Routing @smallinsky

Desktop Access @zmb3 @lxea @ibeckermayer

webvictim commented 2 years ago

@r0mant You have me assigned to IBM, should probably pick someone else :D

r0mant commented 2 years ago

@webvictim I can pick someone else but I'm told you're the only one who knows how to do it 😄 Will you be able to help out the person who'll be doing it? Would be good opportunity to transfer the knowledge too.

Tener commented 2 years ago

@r0mant are we sharing particular binaries for testing or is everyone meant to build their own once a release branch is made?

russjones commented 2 years ago

@Tener Yes, once we cut off branch/v9 @r0mant or @zmb3 will share a tag with the team to start testing with.

russjones commented 2 years ago

@r0mant Before we can cut off branch/v9 we have to update e/Makefile to add tbot target: https://github.com/gravitational/teleport.e/pull/401.

xacrimon commented 2 years ago

Ran the utmp tests and while they didn't suffer regressions from v8 I confirmed the issue fixed by #10460 on RHEL 8.4 as reported in the corresponding issue. I know we didn't cut branch/v9 yet but had to do a full manual test run of the feature anyway for the PR.

Tener commented 2 years ago

During my tests, I ran into an issue with tctl users add: https://github.com/gravitational/teleport/issues/10574.

For sure, the docs need to be updated, but we have also removed some functionality which I'm not sure is a correct/good thing. Details in the ticket.

Tener commented 2 years ago

Another issue: https://github.com/gravitational/teleport/issues/10576

This feels like a corner case that has always been there, but we should probably address it sooner or later.

Tener commented 2 years ago

I marked "Sessions can be recorded at the proxy" as complete, because it kind of works, but at the same time it causes some issues. Here are the details: https://github.com/gravitational/teleport/issues/10586

webvictim commented 2 years ago

> @webvictim I can pick someone else but I'm told you're the only one who knows how to do it 😄 Will you be able to help out the person who'll be doing it? Would be good opportunity to transfer the knowledge too.

All I ever did was use the IBM Cloud account we have in 1Password along with the instructions in the docs from https://goteleport.com/docs/setup/deployments/ibm/

I think the best solution would be to assign this to someone else and if they run into issues, I can definitely help them out and provide some guidance based on my experience. If we can't deploy this following our own written guide then there's definitely a bigger issue to solve :)

rosstimothy commented 2 years ago

etcd load testing

Soak Tests

----Non-IoT Node Test ----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-5f88cb8c68-mm4xd ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         115 ms
50         119 ms
75         123 ms
90         129 ms
95         137 ms
99         185 ms
100        700 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@node-5f88cb8c68-mm4xd ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         119 ms
50         123 ms
75         128 ms
90         135 ms
95         142 ms
99         184 ms
100        393 ms

----IoT Node Test ----
tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-88c55fcb5-9fr7b ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         116 ms
50         120 ms
75         125 ms
90         132 ms
95         139 ms
99         177 ms
100        413 ms

tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@iot-node-88c55fcb5-9fr7b ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         119 ms
50         125 ms
75         132 ms
90         141 ms
95         149 ms
99         182 ms
100        799 ms

10,000 IoT Node Scaling Test

[Image: 9 0_10k_IoT_etcd]

10,000 Non-IoT Node Scaling Test

[Image: 9 0_10k_non_IoT_etcd]

500 Trusted Clusters Scaling Test

[Image: 9 0_500_TC]

russjones commented 2 years ago

Aggregate last 4 releases.

Backend Cluster Size Mode PTY 6.2 7.0 8.0 9.0
etcd 10k Regular No ~49183 ms~ ~56383 ms~ 4475 ms 3335 ms 700 ms
etcd 10k Regular Yes ~59423 ms~ ~61215 ms~ 4507 ms 4647 ms 393 ms
etcd 10k Tunnel No ~65439 ms~ ~53759 ms~ 4451 ms 4259 ms 143 ms
etcd 10k Tunnel Yes ~64924 ms~ ~48223 ms~ 4435 ms 3143 ms 799 ms
DynamoDB 10k Regular No 5147 ms
DynamoDB 10k Regular Yes 222 ms
DynamoDB 10k Tunnel No 235 ms
DynamoDB 10k Tunnel Yes 198 ms
DynamoDB 1 Regular No 2471 ms 1824 ms
DynamoDB 1 Regular Yes 2081 ms 1483 ms
DynamoDB 1 Tunnel No 826 ms 2125 ms
DynamoDB 1 Tunnel Yes 518 ms 2002 ms

russjones commented 2 years ago

@rosstimothy One odd thing is that the difference between interactive and non-interactive is so much and it inverts between regular and tunnel connections.

kimlisa commented 2 years ago

rosstimothy commented 2 years ago

dynamo load testing

Soak Tests

----Non-IoT Node Test----
tsh --insecure --proxy=proxy:3080 -i /etc/teleport/auth bench --duration=30m root@node-7f454d4dbd-cxnrk ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         127 ms
50         131 ms
75         135 ms
90         139 ms
95         142 ms
99         160 ms
100        5147 ms

tsh --insecure --proxy=proxy:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@node-7f454d4dbd-cxnrk ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         128 ms
50         132 ms
75         136 ms
90         140 ms
95         143 ms
99         162 ms
100        222 ms

----IoT Node Test----
tsh --insecure --proxy=proxy:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-86666788fc-k45wn ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         130 ms
50         137 ms
75         143 ms
90         147 ms
95         149 ms
99         158 ms
100        235 ms

tsh --insecure --proxy=proxy:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@iot-node-86666788fc-k45wn ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         131 ms
50         138 ms
75         144 ms
90         148 ms
95         150 ms
99         156 ms
100        198 ms

10,000 IoT Node Scaling Test

[Image: 9 0_10k_IoT_dynamo]

10,000 Non-IoT Node Scaling Test

[Image: 9 0_10k_non_IoT_dynamo]

500 Trusted Clusters Scaling Test

[Image: 9 0_500_TC_dynamo]

russjones commented 2 years ago

@rosstimothy DynamoDB, 10k, non-IoT, no-PTY 100% looks off.

Percentile Response Duration
---------- -----------------
25         127 ms
50         131 ms
75         135 ms
90         139 ms
95         142 ms
99         160 ms
100        5147 ms
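
The histogram above looks off because the 100th percentile is far beyond the 99th. A rough sketch of automating that check is below; the function name `tail_outlier` and the factor of 10 are hypothetical choices, not something tsh bench provides.

```shell
#!/bin/sh
# Read "Percentile  Response Duration" rows on stdin and report whether the
# 100th percentile exceeds factor ($1) times the 99th percentile.
tail_outlier() {
  awk -v factor="$1" '
    $1 == "99"  { p99 = $2 }
    $1 == "100" { max = $2 }
    END {
      if (p99 == "" || max == "") { print "incomplete histogram"; exit }
      if (max + 0 > factor * p99) {
        printf "outlier: max %d ms vs p99 %d ms\n", max, p99
      } else {
        printf "ok: max %d ms vs p99 %d ms\n", max, p99
      }
    }'
}

# DynamoDB, 10k, non-IoT, no PTY (values from the histogram above).
tail_outlier 10 <<'EOF'
Percentile Response Duration
---------- -----------------
99         160 ms
100        5147 ms
EOF

# DynamoDB, 10k, non-IoT, with PTY.
tail_outlier 10 <<'EOF'
Percentile Response Duration
---------- -----------------
99         162 ms
100        222 ms
EOF
```

The first histogram is reported as an outlier and the second as ok. A single slow request out of 17999 is enough to trigger this, so a flagged run is a prompt to rerun or inspect logs rather than proof of a regression.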

Tener commented 2 years ago

This feels like worth fixing prior to release: https://github.com/gravitational/teleport/issues/10794