gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 8.0 Test Plan #8665

Closed russjones closed 2 years ago

russjones commented 2 years ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh install of the version to be released as well as an upgrade of the previous version of Teleport.

User accounting @xacrimon

Combinations @tcsc

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.
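A combination matrix of this kind can be enumerated mechanically rather than by hand. The sketch below is illustrative only: the three axes and their values are hypothetical placeholders (chosen so that the product happens to be 12), not the plan's actual list, and should be replaced with the dimensions under test.

```python
from itertools import product

# Hypothetical test dimensions; substitute the axes the plan actually calls for.
CLIENTS = ["tsh", "ssh", "web"]
NODE_TYPES = ["teleport", "openssh"]
CLUSTERS = ["root", "leaf"]


def combinations():
    """Return every (client, node_type, cluster) tuple to test."""
    return list(product(CLIENTS, NODE_TYPES, CLUSTERS))


if __name__ == "__main__":
    # Print a checklist line per combination.
    for client, node_type, cluster in combinations():
        print(f"[ ] {client} -> {node_type} node in {cluster} cluster")
```

Generating the list this way makes it harder to skip a cell when a new axis value is added.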

Teleport with EKS/GKE @smallinsky

Teleport with multiple Kubernetes clusters @r0mant

Note: you can use GKE or EKS or minikube to run Kubernetes clusters. Minikube is the only caveat - it's not reachable publicly so don't run a proxy there.

Teleport with FIPS mode @russjones

ACME @Joerger

Migrations @r0mant @russjones

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com
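The three OpenSSH templates differ only in whether the agent is also forwarded to the node and whether a remote cluster name is appended to the subsystem request. A minimal sketch of a helper that assembles them follows; the function names `proxy_command` and `ssh_argv` are made up for illustration, and the default port 3023 is taken from the templates above.

```python
import shlex


def proxy_command(proxy_host, cluster=None, port=3023, forward_agent=True):
    """Build the ProxyCommand value used in the templates above.

    %r, %h and %p are expanded by OpenSSH itself, so they are left as-is.
    """
    parts = ["ssh"]
    if forward_agent:
        parts += ["-o", "'ForwardAgent yes'"]
    subsystem = "proxy:%h:%p"
    if cluster:
        # For a remote cluster, the cluster name is appended after '@'.
        subsystem += f"@{cluster}"
    parts += ["-p", str(port), f"%r@{proxy_host}", "-s", subsystem]
    return " ".join(parts)


def ssh_argv(node, proxy_host, cluster=None, forward_to_node=False):
    """Return the full argv for connecting to `node` through the proxy."""
    argv = ["ssh"]
    if forward_to_node:
        # Second ForwardAgent option: forwards the agent on to the target node.
        argv += ["-o", "ForwardAgent yes"]
    argv += ["-o", f"ProxyCommand {proxy_command(proxy_host, cluster)}", node]
    return argv


if __name__ == "__main__":
    # shlex.join produces a shell-safe rendering of the argv list.
    print(shlex.join(ssh_argv("node.example.com", "proxy.example.com")))
```

Returning an argv list rather than a single string lets a test harness pass it straight to `subprocess.run` without another layer of shell quoting.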

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com
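The `tsh` variants can be assembled the same way. This is a small sketch with a hypothetical helper name (`tsh_ssh_argv`); it uses only the flags that appear in the templates above.

```python
def tsh_ssh_argv(proxy, user, node, port=22, cluster=None, forward_agent=False):
    """Return the argv for `tsh ssh` as used in the templates above."""
    argv = ["tsh", f"--proxy={proxy}", f"--user={user}", "--insecure", "ssh"]
    if forward_agent:
        argv.append("-A")  # forward the agent to the target node
    if cluster:
        argv.append(f"--cluster={cluster}")  # node lives in a remote cluster
    argv += ["-p", str(port), node]
    return argv


if __name__ == "__main__":
    print(" ".join(tsh_ssh_argv("proxy.example.com", "alice", "node.example.com")))
```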

Teleport with SSO Providers @benarent

Teleport Plugins @Joerger

WEB UI @kimlisa @rudream @gzdunek

Main

For main, test with a role that has access to all resources.

Top Nav

Side Nav

Servers aka Nodes

Applications

Databases

Audit log

Users

Auth Connectors

Auth Connectors Card Icons

Roles

Managed Clusters

Help & Support

Access Requests

Creating Access Requests

  1. Create a role with limited permissions (defined below as allow-roles). This role allows you to see the Role screen and ssh into all nodes.
  2. Create another role with limited permissions (defined below as allow-users). This role session expires in 4 minutes, allows you to see Users screen, and denies access to all nodes.
  3. Create another role with no permissions other than being able to create requests (defined below as default)
  4. Create a user with role default assigned
  5. Create a few requests under this user to test pending/approved/denied state.
    kind: role
    metadata:
      name: allow-roles
    spec:
      allow:
        logins:
        - root
        node_labels:
          '*': '*'
        rules:
        - resources:
          - role
          verbs:
          - list
          - read
      options:
        max_session_ttl: 8h0m0s
    version: v3

    kind: role
    metadata:
      name: allow-users
    spec:
      allow:
        rules:
        - resources:
          - user
          verbs:
          - list
          - read
      deny:
        node_labels:
          '*': '*'
      options:
        max_session_ttl: 4m0s
    version: v3

    kind: role
    metadata:
      name: default
    spec:
      allow:
        request:
          roles:
          - allow-roles
          - allow-users
          suggested_reviewers:
          - random-user-1
          - random-user-2
      options:
        max_session_ttl: 8h0m0s
    version: v3
    • [x] Verify that creating a new request works
    • [x] Verify that under requestable roles, only allow-roles and allow-users are listed
    • [x] Verify input validation requires at least one role to be selected
    • [x] Verify you can select/input/modify reviewers
    • [x] Verify after creating, requests are listed in pending states
    • [x] Verify you can't review own requests
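Two of the checks above (only `allow-roles` and `allow-users` are requestable, and at least one role must be selected) can be expressed against the parsed role. The following is a simplified sketch, not Teleport's implementation: it assumes the `default` role has already been loaded into a dict, matches role names exactly, and the helper names are hypothetical.

```python
# The `default` role from the steps above, as it would look after YAML parsing.
DEFAULT_ROLE = {
    "kind": "role",
    "metadata": {"name": "default"},
    "spec": {
        "allow": {
            "request": {
                "roles": ["allow-roles", "allow-users"],
                "suggested_reviewers": ["random-user-1", "random-user-2"],
            }
        },
        "options": {"max_session_ttl": "8h0m0s"},
    },
    "version": "v3",
}


def requestable_roles(role):
    """Role names a holder of `role` may request (exact names only)."""
    request = role.get("spec", {}).get("allow", {}).get("request", {})
    return sorted(request.get("roles", []))


def validate_request(role, wanted):
    """Mirror the form checks: at least one role, and only requestable ones."""
    if not wanted:
        raise ValueError("select at least one role")
    unknown = set(wanted) - set(requestable_roles(role))
    if unknown:
        raise ValueError(f"not requestable: {sorted(unknown)}")
    return list(wanted)
```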

Viewing & Approving/Denying Requests

Create a user with the role reviewer that allows you to review all requests, and delete them.

kind: role
version: v3
metadata:
  name: reviewer
spec:
  allow:
    review_requests:
      roles: ['*']

Assuming Approved Requests

Access Request Waiting Room

Strategy Reason

Create the following role:

kind: role
metadata:
  name: restrict
spec:
  allow:
    request:
      roles:
      - <some other role to assign user after approval>
  options:
    max_session_ttl: 8h0m0s
    request_access: reason
    request_prompt: <some custom prompt to show in reason dialogue>
version: v3

Strategy Always

With the previous role you created from Strategy Reason, change request_access to always:

Strategy Optional

With the previous role you created from Strategy Reason, change request_access to optional:

Terminal

Node List Tab

Session Tab

Session Player

Invite Form

Login Form

Multi-factor Authentication (mfa)

Create/modify teleport.yaml and set the following authentication settings under auth_service

authentication:
  type: local
  second_factor: optional
  require_session_mfa: yes
  webauthn:
    rp_id: example.com

MFA invite, login, password reset, change password

MFA require auth

Go to Account Settings > Two-Factor Devices and register a new device

Using the same user as above:

MFA Management

Cloud

Invite/Reset

Recovery Code Management

Recovery Flow: Add new mfa device

Recovery Flow: Change password

Recovery Email

RBAC

Create a role, with no allow.rules defined:

kind: role
metadata:
  name: test
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 8h0m0s
version: v3

Note: User has read/create access_request access to their own requests, despite resource settings

Add the following under spec.allow.rules to enable read access to the audit log:

  - resources:
    - event
    verbs:
    - list
Add the following to enable read access to recorded sessions

  - resources:
    - session
    verbs:
    - read

Add the following to enable read access to the roles

  - resources:
    - role
    verbs:
    - list
    - read

Add the following to enable read access to the auth connectors

  - resources:
    - auth_connector
    verbs:
    - list
    - read

Add the following to enable read access to users

  - resources:
    - user
    verbs:
    - list
    - read

Add the following to enable read access to trusted clusters

  - resources:
    - trusted_cluster
    verbs:
    - list
    - read
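The RBAC steps above all follow one pattern: start from the `test` role with no `allow.rules`, add one rule, and confirm that the matching screen becomes readable. A rough sketch of that pattern is shown below. It is deliberately simplified (exact string matches only; wildcards and deny rules are ignored) and the helper names are hypothetical, so treat it as a way to reason about expected results rather than as Teleport's evaluation logic.

```python
import copy

# The `test` role from above, as it would look after YAML parsing.
TEST_ROLE = {
    "kind": "role",
    "metadata": {"name": "test"},
    "spec": {
        "allow": {
            "app_labels": {"*": "*"},
            "logins": ["root"],
            "node_labels": {"*": "*"},
        },
        "options": {"max_session_ttl": "8h0m0s"},
    },
    "version": "v3",
}


def with_rule(role, resources, verbs):
    """Return a copy of `role` with one more rule under spec.allow.rules."""
    updated = copy.deepcopy(role)
    rules = updated["spec"]["allow"].setdefault("rules", [])
    rules.append({"resources": list(resources), "verbs": list(verbs)})
    return updated


def allows(role, resource, verb):
    """True if any allow rule grants `verb` on `resource` (exact match only)."""
    rules = role.get("spec", {}).get("allow", {}).get("rules", [])
    return any(
        resource in rule.get("resources", []) and verb in rule.get("verbs", [])
        for rule in rules
    )
```

For example, `allows(with_rule(TEST_ROLE, ["event"], ["list"]), "event", "list")` is true, while the unmodified `TEST_ROLE` grants nothing on `event`.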

Performance/Soak Test @fspmarshall @rosstimothy

Using tsh bench tool, perform the soak tests and benchmark tests on the following configurations:

Soak Tests

Run a 4-hour soak test with a mix of interactive and non-interactive sessions:

tsh bench --duration=4h user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --duration=4h user@teleport-monster-6757d7b487-x226b ps uax

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.
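One way to make "no leaks" less subjective is to compare the start and end of each metric series collected during the soak run. The sketch below is a heuristic only, not part of Teleport or Prometheus: the function name, the quartile comparison, and the 10% tolerance are all assumptions to tune against your own baseline.

```python
from statistics import mean


def looks_like_leak(samples, tolerance=0.10):
    """Flag a gauge whose late average exceeds its early average by > tolerance.

    `samples` is a time-ordered list of readings (e.g. goroutine counts).
    """
    if len(samples) < 8:
        raise ValueError("need at least 8 samples")
    quarter = len(samples) // 4
    early = mean(samples[:quarter])   # first quarter of the run
    late = mean(samples[-quarter:])   # last quarter of the run
    if early == 0:
        return late > 0
    return (late - early) / early > tolerance
```

A flat series is not flagged, while a steadily climbing one is; a flagged metric still needs manual inspection, since a warm-up ramp can also trip the check.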

Breaking load tests

Load the system with tsh bench to capacity and publish the maximum number of concurrent sessions sustained under interactive and non-interactive tsh bench loads.

Teleport with Cloud Providers

AWS @timothyb89

GCP @xacrimon

IBM @xacrimon

Application Access @r0mant @smallinsky

Database Access @r0mant @smallinsky

Desktop Access @zmb3

TLS Routing @r0mant

zmb3 commented 2 years ago

Desktop Access: Seeing the arrow keys map incorrectly on macOS client. Same behavior in Chrome, Safari, and Firefox. This is using the built-in MacBook keyboard.

This is addressed with https://github.com/gravitational/teleport/pull/8791 which still needs to be backported (https://github.com/gravitational/teleport/pull/8813).

r0mant commented 2 years ago

Potential K8s access issue when used via TLS routing. I get:

➜  ~ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for mbp, root.gravitational.io, host.minikube.internal, localhost, remote.kube.proxy.teleport.cluster.local, host.minikube.internal, *.teleport.cluster.local, teleport.cluster.local, *.root.gravitational.io, *.host.minikube.internal, not kube.

I'm using kubernetes_service with kubeconfig_file (if that matters). cc @smallinsky
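The error says the client asked for the server name `kube`, which none of the certificate's names cover. A simplified sketch of the usual hostname-matching rule (a leading `*.` stands for exactly one label) shows why; this is an illustration, not the Go `crypto/x509` code, and the function names are hypothetical. The name list is copied from the error message with the duplicate entry dropped.

```python
def san_matches(hostname, pattern):
    """Match a DNS name against one certificate name entry.

    A leading "*" label stands for exactly one label of the hostname.
    """
    host_labels = hostname.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


# Names listed in the error message above.
SANS = [
    "mbp", "root.gravitational.io", "host.minikube.internal", "localhost",
    "remote.kube.proxy.teleport.cluster.local", "*.teleport.cluster.local",
    "teleport.cluster.local", "*.root.gravitational.io",
    "*.host.minikube.internal",
]


def cert_valid_for(hostname, sans=SANS):
    """True if any listed name covers `hostname`."""
    return any(san_matches(hostname, san) for san in sans)
```

Under this rule `kube` matches nothing, whereas a name such as `kube.teleport.cluster.local` would be covered by the `*.teleport.cluster.local` entry.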

zmb3 commented 2 years ago

Desktop Access: disconnects due to client idle timeout fail to emit a disconnect event to the audit log (but otherwise disconnect correctly and emit a desktop session end event correctly). This is due to attempting to use the desktop ID as the event's "server ID."

Rejecting audit event client.disconnect("") from "3b581a11-b94a-4274-860b-8266868ca42e": 
server "3b581a11-b94a-4274-860b-8266868ca42e" can't emit event with 
server ID "foo-example-com". 

The server ID must be the Windows Desktop Service's HostUUID, otherwise the ValidateServerMetadata check will fail.
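The rejection can be pictured as a simple equality check between the identity of the emitting service and the server ID carried in the event. The sketch below only illustrates that rule as described in this comment; it is not the actual `ValidateServerMetadata` code, and the function and exception names are hypothetical.

```python
class AuditEventRejected(Exception):
    """Raised when an event claims a server ID other than its emitter's."""


def check_event_server_id(emitter_host_uuid, event_server_id):
    """Reject an event whose server ID differs from the emitter's host UUID."""
    if event_server_id != emitter_host_uuid:
        raise AuditEventRejected(
            f'server "{emitter_host_uuid}" can\'t emit event '
            f'with server ID "{event_server_id}"'
        )
```

With the values from the log above, the emitter is `3b581a11-b94a-4274-860b-8266868ca42e` and the event carries `foo-example-com` (the desktop ID), so the check fails.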

Fixed in #8828, which needs merge + backport.

rosstimothy commented 2 years ago

etcd Load Tests

3 auth 2 proxy 1 node non-IoT

tsh bench --duration=30m root@loadtest-7fbf6bcbfc-b5l7l ls

* Requests originated: 17988
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1289 ms
50         1301 ms
75         1319 ms
90         1374 ms
95         1600 ms
99         1648 ms
100        2053 ms

tsh bench --interactive --duration=30m root@loadtest-7fbf6bcbfc-b5l7l ps aux

* Requests originated: 17987
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1301 ms
50         1314 ms
75         1330 ms
90         1352 ms
95         1388 ms
99         1724 ms
100        3553 ms
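Since every report in this thread has the same shape, the numbers can be pulled out with a small parser for comparison across runs. This is a sketch that assumes the output format stays exactly as pasted here; the sample is the first report above, and `parse_bench` and `requests_per_second` are hypothetical helper names.

```python
import re

# First report from this comment, used as sample input.
SAMPLE = """\
* Requests originated: 17988
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1289 ms
50         1301 ms
75         1319 ms
90         1374 ms
95         1600 ms
99         1648 ms
100        2053 ms
"""

ROW = re.compile(r"^(\d+)\s+(\d+) ms\s*$")
COUNTER = re.compile(r"^\* Requests (originated|failed): (\d+)\s*$")


def parse_bench(text):
    """Parse one report into its counters and a percentile -> ms map."""
    result = {"originated": None, "failed": None, "percentiles": {}}
    for line in text.splitlines():
        counter = COUNTER.match(line)
        if counter:
            result[counter.group(1)] = int(counter.group(2))
            continue
        row = ROW.match(line)
        if row:
            result["percentiles"][int(row.group(1))] = int(row.group(2))
    return result


def requests_per_second(report, duration_s):
    """Average request rate over the run (30m = 1800 s for these tests)."""
    return report["originated"] / duration_s
```

For the sample, `requests_per_second(parse_bench(SAMPLE), 1800)` works out to just under 10 requests per second, which is consistent across the reports in this thread.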

3 auth 2 proxy 1 node IoT

tsh bench --duration=30m root@loadtest-7fbf6bcbfc-ww7h8 ls

* Requests originated: 17987
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1314 ms
50         1326 ms
75         1340 ms
90         1359 ms
95         1381 ms
99         1561 ms
100        3329 ms

tsh bench --interactive --duration=30m root@loadtest-7fbf6bcbfc-ww7h8 ps aux

* Requests originated: 17987
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1302 ms
50         1314 ms
75         1328 ms
90         1344 ms
95         1363 ms
99         1457 ms
100        3479 ms

10k node non-IoT

[image: 8.0.0-beta.2-10k-etcd-non-IoT]

Soak tests:

tsh bench --duration=30m root@loadtest-7fbf6bcbfc-52zdr ls 

* Requests originated: 17982
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1481 ms
50         1505 ms
75         1541 ms
90         1623 ms
95         1714 ms
99         1920 ms
100        3335 ms

tsh bench --interactive --duration=30m root@loadtest-7fbf6bcbfc-klws7 ps aux

* Requests originated: 17985
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1496 ms
50         1515 ms
75         1543 ms
90         1624 ms
95         1720 ms
99         1978 ms
100        4647 ms

10k nodes IoT

[image: 8.0.0-beta.2-10k-etcd-IoT]

Soak tests:

tsh bench --duration=30m root@loadtest-7fbf6bcbfc-47hh9 ls

* Requests originated: 17986
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1499 ms
50         1530 ms
75         1642 ms
90         1834 ms
95         1990 ms
99         2257 ms
100        4259 ms

tsh bench --interactive --duration=30m root@loadtest-7fbf6bcbfc-6gjnv ps aux

* Requests originated: 17982
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         1527 ms
50         1558 ms
75         1682 ms
90         1897 ms
95         2035 ms
99         2249 ms
100        3143 ms

500 Trusted Clusters

[image]

timothyb89 commented 2 years ago

Tested out several of the AWS deployment examples. Both Terraform examples worked fine with the 8.0.0-beta.2 AMIs (tested the simple example with the OSS AMIs, and the HA example with Enterprise).

The CloudFormation example seems very broken. I'm told it hasn't been maintained in quite a while and am not sure if we care to validate it here. Some of its issues:

zmb3 commented 2 years ago

Desktop access: heartbeats for multiple Windows hosts discovered via LDAP all report the same host.

Issue #8846 Fixed in #8847

russjones commented 2 years ago

Desktop Access: Investigate required libraries.

Issue https://github.com/gravitational/teleport/issues/8765

timothyb89 commented 2 years ago

Just filed https://github.com/gravitational/teleport/issues/8860 for my issues with the CloudFormation example, though given https://github.com/gravitational/teleport/issues/8665#issuecomment-961368649, I wonder if I hit a similar issue.

russjones commented 2 years ago

Aggregate last 3 releases.

Backend Cluster Size Mode PTY 6.2 7.0 8.0
etcd 10k Regular No ~49183 ms~ ~56383 ms~ 4475 ms 3335 ms
etcd 10k Regular Yes ~59423 ms~ ~61215 ms~ 4507 ms 4647 ms
etcd 10k Tunnel No ~65439 ms~ ~53759 ms~ 4451 ms 4259 ms
etcd 10k Tunnel Yes ~64924 ms~ ~48223 ms~ 4435 ms 3143 ms
DynamoDB 10k Regular No
DynamoDB 10k Regular Yes
DynamoDB 10k Tunnel No
DynamoDB 10k Tunnel Yes
DynamoDB 1 Regular No 2471 ms 1824 ms
DynamoDB 1 Regular Yes 2081 ms 1483 ms
DynamoDB 1 Tunnel No 826 ms 2125 ms
DynamoDB 1 Tunnel Yes 518 ms 2002 ms

tcsc commented 2 years ago

Re: Check agent forwarding is correct based on role and proxy mode.

rec mode  fwd allowed by role  fwd not allowed by role
--------  -------------------  -----------------------
off       allowed              disallowed
proxy     allowed              conn failed
node      allowed              disallowed

Caveats:

I'm still not sure what Proxy Mode means in this context, but I've interpreted it to mean the session_recording mode. Even so, I am still not sure what the correct Agent Forwarding behaviour is for the different recording modes, or even how they should be expected to affect the Agent Forwarding modes.

Fully aware that I may have been testing the wrong thing, I considered something that looks like this to be "allowed":

[image: agent-fwd-allowed]

..and something that looks like this was considered "disallowed":

[image: agent-fwd-denied]

smallinsky commented 2 years ago

ALPN Proxy + Reverse Tunnel fails when ACME is used:

Issue https://github.com/gravitational/teleport/issues/8665#issuecomment-961368649 Fixed in https://github.com/gravitational/teleport/pull/8869

fspmarshall commented 2 years ago

Dynamo (IoT)

10k Scaling

[image: 10k-iot-dynamo]

Soak

tsh bench --duration=30m root@ip-172-31-11-250-us-west-2-compute-internal ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         152 ms            
50         166 ms            
75         181 ms            
90         194 ms            
95         208 ms            
99         283 ms            
100        2125 ms

tsh bench --interactive --duration=30m root@ip-172-31-11-250-us-west-2-compute-internal ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         159 ms            
50         169 ms            
75         179 ms            
90         192 ms            
95         207 ms            
99         284 ms            
100        2002 ms

Dynamo (non-IoT)

10k Scaling

[image: 10k-non-iot-dynamo]

Soak

tsh bench --duration=30m root@ip-172-31-11-250-us-west-2-compute-internal ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         145 ms            
50         154 ms            
75         164 ms            
90         176 ms            
95         188 ms            
99         237 ms            
100        1824 ms

tsh bench --interactive --duration=30m root@ip-172-31-11-250-us-west-2-compute-internal ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         150 ms            
50         161 ms            
75         174 ms            
90         186 ms            
95         195 ms            
99         238 ms            
100        1483 ms

Dynamo (500 Trusted Cluster)

[image: 500-tc-dynamo]


Notes