gravitational / teleport

The easiest and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 6.2 Test Plan #6651

Closed russjones closed 3 years ago

russjones commented 3 years ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh install of the version to be released as well as an upgrade of the previous version of Teleport.

Combinations @fspmarshall @quinqu

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with multiple Kubernetes clusters @xacrimon @webvictim

Note: you can use GKE, EKS, or minikube to run Kubernetes clusters. The one caveat is minikube: it is not publicly reachable, so don't run a proxy there.

Helm charts

Migrations @tcsc @nklaassen

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport Plugins @awly @Joerger

WEB UI @kimlisa @alex-kovoy

Main

For main, test with admin role that has access to all resources.

Top Nav

Side Nav

Servers aka Nodes

Applications

Databases

Active Sessions

Audit log

Users

Auth Connectors

Auth Connectors Card Icons

Roles

Managed Clusters

Help & Support

Access Requests

Creating Access Requests

  1. Create a role with limited permissions (defined below as allow-roles). This role allows you to see the Role screen and ssh into all nodes.
  2. Create another role with limited permissions (defined below as allow-users). Sessions for this role expire in 4 minutes; it allows you to see the Users screen and denies access to all nodes.
  3. Create another role with no permissions other than being able to create requests (defined below as default).
  4. Create a user with role default assigned
  5. Create a few requests under this user to test pending/approved/denied state.
    kind: role
    metadata:
      name: allow-roles
    spec:
      allow:
        logins:
        - root
        node_labels:
          '*': '*'
        rules:
        - resources:
          - role
          verbs:
          - list
          - read
      options:
        max_session_ttl: 8h0m0s
    version: v3

    kind: role
    metadata:
      name: allow-users
    spec:
      allow:
        rules:
        - resources:
          - user
          verbs:
          - list
          - read
      deny:
        node_labels:
          '*': '*'
      options:
        max_session_ttl: 4m0s
    version: v3

    kind: role
    metadata:
      name: default
    spec:
      allow:
        request:
          roles:
          - allow-roles
          - allow-users
          suggested_reviewers:
          - random-user-1
          - random-user-2
      options:
        max_session_ttl: 8h0m0s
    version: v3

    • [x] Verify that creating a new request works
    • [x] Verify that under requestable roles, only allow-roles and allow-users are listed
    • [x] Verify input validation requires at least one role to be selected
    • [x] Verify you can select/input/modify reviewers
    • [x] Verify after creating, requests are listed in pending states
    • [x] Verify you can't review own requests

Viewing & Approving/Denying Requests

Create a user with the role reviewer that allows you to review all requests, and delete them.

kind: role
version: v3
metadata:
  name: reviewer
spec:
  allow:
    review_requests:
      roles: ['*']

Assuming Approved Requests

Access Request Waiting Room

Strategy Reason

Create the following role:

kind: role
metadata:
  name: restrict
spec:
  allow:
    request:
      roles:
      - <some other role to assign user after approval>
  options:
    max_session_ttl: 8h0m0s
    request_access: reason
    request_prompt: <some custom prompt to show in reason dialogue>
version: v3

Strategy Always

With the previous role you created from Strategy Reason, change request_access to always:

Strategy Optional

With the previous role you created from Strategy Reason, change request_access to optional:

Account

Terminal

Node List Tab

Session Tab

Session Player

Invite Form

Login Form

Multi-factor Authentication (mfa)

Create/modify teleport.yaml and set the following authentication settings under auth_service:

authentication:
  type: local
  second_factor: optional
  require_session_mfa: yes
  u2f:
    app_id: https://example.com:443
    facets:
    - https://example.com:443
    - https://example.com
    - example.com:443
    - example.com

MFA create, login, password reset

MFA require auth

Through the CLI, tsh login and register a u2f key with tsh mfa add (not supported in UI yet).

Using the same user as above:

RBAC

Create a role with no allow.rules defined:

kind: role
metadata:
  name: test
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 8h0m0s
version: v3

Note: a user can always read and create their own access requests (access_request), regardless of the resource settings.

Add the following under spec.allow.rules to enable read access to the audit log:

- resources:
  - event
  verbs:
  - list

Add the following to enable read access to recorded sessions:

- resources:
  - session
  verbs:
  - read

Add the following to enable read access to the roles:

- resources:
  - role
  verbs:
  - list
  - read

Add the following to enable read access to the auth connectors:

- resources:
  - auth_connector
  verbs:
  - list
  - read

Add the following to enable read access to users:

- resources:
  - user
  verbs:
  - list
  - read

Add the following to enable read access to trusted clusters:

- resources:
  - trusted_cluster
  verbs:
  - list
  - read

Performance/Soak Test @xacrimon @fspmarshall

Using the tsh bench tool, perform the soak tests and benchmark tests on the following configurations:

Soak Tests

Run a 4-hour soak test with a mix of interactive and non-interactive sessions:

tsh bench --duration=4h user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --duration=4h user@teleport-monster-6757d7b487-x226b ps uax

Observe the Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.

Breaking load tests

Load the system with tsh bench until it reaches capacity, and publish the maximum number of concurrent sessions sustained under interactive and non-interactive tsh bench loads.

Application Access @r0mant @smallinsky

Database Access @r0mant @smallinsky

quinqu commented 3 years ago

When I add an OTP device with tsh mfa add and try to enter the code, Teleport says the code must be 6 digits long, which my input is, but it still won't be accepted. Terminal output:

Choose device type [TOTP, U2F]: TOTP
Enter device name: tempdevice
Enter an OTP code from a *registered* device: 628304

Open your TOTP app and create a new manual entry with these fields:
  URL: <omitted> 
  Account name: <omitted>
  Secret key: <omitted>
  Issuer: <omitted> 
  Algorithm: SHA1
  Number of digits: 6
  Period: 30s

Once created, enter an OTP code generated by the app: 624072
TOTP code must be exactly 6 digits long, try again
Once created, enter an OTP code generated by the app: 624072
TOTP code must be exactly 6 digits long, try again
Once created, enter an OTP code generated by the app: 910046
TOTP code must be exactly 6 digits long, try again
Once created, enter an OTP code generated by the app: 426970
TOTP code must be exactly 6 digits long, try again
Once created, enter an OTP code generated by the app:

awly commented 3 years ago

@quinqu could you please file a bug for this and assign to me? It's likely I introduced the problem in 6.2

Joerger commented 3 years ago

Updating a user with tctl create -f user.yaml breaks the audit log and session recordings tabs in the Web UI - #6935

tcsc commented 3 years ago

@webvictim - I've added a test matrix for the tsh tests here so we don't stomp on each other. Or on ourselves. Feel free to edit as necessary.

New    New (No Rec)   Upgraded   Upgraded (No Rec)   Test
----   ------------   --------   -----------------   ----
PASS   PASS           PASS       PASS                tsh ssh <regular-node>
PASS   PASS           PASS       PASS                tsh ssh <node-remote-cluster>
PASS   PASS           PASS       PASS                tsh ssh -A <regular-node>
PASS   PASS           PASS       PASS                tsh ssh -A <node-remote-cluster>
PASS   PASS           PASS       PASS                tsh ssh <regular-node> ls
PASS   PASS           PASS       PASS                tsh ssh <node-remote-cluster> ls
PASS   PASS           PASS       PASS                tsh join <regular-node>
PASS   PASS           PASS       PASS                tsh join <node-remote-cluster>
PASS   *PASS          PASS       *PASS               tsh play <regular-node>
PASS   *PASS          PASS       *PASS               tsh play <node-remote-cluster>
PASS   PASS           PASS       PASS                tsh scp <regular-node>
PASS   PASS           PASS       PASS                tsh scp <node-remote-cluster>
PASS   PASS           PASS       PASS                tsh ssh -L <regular-node>
PASS   PASS           PASS       PASS                tsh ssh -L <node-remote-cluster>
PASS   PASS           PASS       PASS                tsh ls
PASS   PASS           PASS       PASS                tsh clusters

tcsc commented 3 years ago

Encountered #6938 while testing: Panic when using tctl with remote auth server

kimlisa commented 3 years ago

mfa related bug, where scp upload/download does not work in the web ui: https://github.com/gravitational/teleport/issues/6939

r0mant commented 3 years ago

@Joerger @xacrimon Seeing https://github.com/gravitational/teleport/issues/6935 as well which Brian reported above.

@xacrimon Looks like this file (dynamic.go) was a part of your RFD19 implementation, could this have caused it? Just need to add user.updated event to the switch probably.
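
A minimal sketch of the kind of switch described above, for illustration only. The contents of dynamic.go are not shown in this thread, so `decodeEvent`, `UserLogin`, `UserUpdated`, and the field names are all hypothetical; only the `user.updated` event name comes from the comment. It shows how a type switch without a matching case rejects an otherwise valid event, and how adding the case resolves it.

```go
package main

import "fmt"

// UserLogin and UserUpdated are hypothetical typed events used for illustration.
type UserLogin struct{ User string }
type UserUpdated struct{ Name string }

// decodeEvent maps an event type string to a typed value. Any type that has
// no case falls through to the default branch and is reported as unknown,
// which is the failure mode when a new event is emitted but never registered.
func decodeEvent(eventType string, fields map[string]string) (interface{}, error) {
	switch eventType {
	case "user.login":
		return UserLogin{User: fields["user"]}, nil
	case "user.updated": // the case that would have to be added
		return UserUpdated{Name: fields["name"]}, nil
	default:
		return nil, fmt.Errorf("unknown event type %q", eventType)
	}
}

func main() {
	ev, err := decodeEvent("user.updated", map[string]string{"name": "alice"})
	fmt.Println(ev, err)

	_, err = decodeEvent("not.registered", nil)
	fmt.Println(err)
}
```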

xacrimon commented 3 years ago

@r0mant Resolved in #6949 and #6950 backport to v6.

fspmarshall commented 3 years ago

Changes introduced in #6731 break compatibility with older 6.X instances because they rely on new gRPC methods. For example, viewing audit events from the UI of a 6.2 proxy fails with "unknown method GetEvents for service proto.AuthService" when the auth server is 6.1.

Teleport should fall back to the old event API if the new one is not available.

cc: @xacrimon @kimlisa

xacrimon commented 3 years ago

@fspmarshall So this is a bit of an issue. The old events API does not support pagination but the IAuditLog interface expects it. Should we just ignore the new parameters introduced in RFD 19 and pretend pagination doesn't exist on fallback?

kimlisa commented 3 years ago

ui switchback bug (i am fixing): https://github.com/gravitational/teleport/issues/6960

@xacrimon related to #6935, unknown event bug: https://github.com/gravitational/teleport/issues/6959

fspmarshall commented 3 years ago

> Should we just ignore the new parameters introduced in RFD 19 and pretend pagination doesn't exist on fallback?

@xacrimon Followed up in PR. Basically, I think we should pretend it doesn't exist when dealing with the first call (since that means we're getting the "first page", which is what the old API did), but we should return an error if startKey != "", since that means we're loading a subsequent page, which the old API can't do.
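
The fallback rule proposed above can be sketched roughly as follows. This is not the code from the PR: `eventsClient`, `SearchEventsPaginated`, `SearchEventsLegacy`, and `errUnimplemented` are hypothetical names standing in for the new paginated API, the old API, and the "unknown method" error. Only the rule itself (serve the first page from the old API, reject any call where startKey != "") comes from the comment.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a stand-in for an audit event.
type Event struct{ ID string }

// errUnimplemented models the "unknown method" error from an older auth server.
var errUnimplemented = errors.New("unknown method GetEvents for service proto.AuthService")

// eventsClient is a hypothetical client exposing both the new and old APIs.
type eventsClient interface {
	SearchEventsPaginated(limit int, startKey string) ([]Event, string, error)
	SearchEventsLegacy(limit int) ([]Event, error)
}

// searchWithFallback tries the paginated API first. If the server does not
// implement it, the first page is served from the legacy API, and any request
// for a later page fails because the legacy API cannot resume from a key.
func searchWithFallback(c eventsClient, limit int, startKey string) ([]Event, string, error) {
	events, next, err := c.SearchEventsPaginated(limit, startKey)
	if err == nil {
		return events, next, nil
	}
	if !errors.Is(err, errUnimplemented) {
		return nil, "", err
	}
	if startKey != "" {
		return nil, "", errors.New("auth server does not support paginated event queries")
	}
	events, err = c.SearchEventsLegacy(limit)
	if err != nil {
		return nil, "", err
	}
	// An empty next key tells the caller there are no further pages.
	return events, "", nil
}

// legacyOnlyServer simulates a 6.1 auth server that lacks the new method.
type legacyOnlyServer struct{}

func (legacyOnlyServer) SearchEventsPaginated(int, string) ([]Event, string, error) {
	return nil, "", errUnimplemented
}

func (legacyOnlyServer) SearchEventsLegacy(int) ([]Event, error) {
	return []Event{{ID: "a"}, {ID: "b"}}, nil
}

func main() {
	events, next, err := searchWithFallback(legacyOnlyServer{}, 10, "")
	fmt.Println(len(events), next == "", err)

	_, _, err = searchWithFallback(legacyOnlyServer{}, 10, "page-2")
	fmt.Println(err)
}
```

Returning an empty next key on fallback means a caller that loops until the key is empty stops after one page instead of hitting the error path.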

awly commented 3 years ago

@xacrimon @webvictim @fspmarshall @quinqu let me know if you're overloaded. Some other folks are done with their testing so I could re-distribute remaining tasks if needed.

quinqu commented 3 years ago

@awly i could use some help on the U2F second factor tests as i do not have a U2F device.

awly commented 3 years ago

@quinqu will do :+1:

awly commented 3 years ago

FYI everyone, if you find an issue while testing, please file a bug and put it into 6.2 milestone. That way I can track all the remaining work and questions.

xacrimon commented 3 years ago

I have previously assumed DynamoDB tests were running but they have not been. I still need to hook these up and run them before I can say everything is correct. I will make another comment but please do not cut before I confirm that everything is indeed working @awly. @russjones I've also merged the API compat PR. #6990 will need to be merged as well, I will ping for reviews when it is ready.

webvictim commented 3 years ago

Ran into some weird tsh logout behaviour, detailed in https://github.com/gravitational/teleport/issues/6992

Not sure if this is a blocker but I can't log out of all my clusters for some reason.

xacrimon commented 3 years ago

Okay. I have pinged reviews on #6990 and I sign off on everything working when it is merged. I’ve manually done some testing to make sure it works.

webvictim commented 3 years ago

Most Kubernetes tests are finished, just waiting on #6990 merge/backport (and rc.2 cut?) to verify the audit log entries.

awly commented 3 years ago

All issues are either resolved or not caused by 6.2. Marking the testplan as done.

russjones commented 3 years ago

From @fspmarshall

6.2 - etcd - IoT

tsh bench --duration=30m root@loadtest-665c98bfb5-72w58 ls
* Requests originated: 17920
* Requests failed: 258
* Last error: connection closed
Histogram
Percentile Response Duration
---------- -----------------
25         4867 ms
50         6943 ms
75         9583 ms
90         14951 ms
95         20959 ms
99         40799 ms
100        65439 ms

tsh bench --interactive --duration=30m root@loadtest-665c98bfb5-9wk2b ps aux
* Requests originated: 17905
* Requests failed: 253
* Last error: connection error: desc = "transport: authentication handshake failed: EOF"
Histogram
Percentile Response Duration
---------- -----------------
25         4923 ms
50         7079 ms
75         9727 ms
90         15015 ms
95         20783 ms
99         41951 ms
100        64927 ms

6.2 - etcd - non-IoT

tsh bench --duration=30m root@loadtest-665c98bfb5-qcf82 ls
* Requests originated: 17983
* Requests failed: 23
* Last error: connection error: desc = "transport: authentication handshake failed: EOF"
Histogram
Percentile Response Duration
---------- -----------------
25         4719 ms
50         6567 ms
75         8703 ms
90         11143 ms
95         13439 ms
99         21263 ms
100        49183 ms

tsh bench --interactive --duration=30m root@loadtest-665c98bfb5-zfsrb ps aux
* Requests originated: 17970
* Requests failed: 17
* Last error: connection error: desc = "transport: authentication handshake failed: EOF"
Histogram
Percentile Response Duration
---------- -----------------
25         4655 ms
50         6391 ms
75         8327 ms
90         10703 ms
95         13079 ms
99         21759 ms
100        59423 ms
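
To compare the four runs on equal footing, the failure counts can be turned into failure rates. The sketch below is a small helper written for this summary, not part of tsh bench; `benchRun` and `failureRate` are hypothetical names, and the numbers are copied from the output above.

```go
package main

import "fmt"

// benchRun holds the two counters reported by each tsh bench run above.
type benchRun struct {
	name       string
	originated int
	failed     int
}

// failureRate returns the percentage of originated requests that failed.
func failureRate(failed, originated int) float64 {
	if originated == 0 {
		return 0
	}
	return 100 * float64(failed) / float64(originated)
}

func main() {
	runs := []benchRun{
		{"etcd IoT, non-interactive", 17920, 258},
		{"etcd IoT, interactive", 17905, 253},
		{"etcd non-IoT, non-interactive", 17983, 23},
		{"etcd non-IoT, interactive", 17970, 17},
	}
	for _, r := range runs {
		fmt.Printf("%-32s %3d/%d failed (%.2f%%)\n",
			r.name, r.failed, r.originated, failureRate(r.failed, r.originated))
	}
}
```

On these numbers the IoT runs fail roughly 1.4% of requests, against roughly 0.1% for the non-IoT runs.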