gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport 10 Test Plan #13340

Closed r0mant closed 2 years ago

r0mant commented 2 years ago

Manual Testing Plan

Below are the items that should be manually tested with each release of Teleport. These tests should be run on both a fresh install of the version to be released and an upgrade from the previous version of Teleport.

User accounting @xacrimon

Combinations @capnspacehook

For some manual testing, many combinations need to be tested. For example, for interactive sessions the 12 combinations are below.

Teleport with EKS/GKE @tigrato

Teleport with multiple Kubernetes clusters @tigrato

Note: you can use GKE or EKS or minikube to run Kubernetes clusters. Minikube is the only caveat - it's not reachable publicly so don't run a proxy there.

Teleport with FIPS mode @alistanis @r0mant

ACME @rudream

Migrations @hugoShaka

Command Templates

When interacting with a cluster, the following command templates are useful:

OpenSSH

# when connecting to the recording proxy, `-o 'ForwardAgent yes'` is required.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# the above command only forwards the agent to the proxy, to forward the agent
# to the target node, `-o 'ForwardAgent yes'` needs to be passed twice.
ssh -o "ForwardAgent yes" \
  -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p" \
  node.example.com

# when connecting to a remote cluster using OpenSSH, the subsystem request is
# updated with the name of the remote cluster.
ssh -o "ProxyCommand ssh -o 'ForwardAgent yes' -p 3023 %r@proxy.example.com -s proxy:%h:%p@foo.com" \
  node.foo.com

Teleport

# when connecting to an OpenSSH node, remember that `-p 22` needs to be passed.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -p 22 node.example.com

# an agent can be forwarded to the target node with `-A`
tsh --proxy=proxy.example.com --user=<username> --insecure ssh -A -p 22 node.example.com

# the --cluster flag is used to connect to a node in a remote cluster.
tsh --proxy=proxy.example.com --user=<username> --insecure ssh --cluster=foo.com -p 22 node.foo.com

Teleport with SSO Providers @ptgott @Tener

tctl sso family of commands @Tener

tctl sso configure helps to construct a valid connector definition:

tctl sso test tests a provided connector definition, which can be loaded from a file or piped in from tctl sso configure or tctl get --with-secrets. Valid connectors are accepted; invalid ones are rejected with sensible error messages.

Teleport Plugins @marcoandredinis

AWS Node Joining @nklaassen

Docs

Passwordless @r0mant @espadolini

Passwordless requires tsh compiled with libfido2 for most operations (apart from Touch ID). Ask for a statically-built tsh binary for realistic tests.

Touch ID requires a properly built and signed tsh binary. Ask for a pre-release binary so you may run the tests.

This section complements "Users -> Managing MFA devices". Ideally both macOS and Linux tsh binaries are tested for FIDO2 items.

WEB UI @kimlisa @rudream @hatched

Main

For main, test with a role that has access to all resources.

Top Nav

Side Nav

Servers aka Nodes

Applications

Databases

Audit log

Users

Auth Connectors

Managed Clusters

Help & Support

Access Requests

Access Requests are an Enterprise feature and are not available in OSS.

Creating Access Requests (Role Based)

Create a role with limited permissions, allow-roles-and-nodes. This role allows you to see the Role screen and ssh into all nodes.

kind: role
metadata:
  name: allow-roles-and-nodes
spec:
  allow:
    logins:
    - root
    node_labels:
      '*': '*'
    rules:
    - resources:
      - role
      verbs:
      - list
      - read
  options:
    max_session_ttl: 8h0m0s
version: v5

Create another role with limited permissions, allow-users-with-short-ttl. This role's session expires in 4 minutes; it allows you to see the Users screen and denies access to all nodes.

kind: role
metadata:
  name: allow-users-with-short-ttl
spec:
  allow:
    rules:
    - resources:
      - user
      verbs:
      - list
      - read
  deny:
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 4m0s
version: v5

Create a user whose role grants no access to anything but allows requesting roles:

kind: role
metadata:
  name: test-role-based-requests
spec:
  allow:
    request:
      roles:
      - allow-roles-and-nodes
      - allow-users-with-short-ttl
      suggested_reviewers:
      - random-user-1
      - random-user-2
version: v5

Creating Access Requests (Search Based)

Create a role with access to searchable resources (apps, db, kubes, nodes, desktops). The template searcheable-resources is below.

kind: role
metadata:
  name: searcheable-resources
spec:
  allow:
    app_labels:  # just example labels
      label1-key: label1-value
      env: [dev, staging]
    db_labels:
      '*': '*'   # asterisk gives user access to everything
    kubernetes_labels:
      '*': '*'
    node_labels:
      '*': '*'
    windows_desktop_labels:
      '*': '*'
version: v5

Create a user whose role grants no access to resources but allows searching them:

kind: role
metadata:
  name: test-search-based-requests
spec:
  allow:
    request:
      search_as_roles:
      - searcheable-resources
      suggested_reviewers:
      - random-user-1
      - random-user-2
version: v5

Viewing & Approving/Denying Requests

Create a user with the role reviewer that allows you to review all requests, and delete them.

kind: role
version: v3
metadata:
  name: reviewer
spec:
  allow:
    review_requests:
      roles: ['*']

Assuming Approved Requests (Role Based)

Assuming Approved Requests (Search Based)

Access Request Waiting Room

Strategy Reason

Create the following role:

kind: role
metadata:
  name: waiting-room
spec:
  allow:
    request:
      roles:
      - <some other role to assign user after approval>
  options:
    max_session_ttl: 8h0m0s
    request_access: reason
    request_prompt: <some custom prompt to show in reason dialogue>
version: v3

Strategy Always

With the previous role you created from Strategy Reason, change request_access to always:

Strategy Optional

With the previous role you created from Strategy Reason, change request_access to optional:

Terminal

Node List Tab

Session Tab

Session Player

Invite and Reset Form

Login Form and Change Password

Multi-factor Authentication (mfa)

Create or modify teleport.yaml and set the following authentication settings under auth_service:

authentication:
  type: local
  second_factor: optional
  require_session_mfa: yes
  webauthn:
    rp_id: example.com

MFA invite, login, password reset, change password

MFA require auth

Go to Account Settings > Two-Factor Devices and register a new device

Using the same user as above:

MFA Management

Passwordless

Cloud

From your cloud staging account, change the field teleportVersion to the test version.

$ kubectl -n <namespace> edit tenant

Recovery Code Management

Invite/Reset

Recovery Flow: Add new mfa device

Recovery Flow: Change password

Recovery Email

RBAC

Create a role, with no allow.rules defined:

kind: role
metadata:
  name: rbac
spec:
  allow:
    app_labels:
      '*': '*'
    logins:
    - root
    node_labels:
      '*': '*'
  options:
    max_session_ttl: 8h0m0s
version: v3

Note: User has read/create access_request access to their own requests, despite resource settings

Add the following under spec.allow.rules to enable read access to the audit log:

  - resources:
    - event
    verbs:
    - list
Add the following to enable read access to recorded sessions

  - resources:
    - session
    verbs:
    - read

Add the following to enable read access to the roles

  - resources:
    - role
    verbs:
    - list
    - read

Add the following to enable read access to the auth connectors

  - resources:
    - auth_connector
    verbs:
    - list
    - read

Add the following to enable read access to users

  - resources:
    - user
    verbs:
    - list
    - read

Add the following to enable read access to trusted clusters

  - resources:
    - trusted_cluster
    verbs:
    - list
    - read

Performance/Soak Test @rosstimothy @espadolini

Using the tsh bench tool, perform the soak tests and benchmark tests on the following configurations:

Soak Tests

Run a 4-hour soak test with a mix of interactive and non-interactive sessions:

tsh bench --duration=4h user@teleport-monster-6757d7b487-x226b ls
tsh bench -i --duration=4h user@teleport-monster-6757d7b487-x226b ps uax

Observe Prometheus metrics for goroutines, open files, RAM, CPU, and timers, and make sure there are no leaks.

Breaking load tests

Load the system to capacity with tsh bench and publish the maximum number of concurrent sessions for interactive and non-interactive tsh bench loads.

Teleport with Cloud Providers

AWS @lxea

GCP @EdwardDowling

IBM @r0mant

Application Access @strideynet

Database Access @smallinsky

TLS Routing @smallinsky

Desktop Access

Basic Sessions (@LKozlowski)

User Input (@ibeckermayer)

Binaries compatibility @fheinecke

Machine ID @timothyb89

SSH

With a default Teleport instance configured with a SSH node:

Ensure the above tests are completed for both:

DB Access

With a default Postgres DB instance, a Teleport instance configured with DB access and a bot user configured:

Teleport Connect @ravicious @gzdunek @avatus

Host users creation @jakule

Host users creation docs Host users creation RFD

CA rotations @espadolini

IP-based validation

SSH @probakowski

zmb3 commented 2 years ago

Looks like we "regressed" and increased the GLIBC dependency again.

Edit: this appears to be related to the Rust version. Reverting to 1.58.1 seems to fix it.

I will downgrade for now: https://github.com/gravitational/teleport/pull/13544

codingllama commented 2 years ago

A few preliminary findings:

  1. tctl and teleport always print the warning below on macOS, which I think could be downgraded:
$ tctl -c ./teleport.yaml users ls
> 2022-06-15T17:29:04-03:00 WARN             Disabling host user creation as this feature is only available on Linux config/configuration.go:998

$ teleport start -c ./teleport.yaml
> 2022-06-15T17:28:58-03:00 WARN             Disabling host user creation as this feature is only available on Linux config/configuration.go:998
  2. tctl still mentions the (removed) "admin" role:
$ tctl -c ./teleport.yaml users add --help
(...)
Examples:

  > tctl users add --roles=admin,dba joe

  This creates a Teleport account 'joe' who will assume the roles 'admin' and 'dba'
  To see the permissions of 'admin' role, execute 'tctl get role/admin'
  3. tsh Touch ID authn isn't respecting users and is picking the "oldest" credential

Repro by adding >1 credential and then >1 users. 😢

I'll focus on (3); (1) and (2) are easy pickings if someone wants to fix them.

r0mant commented 2 years ago

@lxea Could you take a look at "1" and "2" from Alan's comment above?

GavinFrazar commented 2 years ago

I noticed in the audit log when I do anything on my database (mysql) the log entries always show [undefined], even if I select a database explicitly during my session with "use ". Looks like this:

User [remote-alice-cluster1] has executed query [show tables] in database [undefined] on [testmysql]
User [remote-alice-cluster1] has executed query [show databases] in database [undefined] on [testmysql]
User [remote-alice-cluster1] has changed default database to [foodb] on [testmysql]

edit: found an issue for this #5903

It appears the behavior is to always show the database name used on login.

So if I do $ tsh db login --db-name=foodb testmysql or tsh db connect --db-name=foodb testmysql then all audit logs in that session will show [foodb] as the database. If I switch databases in mysql with use otherdb, then audit log continues to show actions as if they were done in [foodb]. If I don't specify any --db-name with login/connect then it's always [undefined].

Joerger commented 2 years ago

I found a tsh ssh -J regression related to TLS routing - https://github.com/gravitational/teleport/issues/13554

strideynet commented 2 years ago

tsh play <chunk-id> can fetch and print a session chunk archive.

Not concerned this is a blocker, and may actually just be the test plan being incorrect. This command fails with offset 0 not found for session. This is because by default tsh play attempts to play a session back to the PTY which is not compatible with application access session recordings. Running the command with --format json succeeds. Looking at the blame of the code, it doesn't look like this is a recent regression, and may have always been the case.

Do we want to update the test plan with the correct command? I imagine eventually it would be nice if users didn't have to provide this flag for the command to work, but given how we currently switch in the implementation between two modes, it will probably involve rewriting onPlay to support that.

strideynet commented 2 years ago

Discovered a regression with using the configuration output by teleport configure: https://github.com/gravitational/teleport/issues/13558

I'll write a fix for this today and we should be able to get it merged down asap.


This fix has been merged down to branch/v10 and I can confirm the regression appears to be fixed.

rosstimothy commented 2 years ago

Discovered some backwards incompatibility with SSO login: https://github.com/gravitational/teleport/issues/13575

Edit (Joerger): Fixed in https://github.com/gravitational/teleport/pull/13589

Joerger commented 2 years ago

Found a regression in tsh join, I'll try fixing it.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x17b3fc0]

goroutine 1 [running]:
github.com/gravitational/teleport/api/types.(*SessionTrackerV1).GetAddress(0x390c700?)
    /home/bjoerger/gravitational/teleport/api/types/session_tracker.go:274
github.com/gravitational/teleport/lib/client.(*TeleportClient).Join(0xc00025e700, {0x3931f90, 0xc0000541f8}, {0x341aee2, 0x4}, {0x3426c0b?, 0x7}, {0x7ffd260b70f7, 0x24}, {0x0, ...})
    /home/bjoerger/gravitational/teleport/lib/client/api.go:1976 +0x6f2
main.onJoin.func1()
    /home/bjoerger/gravitational/teleport/tool/tsh/tsh.go:2584 +0x65
github.com/gravitational/teleport/lib/client.RetryWithRelogin({0x3932000, 0xc000a4c4b0}, 0xc00025e700, 0xc000b3e550)
    /home/bjoerger/gravitational/teleport/lib/client/api.go:719 +0x4e
main.onJoin(0xc0006ac000)
    /home/bjoerger/gravitational/teleport/tool/tsh/tsh.go:2583 +0x1b5
main.Run({0x39330d8, 0xc0002ae780}, {0xc00004e090, 0x3, 0x3}, {0x0, 0x0, 0xc0000021a0?})
    /home/bjoerger/gravitational/teleport/tool/tsh/tsh.go:859 +0x12445
main.main()
    /home/bjoerger/gravitational/teleport/tool/tsh/tsh.go:396 +0x318

Edit: fixed in https://github.com/gravitational/teleport/pull/13596

Joerger commented 2 years ago

Possible regression: I can't join/view my own sessions despite having permissions to do so. Am I missing something in https://goteleport.com/docs/ver/10.0/access-controls/reference/?

https://github.com/gravitational/teleport/issues/13595

GavinFrazar commented 2 years ago

Some issues I ran into while testing kube access locally:

  1. tsh kube exec --tty --stdin shell-demo /bin/sh leads to panic:

> tsh kube exec --tty --stdin shell-demo /bin/sh 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x106905790]

goroutine 1 [running]:
main.(*StreamOptions).SetupTTY(0x14000abe410)
    /Users/gavin/work/teleport/tool/tsh/kube.go:281 +0x180
main.(*ExecOptions).Run(0x14000abe410)
    /Users/gavin/work/teleport/tool/tsh/kube.go:356 +0x280
main.(*kubeExecCommand).run(0x14000674600, 0x0?)
    /Users/gavin/work/teleport/tool/tsh/kube.go:467 +0x388
main.Run({0x1075eac90, 0x140006f1540}, {0x140001b6010, 0x6, 0x6}, {0x0, 0x0, 0x300000002?})
    /Users/gavin/work/teleport/tool/tsh/tsh.go:896 +0x12e98
main.main()
    /Users/gavin/work/teleport/tool/tsh/tsh.go:396 +0x2c0
 [19:16:57] gavin@mac ~ [SIGINT] 
> kubectl exec -it shell-demo -- /bin/sh
# whoami
root

  2. tsh kube credentials issue when --teleport-cluster flag does not match $TELEPORT_CLUSTER
    • On first use, you get an error. If you immediately run the same command, it prints the credentials. I ran into this because teleport modifies kubeconfig to execute this command to authenticate, if you're not already logged into teleport. So essentially, rm -rf ~/.tsh && kubectl get pods prompts me for my password and then prints an error message, but if I just run kubectl get pods again, it works.

```markdown
[19:08:20] gavin@mac ~ [1]
> rm -rf ~/.tsh
[19:09:07] gavin@mac ~
> tenv show
TELEPORT_CLUSTER=cluster2
TELEPORT_DEV_OUT=/tmp/out2.log
TELEPORT_CONFIG_FILE=/Users/gavin/teleport-config/nodes/cluster2.yaml
TELEPORT_USER=alice
TELEPORT_DEV_CONFIG_FILE=/Users/gavin/teleport-config/nodes/cluster2.yaml
TELEPORT_PROXY=proxy2.local.gd:4080
[19:09:11] gavin@mac ~
> bat ~/.kube/config | rg "exec" -A 10
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - kube
      - credentials
      - --kube-cluster=minikube
      - --teleport-cluster=cluster1
      - --proxy=proxy1.local.gd:3080
      - --insecure
      command: /Users/gavin/work/teleport/build/tsh
      env: null
[19:09:41] gavin@mac ~
> kubectl get pods
Enter password for Teleport user alice:
WARNING: You are using insecure connection to SSH proxy https://proxy1.local.gd:3080
ERROR: SSH cert not available
Unable to connect to the server: getting credentials: exec: executable /Users/gavin/work/teleport/build/tsh failed with exit code 1
[19:09:57] gavin@mac ~ [1]
> kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
shell-demo   1/1     Running   0          75m
```

Tener commented 2 years ago

@GavinFrazar

tsh kube credentials issue when --teleport-cluster flag does not match $TELEPORT_CLUSTER

I'm not sure if this would fix the outlined issue, but I noticed recently that a couple of --cluster and --teleport-cluster flags should really use .Envar(clusterEnvVar) in their mix. At the time I didn't realise this may cause issues like the one you outlined, but perhaps the fix is as simple as adding that call to the mix as appropriate. For example:

1)

    c.Flag("teleport-cluster", "Name of the teleport cluster to get credentials for.").Required().StringVar(&c.teleportCluster)

becomes

    c.Flag("teleport-cluster", "Name of the teleport cluster to get credentials for.").Required().Envar(clusterEnvVar).StringVar(&c.teleportCluster)

2)

    ssh.Flag("cluster", clusterHelp).Short('c').StringVar(&cf.SiteName)

becomes

    ssh.Flag("cluster", clusterHelp).Envar(clusterEnvVar).Short('c').StringVar(&cf.SiteName)
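
To illustrate the precedence this change is after, here is a minimal sketch that uses only Go's standard `flag` package rather than kingpin, so it is not the tsh code. `resolveCluster` is a hypothetical helper, and `TELEPORT_CLUSTER` is assumed to be the variable behind `clusterEnvVar`, based on the name in the report above. An explicit flag wins; the environment variable is only a fallback.

```go
package main

import (
	"flag"
	"fmt"
)

// clusterEnvVar is assumed to name the TELEPORT_CLUSTER variable from the report.
const clusterEnvVar = "TELEPORT_CLUSTER"

// resolveCluster is a hypothetical helper: it parses args and returns the
// cluster name, preferring an explicit --teleport-cluster flag and falling
// back to the environment (via getenv) when the flag is absent.
func resolveCluster(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("kube-credentials", flag.ContinueOnError)
	// The environment value is used as the flag's default, so it only
	// applies when the flag is not passed explicitly.
	cluster := fs.String("teleport-cluster", getenv(clusterEnvVar),
		"Name of the teleport cluster to get credentials for.")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *cluster == "" {
		return "", fmt.Errorf("--teleport-cluster or %s must be set", clusterEnvVar)
	}
	return *cluster, nil
}

func main() {
	// Simulated environment with TELEPORT_CLUSTER=cluster2.
	env := func(string) string { return "cluster2" }

	fromEnv, _ := resolveCluster(nil, env)
	fromFlag, _ := resolveCluster([]string{"--teleport-cluster=cluster1"}, env)
	fmt.Println(fromEnv, fromFlag) // prints: cluster2 cluster1
}
```

Passing `getenv` as a parameter keeps the helper testable without touching the real process environment.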

Tener commented 2 years ago

@atburke

Regression due to https://github.com/gravitational/teleport/pull/12934:

Basically the logic between onListDatabases and listDatabasesAllClusters is out of sync. The former contains the correct code to fetch roles:

https://github.com/gravitational/teleport/blob/77b35b8dd67953229ae2a9a824113b83c8ba426c/tool/tsh/db.go#L81-L104

The latter does not (profile.Roles):

https://github.com/gravitational/teleport/blob/77b35b8dd67953229ae2a9a824113b83c8ba426c/tool/tsh/db.go#L163-L167

The result is that we try to get the definition of a role that we do not hold in the leaf cluster, and we may not have permission to read it.

For example, given clusters boson.tener.io and quark.tener.io and a trusted cluster role mapping granting only the access role:

kind: trusted_cluster
metadata:
  id: 1655472056507184000
  name: boson.tener.io
spec:
  enabled: true
  role_map:
  - local:
    - access
    remote: access
  token: foo
  tunnel_addr: boson.tener.io:3080
  web_proxy_addr: boson.tener.io:3080
version: v2

We will get errors when tsh tries to read the editor and auditor roles from quark.tener.io. This is an error because the mapping only gives access role. The code in onListDatabases correctly handles that case.
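
The mismatch can be pictured with a small sketch. It assumes a simplified, exact-match version of `role_map` (the real mapping rules may be richer) and made-up sample data: a root-cluster user holding access, editor and auditor ends up with only access in the leaf, so looking up editor or auditor there has nothing to resolve. `roleMapping` and `mapRoles` are hypothetical names, not Teleport code.

```go
package main

import "fmt"

// roleMapping mirrors one role_map entry of the trusted_cluster resource:
// a remote (root cluster) role and the local (leaf cluster) roles it grants.
type roleMapping struct {
	Remote string
	Local  []string
}

// mapRoles is a hypothetical, exact-match-only sketch of role mapping.
// Remote roles without a matching entry grant nothing in the leaf cluster.
func mapRoles(roleMap []roleMapping, remoteRoles []string) []string {
	var local []string
	seen := map[string]bool{}
	for _, remote := range remoteRoles {
		for _, m := range roleMap {
			if m.Remote != remote {
				continue
			}
			for _, l := range m.Local {
				if !seen[l] {
					seen[l] = true
					local = append(local, l)
				}
			}
		}
	}
	return local
}

func main() {
	// The mapping from the trusted_cluster resource above.
	roleMap := []roleMapping{{Remote: "access", Local: []string{"access"}}}
	// Roles held in the root cluster (sample values based on the report).
	rootRoles := []string{"access", "editor", "auditor"}

	fmt.Println(mapRoles(roleMap, rootRoles)) // prints [access]
}
```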

$ tsh clusters
Cluster Name   Status Cluster Type Labels Selected
-------------- ------ ------------ ------ --------
boson.tener.io online root                *
quark.tener.io online leaf

$ tsh db ls
Name Description Allowed Users Labels Connect
---- ----------- ------------- ------ -------

$ tsh db --cluster=quark.tener.io ls
Name                            Description         Allowed Users     Labels  Connect
------------------------------- ------------------- ----------------- ------- ------------------------------------------------------------------------
> qmongo (user: alice)                              [alice bob tener]         tsh db connect --cluster=quark.tener.io --db-name=<name> qmongo
> qmongo-insecure (user: alice)                     [alice bob tener]         tsh db connect --cluster=quark.tener.io --db-name=<name> qmongo-insecure
redisquark                      Quark Redis example [alice bob tener] env=dev

$ tsh db --cluster=quark.tener.io ls --all
ERROR: access denied to perform action "read" on "role"

I'm unlikely to have the time to fix it before my PTO.

jakule commented 2 years ago

I found two issues related to the host user creations https://github.com/gravitational/teleport/issues/13663 https://github.com/gravitational/teleport/issues/13662

nklaassen commented 2 years ago

found an issue with the "Instance" role and the EC2 join method https://github.com/gravitational/teleport/issues/13677

LKozlowski commented 2 years ago

I found an issue with LDAP attribute labeling - it does not work correctly: #13680

LKozlowski commented 2 years ago

Regexp-based host labeling applies across all desktops, regardless of origin.

I don't know if this is an issue or not, but I had a hard time figuring out why it does not work the way I would expect it to work. There is an inconsistency between how we treat LDAP discovered hosts vs static hosts.

Scenario 1: LDAP hosts

windows_desktop_service:
...
  discovery:
    base_dn: "*"
  host_labels:
    - match: '^.*\.example\.com$'
      labels:
        environment: dev

Using this configuration, if the discovered host has its DNS host name set to EXAMPLE-82K6DLP.example.com, the regexp matches and the host gets the extra label environment: dev.

Scenario 2: Static hosts

windows_desktop_service:
...
  hosts:
    - EXAMPLE-82K6DLP.example.com
  host_labels:
    - match: '^.*\.example\.com$'
      labels:
        environment: dev

Using this configuration, with the same regexp and the same DNS host name for a static host, the regexp does not match and the host does not get the extra label.

The reason is that for static hosts we match the regexp against hostname:port. In our example we would compare our regexp with EXAMPLE-82K6DLP.example.com:3389, which fails to match because of the $ at the end of the regexp.

https://github.com/gravitational/teleport/blob/ca520999c1f3e929e98f37c551532446bfbfbbd7/lib/srv/desktop/windows_server.go#L982-L989

I don't know whether this was intended or whether we should fix it by matching against the host without the port. It would be great if @zmb3 could take a look at my comment, as I think he is the author of this functionality.
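
The difference between the two scenarios can be reproduced with Go's standard `regexp` package. This is a minimal sketch, not the Teleport implementation; the only assumption is the one stated above, that static hosts are matched as hostname:port. `matchesHostLabel` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostLabelPattern is the host_labels match expression used in both scenarios.
var hostLabelPattern = regexp.MustCompile(`^.*\.example\.com$`)

// matchesHostLabel reports whether the pattern matches the given address.
func matchesHostLabel(addr string) bool {
	return hostLabelPattern.MatchString(addr)
}

func main() {
	// LDAP-discovered host: matched by DNS host name only.
	fmt.Println(matchesHostLabel("EXAMPLE-82K6DLP.example.com")) // true
	// Static host: matched as hostname:port, so the trailing $ never lines up.
	fmt.Println(matchesHostLabel("EXAMPLE-82K6DLP.example.com:3389")) // false
}
```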

zmb3 commented 2 years ago

@LKozlowski I don't think we ever noticed this before, but technically regex-based labeling is working as intended, we're just not clear in the docs or examples that the port is included.

Feels like the simplest thing would be to remove the $ from the examples and mention in the docs that the port is included in the match for static hosts.

espadolini commented 2 years ago

That will end up match anything with an example.com prefix tho; perhaps the docs should add a (:3389)? before the $ instead, if that works (or a (:\d+)?, if we want to be pedantic).
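
A quick way to compare the options is to run each candidate pattern against a bare host name, a static host address with a port, and a look-alike name that merely starts with the expected domain. The sketch uses Go's standard `regexp` package; the look-alike host `host.example.com.evil.test` is a made-up example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Candidate host_labels patterns from the discussion.
var (
	original      = regexp.MustCompile(`^.*\.example\.com$`)
	withoutDollar = regexp.MustCompile(`^.*\.example\.com`)
	optionalPort  = regexp.MustCompile(`^.*\.example\.com(:\d+)?$`)
)

func main() {
	addrs := []string{
		"EXAMPLE-82K6DLP.example.com",      // host name without a port
		"EXAMPLE-82K6DLP.example.com:3389", // static host with a port
		"host.example.com.evil.test:3389",  // hypothetical look-alike
	}
	for _, a := range addrs {
		fmt.Printf("%-34s original=%-5v withoutDollar=%-5v optionalPort=%v\n",
			a,
			original.MatchString(a),
			withoutDollar.MatchString(a),
			optionalPort.MatchString(a))
	}
}
```

Of the three, only the optional-port pattern accepts both real addresses while rejecting the look-alike; dropping the `$` accepts all three.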

ibeckermayer commented 2 years ago

I found an issue with desktop access scroll behavior: https://github.com/gravitational/teleport/issues/13690

zmb3 commented 2 years ago

That will end up match anything with an example.com prefix tho; perhaps the docs should add a (:3389)? before the $ instead, if that works (or a (:\d+)?, if we want to be pedantic).

Sure, that works. Or I'm also fine matching against the host and not the port.

I don't see this as a major issue since it has always been this way, and few people use static hosts.

nklaassen commented 2 years ago

~It seems unfortunate to have these error logs by default, I thought I saw a PR to remove them but now I can't find it, are we removing these @lxea @atburke? I haven't intentionally enabled either of these features, and my log just fills with these errors over time.~

2022-06-21T23:22:55Z ERRO [EC2LABELS] Error fetching EC2 tags: object not found ec2/ec2.go:144
2022-06-21T23:22:55Z ERRO             Error during temporary user cleanup: group: unknown group teleport-system srv/usermgmt.go:341

Edit: my bad, these are already fixed, sorry for the noise

atburke commented 2 years ago

@nklaassen #13529 should fix the EC2 labels error.

LKozlowski commented 2 years ago

That will end up match anything with an example.com prefix tho; perhaps the docs should add a (:3389)? before the $ instead, if that works (or a (:\d+)?, if we want to be pedantic).

Sure, that works. Or I'm also fine matching against the host and not the port.

I don't see this as a major issue since it has always been this way, and few people use static hosts.

I just wanted to bring it up as it wasn't clear for me when I was testing it, but I agree that it is working fine. As you said, we just need to either update docs or slightly update the code. Anyway, I'll mark it in the test plan as working and we'll just improve it later so it doesn't block the v10 release.

espadolini commented 2 years ago

Found a compatibility issue between v9 leafs and v10 roots related to the new database CA:

ravicious commented 2 years ago

Is tsh status supposed to report -teleport-internal-join as one of the SSH logins? I can see it in the logins list for v10 clusters but not for the ones running older versions of Teleport.

espadolini commented 2 years ago

Is tsh status supposed to report -teleport-internal-join as one of the SSH logins?

We should probably filter out that one and the -teleport-nologin-<uuid> ones.
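
A minimal sketch of such a filter, assuming the internal logins can be recognized purely by the two name patterns mentioned in this thread. `filterInternalLogins` is a hypothetical helper, not an existing tsh function, and the sample login list is made up.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// internalJoinLogin is the exact login name reported by tsh status above.
	internalJoinLogin = "-teleport-internal-join"
	// noLoginPrefix precedes a UUID in the -teleport-nologin-<uuid> logins.
	noLoginPrefix = "-teleport-nologin-"
)

// filterInternalLogins returns the logins that are meaningful to show to a
// user, dropping the internal placeholder names.
func filterInternalLogins(logins []string) []string {
	visible := make([]string, 0, len(logins))
	for _, l := range logins {
		if l == internalJoinLogin || strings.HasPrefix(l, noLoginPrefix) {
			continue
		}
		visible = append(visible, l)
	}
	return visible
}

func main() {
	// "1234" stands in for a real UUID.
	logins := []string{"root", "-teleport-internal-join", "-teleport-nologin-1234", "ubuntu"}
	fmt.Println(filterInternalLogins(logins)) // prints [root ubuntu]
}
```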

Joerger commented 2 years ago

ssh -J <teleport-proxy> doesn't work with tls routing (since v8.0.0) - https://github.com/gravitational/teleport/issues/13833

fheinecke commented 2 years ago

tsh does not work on Debian 9 due to glibc 2.25 dependency - #13894

zmb3 commented 2 years ago

I'm seeing a "session data" event that I'm not used to seeing which renders with a missing session ID in the audit log.

It's not just a UI thing, the JSON for the event has "sid": "".

rosstimothy commented 2 years ago

Direct Dial Nodes unreachable because they are reporting an address of [::]:3022 https://github.com/gravitational/teleport/issues/13898

rosstimothy commented 2 years ago

Reverse Tunnel Nodes getting stuck initializing and not connecting: https://github.com/gravitational/teleport/issues/13911

rosstimothy commented 2 years ago

etcd 500 TC Scaling Test

https://teleportcoreteam.grafana.net/goto/m-ivFEqnk?orgId=1

codingllama commented 2 years ago

Something minor I just noticed: my (idle) local teleport was spamming a session recording warning (shutdown logs included):

2022-06-27T17:58:47-03:00 [UPLOAD]    WARN Skipped session recording 25366a4e-03f8-47e6-a4ea-6c54d1290c4f.tar. error:[session file could be corrupted or is using unsupported format: session recording 25366a4e-03f8-47e6-a4ea-6c54d1290c4f is either corrupted or is using unsupported format, remove the file /path/to/teleport/log/upload/streaming/default/25366a4e-03f8-47e6-a4ea-6c54d1290c4f.tar to correct the problem, remove the /path/to/teleport/log/upload/streaming/default/25366a4e-03f8-47e6-a4ea-6c54d1290c4f.error file to retry the upload] filesessions/fileasync.go:253
^C2022-06-27T17:58:51-03:00 [PROC:1]    INFO Got signal "interrupt", exiting immediately. pid:27917.1 service/signals.go:83
2022-06-27T17:58:51-03:00 [PROC:1]    WARN Sync rotation state cycle failed. Retrying in ~10s pid:27917.1 service/connect.go:682
2022-06-27T17:58:51-03:00 [AUDIT:1]   INFO File uploader is shutting down. pid:27917.1 service/service.go:2480
2022-06-27T17:58:51-03:00 [AUDIT:1]   INFO File uploader has shut down. pid:27917.1 service/service.go:2482

I didn't do anything special with the cluster today, other than a few login attempts. Posting here in case it rings a bell for someone.

avatus commented 2 years ago

Something minor I just noticed: my (idle) local teleport was spamming a session recording warning (shutdown logs included):

2022-06-27T17:58:47-03:00 [UPLOAD]    WARN Skipped session recording 25366a4e-03f8-47e6-a4ea-6c54d1290c4f.tar. error:[session file could be corrupted or is using unsupported format: session recording 25366a4e-03f8-47e6-a4ea-6c54d1290c4f is either corrupted or is using unsupported format, remove the file /path/to/teleport/log/upload/streaming/default/25366a4e-03f8-47e6-a4ea-6c54d1290c4f.tar to correct the problem, remove the /path/to/teleport/log/upload/streaming/default/25366a4e-03f8-47e6-a4ea-6c54d1290c4f.error file to retry the upload] filesessions/fileasync.go:253
^C2022-06-27T17:58:51-03:00 [PROC:1]    INFO Got signal "interrupt", exiting immediately. pid:27917.1 service/signals.go:83
2022-06-27T17:58:51-03:00 [PROC:1]    WARN Sync rotation state cycle failed. Retrying in ~10s pid:27917.1 service/connect.go:682
2022-06-27T17:58:51-03:00 [AUDIT:1]   INFO File uploader is shutting down. pid:27917.1 service/service.go:2480
2022-06-27T17:58:51-03:00 [AUDIT:1]   INFO File uploader has shut down. pid:27917.1 service/service.go:2482

I didn't do anything special with the cluster today, other than a few login attempts. Posting here in case it rings a bell for someone.

This happened to me as well, and adding auth_service.session_recording = off to the config failed to stop the warning, if that provides any further context.

espadolini commented 2 years ago

my (idle) local teleport was spamming a session recording warning

Should be fixed by https://github.com/gravitational/teleport/pull/13826; fixing the warning in a running cluster involves manually deleting the file in the recordings, I think.

r0mant commented 2 years ago

Can't get passwordless scenario to work as described in the test plan:

  1. Adding touchid device using tsh mfa add
  2. Touchid device is visible in tsh mfa ls and tsh touchid ls (the latter also brings up touchid prompt) ✅
  3. Running tsh -d login --proxy=root.gravitational.io:3080 --auth=passwordless doesn't work: it asks me to tap a security key (I didn't register one separately) ❌
➜  e git:(afa3414) ✗ tsh login --proxy=root.gravitational.io:3080 --auth=passwordless
Tap your security key
^CERROR: context canceled

Logs:

➜  e git:(afa3414) ✗ tsh -d login --proxy=root.gravitational.io:3080 --auth=passwordless
DEBU [CLIENT]    open /Users/r0mant/.tsh/root.gravitational.io.yaml: no such file or directory client/api.go:1052
INFO [CLIENT]    No teleport login given. defaulting to r0mant client/api.go:1394
INFO [CLIENT]    no host login given. defaulting to r0mant client/api.go:1404
INFO [CLIENT]    [KEY AGENT] Connected to the system agent: "/private/tmp/com.apple.launchd.0G1kn68Tdf/Listeners" client/api.go:3934
DEBU [CLIENT]    attempting to use loopback pool for local proxy addr: root.gravitational.io:3080 client/api.go:3892
DEBU [CLIENT]    reading self-signed certs from: /var/lib/teleport/webproxy_cert.pem client/api.go:3900
DEBU [CLIENT]    could not open any path in: /var/lib/teleport/webproxy_cert.pem client/api.go:3904
DEBU             Attempting GET root.gravitational.io:3080/webapi/ping/passwordless webclient/webclient.go:115
DEBU [CLIENT]    attempting to use loopback pool for local proxy addr: root.gravitational.io:3080 client/api.go:3892
DEBU [CLIENT]    reading self-signed certs from: /var/lib/teleport/webproxy_cert.pem client/api.go:3900
DEBU [CLIENT]    could not open any path in: /var/lib/teleport/webproxy_cert.pem client/api.go:3904
DEBU [CLIENT]    HTTPS client init(proxyAddr=root.gravitational.io:3080, insecure=false) client/weblogin.go:233
DEBU             Attempting platform login webauthncli/api.go:97
DEBU             Platform login failed, falling back to cross-platform error:[credential not found] webauthncli/api.go:103
DEBU             FIDO2: Using libfido2 for assertion webauthncli/api.go:113
DEBU             FIDO2: Info for device ioreg://4294970624: &libfido2.DeviceInfo{Versions:[]string{"U2F_V2", "FIDO_2_0", "FIDO_2_1_PRE"}, Extensions:[]string{"credProtect", "hmac-secret"}, AAGUID:[]uint8{0xee, 0x88, 0x28, 0x79, 0x72, 0x1c, 0x49, 0x13, 0x97, 0x75, 0x3d, 0xfc, 0xce, 0x97, 0x7, 0x2a}, Options:[]libfido2.Option{libfido2.Option{Name:"rk", Value:"true"}, libfido2.Option{Name:"up", Value:"true"}, libfido2.Option{Name:"plat", Value:"false"}, libfido2.Option{Name:"clientPin", Value:"false"}, libfido2.Option{Name:"credentialMgmtPreview", Value:"true"}}, Protocols:[]uint8{0x1}} webauthncli/fido2.go:658
DEBU             FIDO2: Device ioreg://4294970624: filtered due to lack of UV webauthncli/fido2.go:137
Tap your security key
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
DEBU             FIDO2: Selecting devices error:[no suitable devices found] webauthncli/fido2.go:612
^C
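
The debug output above shows the only detected FIDO2 device being "filtered due to lack of UV", after which device selection keeps failing. A rough sketch of that kind of eligibility check follows. It assumes the CTAP2 convention that a device can verify the user when its `uv` or `clientPin` option is `true`; this is an assumption for illustration, not Teleport's actual code, and `can_verify_user` is a hypothetical name. The option values are copied from the log line.

```python
def can_verify_user(options):
    """Return True if the device options indicate user verification is configured.

    Assumption (not taken from Teleport's source): "uv" == "true" means built-in
    user verification is set up, "clientPin" == "true" means a PIN has been set.
    """
    return options.get("uv") == "true" or options.get("clientPin") == "true"


# Options reported for device ioreg://4294970624 in the log above.
device_options = {
    "rk": "true",
    "up": "true",
    "plat": "false",
    "clientPin": "false",
    "credentialMgmtPreview": "true",
}

# No "uv" option and no PIN set, so under the stated assumption the device
# would be skipped for passwordless login.
print(can_verify_user(device_options))
```

Under this reading, the security key supports resident keys (`rk`) but has no PIN configured, so it is not usable for passwordless and the prompt never completes.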

cc @codingllama

codingllama commented 2 years ago

Can't get passwordless scenario to work as described in the test plan:

@r0mant could you double-check that you are using tsh from the signed/notarized/etc tsh.app bundle? I downloaded the tsh-v10.0.0-alpha.2.pkg installer and cleared the testplan without problems using it. Hit me up on Slack if you still have issues.

espadolini commented 2 years ago

@codingllama @r0mant all clear on the passwordless test plan for me on macOS.

rosstimothy commented 2 years ago

etcd Soak Test

kubectl logs -n loadtest-tross soaktest-pvnlr-6gv5f -f
+ tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth -l root ls -f names
node-65c8f5c9db-5zzfd
iot-node-5b4f7757f8-f2966

----Direct Dial Node Test----
+ tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@node-65c8f5c9db-5zzfd ls

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         157 ms
50         162 ms
75         168 ms
90         174 ms
95         178 ms
99         193 ms
100        474 ms

+ tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@node-65c8f5c9db-5zzfd ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         159 ms
50         164 ms
75         170 ms
90         175 ms
95         180 ms
99         195 ms
100        5179 ms

+ tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m root@iot-node-5b4f7757f8-f2966 ls
----Reverse Tunnel Node Test----

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         155 ms
50         160 ms
75         166 ms
90         172 ms
95         178 ms
99         193 ms
100        418 ms

+ tsh --insecure --proxy=monster.gravitational.co:3080 -i /etc/teleport/auth bench --duration=30m --interactive root@iot-node-5b4f7757f8-f2966 ps aux

* Requests originated: 17999
* Requests failed: 0

Histogram

Percentile Response Duration
---------- -----------------
25         154 ms
50         159 ms
75         165 ms
90         170 ms
95         175 ms
99         192 ms
100        5171 ms

etcd 10k Reverse Tunnel Nodes

image

https://teleportcoreteam.grafana.net/goto/vJFIH33nk?orgId=1

etcd 10k Direct Dial Nodes

image

https://teleportcoreteam.grafana.net/goto/yky9Oqqnz?orgId=1

russjones commented 2 years ago

Aggregate last 3 releases.

Backend Cluster Size Mode PTY 8.0 9.0 10.0
etcd 10k Regular No 3335 ms 700 ms 474 ms
etcd 10k Regular Yes 4647 ms 393 ms 5179 ms (99%: 195 ms)
etcd 10k Tunnel No 4259 ms 143 ms 418 ms
etcd 10k Tunnel Yes 3143 ms 799 ms 5171 ms (99%: 192 ms)
DynamoDB 10k Regular No 5147 ms
DynamoDB 10k Regular Yes 222 ms
DynamoDB 10k Tunnel No 235 ms
DynamoDB 10k Tunnel Yes 198 ms
DynamoDB 1 Regular No 1824 ms
DynamoDB 1 Regular Yes 1483 ms
DynamoDB 1 Tunnel No 2125 ms
DynamoDB 1 Tunnel Yes 2002 ms
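
The 10.0 PTY rows pair a multi-second maximum with a 99th percentile under 200 ms, which suggests the 100th percentile is set by a handful of slow requests. The sketch below illustrates that effect with a nearest-rank percentile over made-up samples. It is not how `tsh bench` computes its histogram; `nearest_rank` and the sample values are hypothetical.

```python
def nearest_rank(samples_ms, pct):
    """Return the nearest-rank percentile of samples_ms for an integer pct in 1..100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical run: 999 requests at 160 ms plus a single 5000 ms outlier.
samples = [160] * 999 + [5000]

p99 = nearest_rank(samples, 99)
p100 = nearest_rank(samples, 100)
print(f"p99={p99} ms, p100={p100} ms")
```

A single slow request out of a thousand is enough to move the maximum while leaving the 99th percentile untouched, which is why the table reports both figures for the PTY runs.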

fspmarshall commented 2 years ago

500 TC Scaling Test (DynamoDB)

500-tc

Note: The initial DynamoDB 10k tests are not complete yet due to issues with the test automation, but I've gotten up to a 6k DynamoDB cluster without any issues on Teleport's end of things. Working on re-running with different automation.

fspmarshall commented 2 years ago

10K Dynamo IoT

edit: See https://github.com/gravitational/teleport/issues/13340#issuecomment-1180681544 for updated bench numbers.

tsh bench --duration=30m root@node-848df68b94-zzxjg ls

* Requests originated: 17934
* Requests failed: 109
* Last error: EOF

Histogram

Percentile Response Duration 
---------- ----------------- 
25         5939 ms           
50         9655 ms           
75         13911 ms          
90         16655 ms          
95         17519 ms          
99         18351 ms          
100        55071 ms

tsh bench --duration=30m --interactive root@node-848df68b94-zzw65 ps aux

* Requests originated: 17903
* Requests failed: 22
* Last error: failed connecting to node node-848df68b94-zzw65. 

Histogram

Percentile Response Duration 
---------- ----------------- 
25         6115 ms           
50         9879 ms           
75         14103 ms          
90         16751 ms          
95         17583 ms          
99         18431 ms          
100        45471 ms

10k-dynamo-iot

Note: These benches were run concurrently with scaling and against nodes in a different region/cloud, which I think explains the differences in response duration. Looking into it.
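
For comparison with the other runs, the failure counts above can be turned into rates. This is a small sketch using only the request and failure counts reported in this comment; `failure_rate` is a hypothetical helper name.

```python
def failure_rate(originated, failed):
    """Return the fraction of originated requests that failed."""
    return failed / originated


# (requests originated, requests failed) as reported by the two runs above.
runs = {
    "ls": (17934, 109),
    "ps aux (interactive)": (17903, 22),
}

for name, (originated, failed) in runs.items():
    print(f"{name}: {failure_rate(originated, failed):.2%} failed")
```

Both rates are below 1%, so the failures are a small share of the traffic even though latency was far higher than in the etcd runs.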

fspmarshall commented 2 years ago

10K Dynamo Non-IoT

tsh bench --duration=30m root@172.31.4.81 ls

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         185 ms            
50         197 ms            
75         211 ms            
90         232 ms            
95         251 ms            
99         358 ms            
100        2161 ms

tsh bench --duration=30m --interactive root@172.31.9.206 ps aux

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         193 ms            
50         206 ms            
75         221 ms            
90         240 ms            
95         260 ms            
99         418 ms            
100        4579 ms

10k-dynamo

Note: these benches were run against individual bare-metal nodes within a 2-node cluster with tsh located within the same vpc as the auth, proxy, and nodes.

fspmarshall commented 2 years ago

DynamoDB Small Cluster Bench

(The previously posted DynamoDB bench numbers were from a 10k cluster with sub-optimal network conditions, and are therefore not particularly useful for comparison.)

tsh bench --duration=30m root@ip-172-31-4-81-us-west-2-compute-internal ls

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         198 ms            
50         210 ms            
75         222 ms            
90         238 ms            
95         255 ms            
99         372 ms            
100        3495 ms

tsh bench --duration=30m --interactive root@ip-172-31-9-206-us-west-2-compute-internal ps aux

* Requests originated: 17998
* Requests failed: 0

Histogram

Percentile Response Duration 
---------- ----------------- 
25         221 ms            
50         231 ms            
75         244 ms            
90         262 ms            
95         280 ms            
99         466 ms            
100        2003 ms