gravitational / teleport

Sluggish performance & timeouts with RBAC for Sessions enabled #12498

Closed deusxanima closed 2 years ago

deusxanima commented 2 years ago

Expected behavior: Using RBAC for sessions should not cause the Session Recordings page to display an endless loading spinner or take 30-60 seconds to pull up a session recording on an otherwise low-traffic cluster.

Current behavior: If a role is configured with RBAC for sessions, navigating to or refreshing the Session Recordings page at times causes the page to display an endless loading spinner. This behavior is observed even on a single-user, single-recording cluster, where the Session Recordings page may take 30-60 seconds to pull up the session recording. The provisioned DynamoDB read/write limits are far above the load observed during this time. Logs on the auth/proxy server also appear to lag during this period and are not updated until the session page has fully loaded.
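
For context, "RBAC for sessions" here refers to a role allow rule on the session resource restricted with a where conditional, as in the roles attached in the comments below. A minimal sketch of such a rule (the role name is illustrative):

kind: role
metadata:
  name: session-rbac-example
spec:
  allow:
    rules:
    - resources:
      - session
      verbs:
      - list
      - read
      # Limit results to sessions the user participated in.
      where: contains(session.participants, user.metadata.name)
  deny: {}
version: v5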

Bug details:

  1. Log in as the user and create a brief session by SSH-ing into a node and exiting after issuing a sample command (e.g. echo test).
  2. Navigate to the Session Recordings page and observe the loading spinner displayed for an inordinately long time, even with no other activity or user-specific sessions on the cluster. Note: you may have to refresh a few times to hit it, because the behavior is not consistently reproducible.
    • Debug logs
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] ACS: https://proxy.example.com:443/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] SSO: https://portal.sso.us-east-2.amazonaws.com/saml/assertion/Nzc5Mzg4NzQ2ODMzX2lucy0yZTk3MTAzOWFjYmYxYmY4 services/saml.go:98
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] Issuer: https://portal.sso.us-east-2.amazonaws.com/saml/assertion/Nzc5Mzg4NzQ2ODMzX2lucy0yZTk3MTAzOWFjYmYxYmY4 services/saml.go:99
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] ACS: https://proxy.example.com/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] SSO: https://sso.jumpcloud.com/saml2/teleport services/saml.go:98
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] Issuer: JumpCloud services/saml.go:99
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] ACS: https://proxy.example.com/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] SSO: https://user.onelogin.com/trust/saml2/http-redirect/sso/43480663-1d59-4a1f-8146-803b1e097797 services/saml.go:98
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] Issuer: https://app.onelogin.com/saml/metadata/43480663-1d59-4a1f-8146-803b1e097797 services/saml.go:99
      May 06 15:28:53 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:53-04:00 DEBU             [SAML] ACS: https://rc.ateleport.ga:443/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] Successfully fetched entity descriptor from "https://user.okta.com/app/xxx/sso/saml/metadata" services/saml.go:62
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] SSO: https://user.okta.com/app/user_teleport/xxx/sso/saml services/saml.go:98
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] Issuer: http://www.okta.com/xxx services/saml.go:99
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] ACS: https://proxy.example.com:443/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] SSO: https://keycloak.ateleport.ga/auth/realms/master/protocol/saml services/saml.go:98
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] Issuer: https://rc.ateleport.ga/web/login services/saml.go:99
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU             [SAML] ACS: https://proxy.example.com/v1/webapi/saml/acs services/saml.go:100
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU [AUTH:1]    Server certificate cert(cc.ip-172-31-72-82-ec2-internal issued by ip-172-31-72-82-ec2-internal:1234). auth/middleware.go:313
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU [AUTH:1]    Server certificate cert(cc.ip-172-31-72-82-ec2-internal issued by ip-172-31-72-82-ec2-internal:1234). auth/middleware.go:313
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU [AUDIT]     SearchSessionEvents(2022-05-06 04:00:00 +0000 UTC, 2022-05-07 03:59:59.999 +0000 UTC, 5000) events/auditlog.go:1042
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU [RBAC]      Check access to role(tmp) rc(ip-172-31-18-91-us-east-2-compute-internal, labels=map[type:tc]) matchLabels=true, msg=matched, err=<nil> allow=map[*:[*]] rcLabels=map[type:tc] services/role.go:1358
      May 06 15:28:54 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:54-04:00 DEBU [RBAC]      Check access to role(Proxy) rc(ip-172-31-18-91-us-east-2-compute-internal, labels=map[type:tc]) matchLabels=true, msg=matched, err=<nil> allow=map[*:[*]] rcLabels=map[type:tc] services/role.go:1358
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [SSH:PROXY] Incoming connection 99.1.1.1:35118 -> 172.1.1.1:443 version: SSH-2.0-Go, certtype: "host" sshutils/server.go:466
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Transport request: teleport-transport. reversetunnel/srv.go:607
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Received out-of-band proxy transport request for @remote-auth-server []. reversetunnel/transport.go:207
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Attempting to dial through tunnel with server ID "". reversetunnel/transport.go:375
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Attempting to dial directly [172.1.1.1:3025]. reversetunnel/transport.go:392
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Returning direct dialed connection to [172.1.1.1:3025]. reversetunnel/transport.go:398
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [PROXY:1]   Successfully dialed to @remote-auth-server "", start proxying. reversetunnel/transport.go:311
      May 06 15:28:56 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:56-04:00 DEBU [AUTH:1]    Server certificate cert(cc.ip-172-31-72-82-ec2-internal issued by ip-172-31-72-82-ec2-internal:1234). auth/middleware.go:313
      May 06 15:28:59 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:28:59-04:00 DEBU [RBAC]      Check access to role(Proxy) rc(ip-172-31-18-91-us-east-2-compute-internal, labels=map[type:tc]) matchLabels=true, msg=matched, err=<nil> allow=map[*:[*]] rcLabels=map[type:tc] services/role.go:1358
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] SSO: https://user.okta.com/app/user_teleport/xxx/sso/saml services/saml.go:98
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] Issuer: http://www.okta.com/xxx services/saml.go:99
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] ACS: https://proxy.example.com:443/v1/webapi/saml/acs services/saml.go:100
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] SSO: https://portal.sso.us-east-2.amazonaws.com/saml/assertion/xxx services/saml.go:98
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] Issuer: https://portal.sso.us-east-2.amazonaws.com/saml/assertion/xxx services/saml.go:99
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] ACS: https://proxy.example.com/v1/webapi/saml/acs services/saml.go:100
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] SSO: https://sso.jumpcloud.com/saml2/teleport services/saml.go:98
      May 06 15:29:01 rc-node-1 /usr/local/bin/teleport[17247]: 2022-05-06T15:29:01-04:00 DEBU             [SAML] Issuer: JumpCloud services/saml.go:99

gz#5050

deusxanima commented 2 years ago

Responding to @russjones's comments from the weekly prod sync: both the customer's tests and mine were done on self-hosted clusters. No DynamoDB throttling was observed in either case (in my cluster specifically, I ran the test against a single node with only one user, myself, no other traffic, and 20 read/write units provisioned in DynamoDB). This is the RBAC role I tested with:

kind: role
metadata:
  labels:
    description: this is a test
  name: tmp
spec:
  allow:
    app_labels:
      '*': '*'
    cluster_labels:
      '*': '*'
    db_labels:
      '*': '*'
    db_names:
    - '*'
    db_users:
    - '*'
    kubernetes_groups:
    - read
    - '{{internal.kubernetes_groups}}'
    kubernetes_labels:
      '*': '*'
    logins:
    - '{{internal.logins}}'
    - ubuntu
    node_labels:
      env: db
    rules:
    - resources:
      - access_request
      verbs:
      - list
      - read
      - update
      - delete
      - create
    - resources:
      - applications
      verbs:
      - list
      - read
      - update
      - delete
    - resources:
      - role
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - auth_connector
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - session
      verbs:
      - list
      - read
      where: contains(session.participants, user.metadata.name)
    - resources:
      - trusted_cluster
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - event
      verbs:
      - list
      - read
    - resources:
      - user
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - token
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - remote_cluster
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - node
      verbs:
      - list
      - create
      - read
      - update
      - delete
    - resources:
      - kube_service
      verbs:
      - delete
    windows_desktop_labels:
      '*': '*'
    windows_desktop_logins:
    - Administrator
    - '{{internal.windows_logins}}'
  deny: {}
  options:
    cert_format: standard
    client_idle_timeout: 1h0m0s
    desktop_clipboard: true
    enhanced_recording:
    - command
    - network
    forward_agent: true
    max_session_ttl: 8h0m0s
    port_forwarding: true
    record_session:
      desktop: true
version: v5

I'll gather the role file from the customer's environment as well. Is there any additional info that would be helpful in this instance?

deusxanima commented 2 years ago

Customer role:

kind: role
metadata:
  name: teleport-view
spec:
  allow:
    rules:
      - resources:
          - access_request
          - event
          - role
          - user
        verbs:
          - list
          - read
      - resources:
          - session
        verbs:
          - list
      - resources:
          - session
        verbs:
          - read
        where: contains(session.participants, user.metadata.name)
  deny: {}
  options:
    max_session_ttl: 12h0m0s
version: v4

The customer confirmed that removing the where conditional clears the issue; the same holds in my cluster.
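
For reference, a sketch of the customer role with the where conditional removed (the workaround confirmed above). Note the trade-off: without the conditional, any holder of the role can list and view all session recordings, not just their own.

kind: role
metadata:
  name: teleport-view
spec:
  allow:
    rules:
      - resources:
          - access_request
          - event
          - role
          - user
        verbs:
          - list
          - read
      # Session rules collapsed into a single unconditional rule; the
      # where: contains(session.participants, user.metadata.name)
      # conditional has been dropped.
      - resources:
          - session
        verbs:
          - list
          - read
  deny: {}
  options:
    max_session_ttl: 12h0m0s
version: v4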

zmb3 commented 2 years ago

@Aharic this should be fixed now; can you confirm whether the issue is still present?

deusxanima commented 2 years ago

Confirmed fixed in my env. Closing out.