gravitational / teleport


Major reduction in performance upgrading to 2.7.4 using AWS cluster #2236

Closed aashley closed 5 years ago

aashley commented 5 years ago

What happened: Upgraded our production cluster from 2.6.7 to 2.7.4 and are seeing a major reduction in performance of the Auth component, reducing the entire system to unusable. Our test cluster didn't see issues.

The cluster is a main cluster with ~250 single-node trusted clusters. The cloud infrastructure is based on the example Terraform scripts from Teleport: DynamoDB storage, S3 audit logs, Auth nodes behind a network LB, Proxy nodes behind a network LB.

The same cluster on 2.6.4 handled the load with no issue; the upgrade to 2.7.4 has brought the system to its knees, and logins to all nodes time out.

What you expected to happen: System to work as before.

How to reproduce it (as minimally and precisely as possible): Hard to say; the same upgrade on a smaller test cluster with 5 trusted clusters worked fine with no issues.

Environment:

Browser environment

Relevant Debug Logs If Applicable

klizhentas commented 5 years ago

What do you observe?

Can you post CPU / disk IO / RAM output from the auth server? Do you see anything in the logs? Do you see any rate limiting on the DynamoDB side? (Check the CloudWatch metrics.)
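
For reference, a rough sketch of how one might capture those numbers on an auth node (assumes a systemd unit named teleport and the sysstat package for iostat; adjust to your environment):

# CPU and memory snapshot
top -b -n 1 | head -n 20
free -m

# disk IO, sampled three times at 5-second intervals
iostat -x 5 3

# recent Teleport logs (assumes the service runs under systemd as "teleport")
journalctl -u teleport --since "1 hour ago" | tail -n 200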

A 250-node cluster should work fine without any notable differences between 2.6 and 2.7; if anything, we made 2.7 faster, so this is unusual.

klizhentas commented 5 years ago

Ah, it's 250 trusted clusters. My first bet is that the CA rotation heartbeats are putting load on DynamoDB because they added polling; can you check there?
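
As a hedged sketch with the AWS CLI, DynamoDB throttling can be checked in CloudWatch roughly like this (the table name teleport-backend and the region are placeholders; use the table from your storage configuration):

# sum of read throttle events on the backend table over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-backend \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum \
  --region us-west-2

The same query with --metric-name WriteThrottleEvents covers the write side.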

klizhentas commented 5 years ago

Anyways, I can help you troubleshoot the problem; if you want, I can jump on a chat/call with you tomorrow. Meanwhile, if it's production, you probably need to downgrade now.

Just send me an email to sasha@gravitational.com re this ticket and we can schedule some time.

aashley commented 5 years ago

At the worst we had 8 m5.xlarge nodes running the auth cluster with 20,000 units of provisioned read capacity on the DynamoDB table, and we were still getting throttling on the DynamoDB requests. In fact, it seemed the more DynamoDB capacity we provisioned, the more throttled requests we got. See https://i.imgur.com/FNqjO4I.png
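
For reference, the capacity actually provisioned on the table can be read back with something like the following (a sketch; the table name is a placeholder):

aws dynamodb describe-table \
  --table-name teleport-backend \
  --query 'Table.ProvisionedThroughput'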

CPU- and memory-wise we had the Auth servers running at 100% CPU and about 60% memory usage. Auth server CPU: https://i.imgur.com/IbHRF1r.png The way the cluster is set up there is zero disk IO, and the network IO averaged about 120 Mbps. The proxy servers were at about 70% utilisation with no disk IO and similar network usage.

A rollback has been done for the absolutely critical services and the system has been split in two, so I still have a 2.7.4 cluster exhibiting similar issues, just not at the magnitude of the original problem.

aashley commented 5 years ago

Oh also, on here is probably best; we're based in Perth, Western Australia, so it's just on 9am here.

klizhentas commented 5 years ago

OK, thanks for the info. I will try to reproduce this week and get back to you with my findings. Meanwhile, can you give me all the specs and steps to reproduce this, so I can try it and see what's going on?

BTW, if you have time, you can activate the --diag-addr endpoint and collect some metric dumps for me to see what's bothering the auth server so much.

klizhentas commented 5 years ago

To get debug CPU and RAM profiles for me:

teleport start -d --diag-addr=127.0.0.1:6060
curl -o cpu.profile http://127.0.0.1:6060/debug/pprof/profile
curl -o heap.profile http://127.0.0.1:6060/debug/pprof/heap
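
To look at those dumps locally, the standard Go tooling can read the same endpoints directly (a sketch; assumes a Go toolchain on the machine doing the analysis):

# interactive view of a 30-second CPU profile taken from the diagnostic endpoint
go tool pprof http://127.0.0.1:6060/debug/pprof/profile

# current heap allocations
go tool pprof http://127.0.0.1:6060/debug/pprof/heap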

aashley commented 5 years ago

Auth: m5.xlarge
Proxy: m5.large

The setup is as per the example Terraform scripts for an AWS cluster, modified to install the OSS version and not to run any nodes or the Grafana/InfluxDB server. The cloud instance is just for user auth and providing the trusted endpoint. Each of the remote clusters is an all-in-one node with the following config:

teleport:
  auth_servers:
  - 127.0.0.1:3025
  data_dir: "/var/lib/teleport"
auth_service:
  enabled: 'yes'
  cluster_name: od-server-00296
  listen_addr: 0.0.0.0:3025
  session_recording: proxy
  tokens:
  - proxy,node:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ssh_service:
  enabled: 'yes'
  labels:
    role: proxy
    env: dev
  commands:
  - name: hostname
    command:
    - "/bin/hostname"
    period: 1m0s
  - name: arch
    command:
    - "/bin/uname"
    - "-p"
    period: 1h0m0s
proxy_service:
  enabled: 'yes'
  https_key_file: "/etc/ssl/private/od-server-00296.key"
  https_cert_file: "/etc/ssl/certs/od-server-00296.pem"
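
Once a remote cluster with a config like the above is up, a quick sanity check is possible from the main cluster side (a sketch; assumes tsh and tctl access to the main cluster):

# list the trusted clusters the main cluster can currently see
tsh clusters

# confirm the main auth server itself is healthy and reachable
tctl status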

Process:

I'll see about getting those files directly. We do have the diag running and pushing to our InfluxDB cluster permanently; you can view a snapshot of the last 24 hours at https://snapshot.raintank.io/dashboard/snapshot/Q2Q1fvvuoFVeE3Py71EVo8fLF79qre4x The data was a bit intermittent at the worst of the issue.

klizhentas commented 5 years ago

I was able to reproduce this and landed a couple of patches in 2.7, but work is still in progress. I have a couple of ideas to test in the next couple of days.

klizhentas commented 5 years ago

I have landed several patches that improve performance in the scenario mentioned above. Due to the nature of the patches, they will be available after the 3.0 release, in this branch:

https://github.com/gravitational/teleport/pull/2243

Some notes:

I recommend you try this branch in a dev cluster and communicate the results back to me.
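
One possible way to build and try the branch in a dev cluster (a sketch only; assumes a Go toolchain with a GOPATH-style checkout, and the exact make target may vary between Teleport versions):

# fetch the PR branch into a local checkout
git clone https://github.com/gravitational/teleport.git
cd teleport
git fetch origin pull/2243/head:perf-2243
git checkout perf-2243

# build teleport, tctl and tsh from source; binaries land in ./build
make full
./build/teleport version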

aashley commented 5 years ago

Yeah, I noticed 2.6 had the same problem when we rolled back; it was getting slower with each extra node added, and then the massive reconnect of all the remote clusters after the upgrade slammed it.

Just to confirm: to see the improvement, would I only need to upgrade the main cluster, not all the remote ones? Or do I need to upgrade all the remote ones before I see the improvement?

klizhentas commented 5 years ago

I only tested the scenario where both the nodes and the cluster were on 3.0+ with my branch, so I'm not sure. I will leave the rest for you to verify on your own.

aashley commented 5 years ago

Finally had time to test this myself, looks good so far.