keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

Why 20k clients is so slow #1018

Closed: pruivo closed this issue 3 weeks ago

pruivo commented 4 weeks ago

Cache usage: 2x in users cache, 3x in realms cache per client.

-> 10k clients

-> reduce RPS to a number where the DB is no longer overloaded -> reduce to 50% to not overload the database

-> Write some docs on how to size users and client cache based on the number of clients

Follow-up task: update the sizing guide for RHBK26

pruivo commented 4 weeks ago

Took a quick look inside the realm cache and it should require 4 entries per client

<client-uuid>.optional.clientscopes -> ClientScopeListQuery
<client-uuid>.default.clientscopes -> ClientScopeListQuery
realm-0.client.query.by.clientId.<client-id> -> ClientListQuery
<client-uuid> -> CachedClient
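
To make the sizing concrete, here is a rough sketch of the four keys listed above and the number of realm-cache entries needed to keep every client cached. The function names are hypothetical, the key formats are copied from the list above, and the estimate ignores every other entry type stored in the realms cache, so treat the result as a lower bound.

```python
def realm_cache_keys(realm: str, client_uuid: str, client_id: str) -> list[str]:
    """Return the four realm-cache keys observed for a single client."""
    return [
        f"{client_uuid}.optional.clientscopes",           # ClientScopeListQuery
        f"{client_uuid}.default.clientscopes",            # ClientScopeListQuery
        f"{realm}.client.query.by.clientId.{client_id}",  # ClientListQuery
        client_uuid,                                      # CachedClient
    ]


def required_entries(clients: int, entries_per_client: int) -> int:
    """Lower bound on cache entries needed to hold all clients at once."""
    return clients * entries_per_client


keys = realm_cache_keys("realm-0", "<client-uuid>", "<client-id>")
for clients in (10_000, 20_000):
    print(clients, "clients need at least",
          required_entries(clients, len(keys)), "realm-cache entries")
```

With 4 entries per client, 20k clients need at least 80,000 realm-cache entries and 10k clients need at least 40,000.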

There are other values stored in the realms cache as follows:

CachedClientRole
CachedClient
CachedGroup
CachedRealmRole
CachedCount
ClientScopeListQuery
ClientListQuery
RealmListQuery

I'm wondering why users got their cache but clients didn't 🤔

pruivo commented 4 weeks ago

Nano optimizations (food for thought)

A single query could fetch both the default and optional scopes and store them in one ClientScopeListQuery. That would save not only 1 cache entry but 1 database access too.

Clients are only cached on the second access. When the cache entry realm-0.client.query.by.clientId.<client-id> -> ClientListQuery is created, the client itself is not cached. Why not!? It is loaded from the database at that point and could be cached right away. As it stands, only the second access caches it.

The cache entry realm-0.client.query.by.clientId.<client-id> -> ClientListQuery could be removed if the client's UUID was generated from realm+client-id. It would break existing databases and can never be implemented 😢
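
The idea of deriving the UUID from realm+client-id can be illustrated with a name-based (version 5) UUID. This is only a sketch of the concept, not how Keycloak generates client IDs today; the namespace value, the separator, and the function name are all hypothetical.

```python
import uuid

# Hypothetical namespace for this illustration; any fixed UUID would do.
CLIENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def derived_client_uuid(realm: str, client_id: str) -> str:
    """Derive a stable UUID from realm and client-id (name-based, SHA-1).

    The same inputs always give the same UUID, so a lookup by client-id
    could compute the cache key directly instead of going through a
    separate client-id -> UUID query entry.
    Assumption: "/" never appears in realm names, so the joined name is
    unambiguous.
    """
    return str(uuid.uuid5(CLIENT_NAMESPACE, f"{realm}/{client_id}"))


first = derived_client_uuid("realm-0", "client-0")
second = derived_client_uuid("realm-0", "client-0")
other = derived_client_uuid("realm-0", "client-1")
print(first == second)  # same inputs, same UUID
print(first != other)   # different client-id, different UUID
```

Existing installations already store randomly generated UUIDs, which is why the comment above rules this out in practice.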

pruivo commented 3 weeks ago

Configuring Gatling to use only 10k clients

Summary: Using half of the clients brings the DB usage to ~65% and the 99th percentile response time to ~500ms. The cache hit ratio is ~75%.

Command Line

./benchmark.sh "eu-west-1" --scenario="keycloak.scenario.authentication.ClientSecret" --server-url="***" --users-per-sec=1000 --measurement=600 --realm-name=realm-0 --logout-percentage=100 --users-per-realm=20000 --clients-per-realm=10000 --ramp-up=20 --log-http-on-failure --refresh-token-count=0 --refresh-token-period=0 --sla-error-percentage=0.001

Database Usage (graph: db)

Gatling Results (graph: gatling)

Keycloak Response Times (graph: kc)

Caches Hit Ratio (graph: caches)

pruivo commented 3 weeks ago

@ahus1 @mhajas Do we want to reduce the RPS to reduce the load on the database below 50%? What is the target 99% response time that you have in mind?

pruivo commented 3 weeks ago

10k Clients and 750 RPS (25% reduction)

Summary: DB usage dropped to ~50% and the 99th percentile response time to 77ms. The cache hit ratio is ~75%.

Database Usage (graph: db)

Gatling Results (graph: gatling)

Keycloak Response Times (graph: kc)

ahus1 commented 3 weeks ago

Thanks, these numbers look good. One thing still concerns me: there is an HTTP status code 401 in the results, meaning access was denied. This is unexpected. Can you please have a look? A 401 could mean a wrong password, a client that doesn't exist, or something else. That might then lead to a very different load pattern. Thanks!