Closed dpulrichth closed 1 month ago
@ahus1 Can you think of any recent improvement that might be related to this issue?
@dpulrichth Can you confirm you don't have issues with clock synchronization? You could tweak the settings for that by allowing a certain clock skew in the realm settings.
@sschu I couldn't find a realm setting for clock skew. The only clock skew setting I know is the one related to external identity providers which is enabled and allows up to 30 seconds of clock skew.
All session related timeouts are on default values.
I suspect this problem is somehow related to the distributed infinispan caches. I used to see this problem very frequently when running just two keycloak instances. Once I scaled up to three instances I had a lot (95%+) fewer session-related errors.
I found this in the cache documentation (https://www.keycloak.org/server/caching#_configuring_caches_for_availability) (emphasis mine)
The default number of owners is enough to survive 1 node (owner) failure in a cluster setup with at least three nodes. You are free to change the number of owners accordingly to better fit into your availability requirements. To change the number of owners, open conf/cache-ispn.xml and change the value for owners= for the distributed caches to your desired value.
I interpret this to mean that a Keycloak cluster MUST have at least three nodes. Is this correct?
@dpulrichth You are right, the clock skew settings are only for external IDPs, I mixed this up.
With a distributed cache with two owners, the data will be replicated to two nodes. With two nodes in the cluster, this should still survive a node failure (after the failure the cluster is not complete until the node is back, but it should still work).
There are two things you could try in order to stay on two nodes:
a) Switch to replicated caches instead of distributed caches. Their implementation is simpler, so the chance of a cache sync problem is a bit lower.
b) Try to direct all traffic to only one node. This way a request should always see up-to-date data, because it no longer matters whether the data is synced to the other node fast enough. This doesn't fix your problem, but you can at least confirm whether the sync to the other node is the culprit.
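To make the owner arithmetic above concrete, here is a minimal Python sketch of a toy placement model. It is not Infinispan's actual consistent-hash algorithm; the function names and the node names kc-1 to kc-3 are hypothetical. Under these assumptions it only illustrates that with two owners and two nodes every node holds every entry, while with three nodes each node holds only a share of the entries and has to fetch the rest from another node.

```python
import hashlib
from typing import List


def owners_for(key: str, nodes: List[str], num_owners: int = 2) -> List[str]:
    """Pick the owner nodes for a key (toy placement, NOT Infinispan's algorithm)."""
    if not nodes:
        return []
    # Stable hash so the placement is identical on every run.
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(nodes)
    count = min(num_owners, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def surviving_copies(key: str, nodes: List[str], failed: str, num_owners: int = 2) -> List[str]:
    """Owners that still hold the entry after one node has failed."""
    return [n for n in owners_for(key, nodes, num_owners) if n != failed]


def local_fraction(node: str, nodes: List[str], keys: List[str], num_owners: int = 2) -> float:
    """Share of keys for which the given node is one of the owners."""
    owned = sum(1 for k in keys if node in owners_for(k, nodes, num_owners))
    return owned / len(keys)


two = ["kc-1", "kc-2"]            # hypothetical node names
three = ["kc-1", "kc-2", "kc-3"]
keys = [f"session-{i}" for i in range(300)]

print(local_fraction("kc-1", two, keys))    # every entry is local with two nodes
print(local_fraction("kc-1", three, keys))  # only a share is local with three nodes
```

In this model a single node failure never removes the last copy of an entry in either cluster size, which matches the statement that two nodes should still survive a failure.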
@sschu Actually I'm fine with running more than two instances so for the moment I'm sticking with three.
If the underlying problem was the cache not being synced fast enough I would actually expect there to be more problems with three instances than with two as there's a higher chance that an instance would not be up to date. In fact, I am seeing far fewer errors - I haven't encountered any in the past seven days. Granted, this was with a low number of total sessions (~2500) but previously I encountered this with as many as one in three sessions even if there was just a handful of sessions in total.
I also didn't experience any problems with just a single instance which is as close to option b) as I can currently get.
Option a) sounds interesting but I'd rather try to stick to Keycloak's defaults to avoid unintended side effects.
Long story short I'll keep an eye on the cluster in the next couple of days to see if running with three nodes does in fact solve this problem. If it does I'd appreciate any pointers as to how the number of nodes may influence cache synchronization.
@pedroigor - no, I am not aware of changes.
@dpulrichth I don't know about the internal implementation of distributed caches in Infinispan, so I can't tell you what the difference is between 2 and 3 nodes. The only thing I can tell you is that it works for us with replicated caches on two nodes. We have been running that configuration in production for years and have not seen any side effects. Quite some time ago, when we were on distributed caches, it also worked with them. If you want to get closer to finding a cause for the distributed caches, you would have to experiment with directing traffic to a single node only, as I described.
We noticed the described issue after rolling deployments of our Keycloak instances. We were first using a replicated cache for sessions and a distributed cache for clientSessions. After switching to replicated caches for both sessions and clientSessions, we were no longer able to reproduce the issue. It seems that there is a relation between sessions and clientSessions (the following discussion suggests that: https://github.com/keycloak/keycloak/discussions/12788).
If there is a relation between the two caches, we should perhaps document it in the guides. But maybe I'm totally wrong and it's just luck that we can't reproduce the error after changing our configuration.
@drohwer89 There is a 1 to n relation between sessions and clientSessions. The error "Session doesn't have required client" indicates the session was found but the client session wasn't.
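The 1 to n relation and the resulting error can be sketched as follows. This is a simplified Python model under stated assumptions, not Keycloak's implementation: the two dictionaries stand in for the sessions and clientSessions caches, and the names UserSession, RefreshError, and find_client_session are hypothetical. Only the message "Session doesn't have required client" is taken from the thread.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UserSession:
    session_id: str
    # client_id -> client session id: one user session, many client sessions.
    client_session_ids: Dict[str, str] = field(default_factory=dict)


class RefreshError(Exception):
    """Hypothetical error type for a rejected refresh request."""


def find_client_session(
    sessions: Dict[str, UserSession],
    client_sessions: Dict[str, dict],
    session_id: str,
    client_id: str,
) -> dict:
    user_session = sessions.get(session_id)
    if user_session is None:
        # Placeholder message; the thread does not quote this error.
        raise RefreshError("user session not found")
    cs_id = user_session.client_session_ids.get(client_id)
    client_session = client_sessions.get(cs_id) if cs_id is not None else None
    if client_session is None:
        # The user session exists, but its client session is missing from
        # the second cache, e.g. because the two caches are out of sync.
        raise RefreshError("Session doesn't have required client")
    return client_session


# The user session references a client session that the second cache lost.
sessions = {"s1": UserSession("s1", {"my-app": "cs1"})}
client_sessions: Dict[str, dict] = {}

try:
    find_client_session(sessions, client_sessions, "s1", "my-app")
except RefreshError as exc:
    print(exc)
```

Because the lookup spans two caches, keeping one replicated and the other distributed means the two halves of the relation can become visible on a node at different times.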
@dpulrichth It would also be good if you added the info about your timeouts, especially SSO Session Idle timeout, SSO Session Max, Access token lifespan, client session timeouts (if you override them), as well as any timeouts overridden at client level.
@mposolda All timeouts are Keycloak default values. I have not changed them in the realm settings or overridden them for particular clients. Current values are:
SSO Session Idle: 30 minutes
SSO Session Max: 10 hours
Access Token Timeout: 5 minutes
Client Session Idle / Max: blank, should default to standard SSO Session Idle (30 minutes)
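The fallback described for the blank client session setting can be written down as a small sketch. This is only an illustration of the values listed above, assuming that a blank or zero client value means "not overridden"; the function and constant names are hypothetical and do not come from Keycloak.

```python
from typing import Optional

# Values as reported in this thread, converted to seconds.
SSO_SESSION_IDLE_S = 30 * 60
SSO_SESSION_MAX_S = 10 * 60 * 60
ACCESS_TOKEN_TIMEOUT_S = 5 * 60


def effective_client_idle(sso_idle_s: int, client_idle_s: Optional[int]) -> int:
    """Return the client session idle timeout, falling back to the SSO value.

    Assumption of this sketch: None (blank) or 0 means the client-level
    setting is not overridden.
    """
    if not client_idle_s:
        return sso_idle_s
    return client_idle_s


print(effective_client_idle(SSO_SESSION_IDLE_S, None))
```

With these values an access token expires long before the session idle timeout, so a client has to redeem a refresh token several times within one idle window.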
Thanks @dpulrichth, disabling the "Revoke Refresh Token" option helped us. We hope this gets fixed soon.
This issue should be resolved with the work we did for Keycloak 25 to clean up the code. Keycloak should also work better once you enable persistent sessions which are a preview feature in KC25, and which are fully supported and enabled by default in KC26.
Thanks for reporting this issue, but there is insufficient information or lack of steps to reproduce.
Please provide additional details, otherwise this issue will be automatically closed within 14 days.
Due to lack of updates in the last 14 days this issue will be automatically closed.
Area
oidc
Describe the bug
We're running Keycloak 22 in a cluster with two nodes. We're experiencing intermittent failures when redeeming valid refresh tokens with Keycloak responding with HTTP Status Code 400:
or
The latter error is also visible within the admin console which shows sessions without a client:
We found that we can work around this issue by disabling the "Revoke Refresh Token" setting for the affected realm. When we have this setting enabled, however, the issue appears frequently.
Version
22.0.3
Expected behavior
Refresh tokens can be used. Sessions without clients shouldn't exist (I suppose; I'm not sure what purpose a session without a client could serve).
Actual behavior
Using refresh tokens fails intermittently. Sessions without clients are found in Keycloak.
How to Reproduce?
Unfortunately I cannot provide more details on how to reproduce this issue. However, it does happen quite frequently and we can see it happening by checking our logs. On bad days it happens in about one third of all attempts to redeem refresh tokens.
Anything else?
This appears to have been around for a while. We found an issue (https://keycloak.discourse.group/t/session-lost-reference-to-the-client/15122, currently unavailable, but available in cache https://webcache.googleusercontent.com/search?q=cache:xumrWxrHm7wJ:https://keycloak.discourse.group/t/session-lost-reference-to-the-client/15122&cd=9&hl=de&ct=clnk&gl=de) which describes clients not being found in sessions as early as Keycloak 15. One user fixed this by implementing a fallback that looks up missing sessions in the database if they are not found in the cache. Perhaps the bug is related to sessions being evicted from the cache too early or caches not being in sync in a clustered setup.
There was a similar issue with offline sessions, which was also fixed by implementing a fallback to database, https://github.com/keycloak/keycloak/issues/21402
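The database fallback mentioned in both reports follows a common pattern, sketched here in Python. This is not the code from either fix: the dictionary cache, the load_from_db callback, and the function name lookup_with_fallback are hypothetical stand-ins used only to show the idea.

```python
from typing import Callable, Dict, Optional


def lookup_with_fallback(
    cache: Dict[str, dict],
    load_from_db: Callable[[str], Optional[dict]],
    session_id: str,
) -> Optional[dict]:
    """Look up a session in the cache, falling back to the database on a miss."""
    session = cache.get(session_id)
    if session is not None:
        return session
    # Cache miss: the entry may have been evicted or not yet synced.
    session = load_from_db(session_id)
    if session is not None:
        # Repopulate the cache so the next lookup is served locally.
        cache[session_id] = session
    return session


# Hypothetical in-memory "database" for demonstration purposes.
database = {"s1": {"user": "alice"}}
cache: Dict[str, dict] = {}

print(lookup_with_fallback(cache, database.get, "s1"))
print(lookup_with_fallback(cache, database.get, "unknown"))
```

A fallback like this hides a missing cache entry from the client, but it does not explain why the entry was missing in the first place.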