ory / hydra

The most scalable and customizable OpenID Certified™ OpenID Connect and OAuth Provider on the market. Become an OpenID Connect and OAuth2 Provider overnight. Broad support for related RFCs. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=hydra
Apache License 2.0
15.58k stars 1.49k forks

Refresh tokens need a grace period to deal with network errors and similar issues #1831

Closed NavindrenBaskaran closed 1 week ago

NavindrenBaskaran commented 4 years ago

So basically, on Hydra, if a refresh token has already been used to request a new set of tokens, that refresh token is invalidated and can't be used again. It's one-time use only.

In the event where the tokens were actually issued by Hydra but the response was not delivered because of a network issue, the users are generally stuck. Is there a way for us to not invalidate the refresh token immediately and instead have a refresh token grace period that allows a duplicate refresh request with the same refresh token?

aeneasr commented 4 years ago

In the event where the tokens were actually issued by Hydra but the response was not delivered because of a network issue, the users are generally stuck. Is there a way for us to not invalidate the refresh token immediately and instead have a refresh token grace period that allows a duplicate refresh request with the same refresh token?

This is currently not possible; the client would need to talk to /oauth2/auth again. A response getting lost in the network can of course happen, but it is very unlikely and does not justify issuing multiple tokens, as that would not only risk exploding the number of tokens issued but also enable replay attacks.

imranismail commented 4 years ago

@aeneasr does it help to mitigate the replay attack if we set a low default TTL before the old token becomes completely unusable?

And to avoid exploding the number of tokens issued, we could return the same token that was lost during the network error.

Please, just correct me if I'm wrong.

aeneasr commented 4 years ago

@aeneasr does it help to mitigate the replay attack if we set a low default TTL before the old token becomes completely unusable?

Lowering the TTL of things is typically a good guard against replayability, but of course it doesn't resolve the problem completely :)

And to avoid exploding the number of tokens issued, we could return the same token that was lost during the network error.

While this is a valid strategy that is also supported by some servers, it does not work in ORY Hydra. We do not store the full access tokens in the database; instead we generate them from a strong /dev/urandom reader. What we store in the database is their HMAC-SHA256 hash, which we can use to look up a token and check whether it is still valid.

This design prevents anyone with database access from issuing access and refresh tokens without also having access to the ORY Hydra secrets.system value.
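To make that concrete, here is a minimal sketch of the general scheme, not Hydra's actual implementation: a random opaque token is handed to the client, and only its HMAC-SHA256 signature (keyed with something like the secrets.system value, represented here by a stand-in systemSecret) is persisted.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newToken returns a random opaque token for the client and the HMAC-SHA256
// signature that would be stored in the database instead of the token itself.
// systemSecret is a stand-in for Hydra's system secret.
func newToken(systemSecret []byte) (token, signature string, err error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil { // strong random source
		return "", "", err
	}
	token = base64.RawURLEncoding.EncodeToString(raw)

	mac := hmac.New(sha256.New, systemSecret)
	mac.Write([]byte(token))
	signature = base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return token, signature, nil
}

func main() {
	token, sig, err := newToken([]byte("example-system-secret"))
	if err != nil {
		panic(err)
	}
	fmt.Println("issued to client:", token)
	fmt.Println("stored in DB:    ", sig)
}
```

Looking up a presented token then means recomputing the HMAC and comparing it with the stored signature, so a database dump alone is not enough to mint or replay valid tokens.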

aeneasr commented 4 years ago

https://github.com/ory/hydra/issues/1928#issuecomment-678119025

github-actions[bot] commented 3 years ago

I am marking this issue as stale as it has not received any engagement from the community or maintainers in over half a year. That does not imply that the issue has no merit!

We are cleaning up issues every now and then, primarily to keep the 4000+ issues in our backlog in check and to prevent maintainer burnout. Burnout in open source maintainership is a widespread and serious issue. It can lead to severe personal and health issues as well as open up catastrophic attack vectors.

Thank you for your understanding and to anyone who participated in the issue! 🙏✌️

If you feel strongly about this issue and have ideas on resolving it, please comment. Otherwise it will be closed in 30 days!

aeneasr commented 3 years ago

Closed in error

seanhoughton commented 3 years ago

This issue also prevents Ory Hydra from being used in a clustered environment without added complexity, as mentioned in this issue: https://github.com/ory/fosite/issues/255

also see: https://datatracker.ietf.org/doc/html/rfc6819#section-5.2.2.3

Note: This measure may cause problems in clustered environments, since usage of the currently valid refresh token must be ensured. In such an environment, other measures might be more appropriate.

If two concurrent processes both attempt to refresh the token simultaneously, one will always fail. This means either adding a global mutex to prevent concurrent refresh requests or adding retries. The mutex solution means more ops and hurts scalability, while the retry solution adds complexity to the client code and creates a bunch of false-positive auth errors in the logs.

A grace period would allow some hysteresis; it would just need to last long enough to cover the "read token, check if expired, refresh token" code path.

Auth0 and Okta solve this with a per-client configurable option to disable refresh token rotation entirely. An option to add a grace period would be even better.
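For illustration, a sketch of the in-process variant of that mutex workaround using golang.org/x/oauth2; it assumes a single process that already holds a token containing a refresh token, and it does nothing for the multi-process case described above.

```go
package tokenclient

import (
	"context"
	"sync"

	"golang.org/x/oauth2"
)

// refresher serializes refresh-token exchanges inside one process so that
// concurrent goroutines never race each other with the same single-use
// refresh token. It does not help when several processes share one token.
type refresher struct {
	mu    sync.Mutex
	conf  *oauth2.Config
	token *oauth2.Token // must initially contain a refresh token
}

// Token returns a valid access token, performing at most one refresh at a time.
func (r *refresher) Token(ctx context.Context) (*oauth2.Token, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Another goroutine may have refreshed while we were waiting for the lock.
	if r.token.Valid() {
		return r.token, nil
	}

	// TokenSource performs the refresh_token grant when the token is expired.
	fresh, err := r.conf.TokenSource(ctx, r.token).Token()
	if err != nil {
		return nil, err
	}
	r.token = fresh
	return fresh, nil
}
```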

mitar commented 2 years ago

Just to add another perspective here: maybe instead of a grace period, idempotency keys are what should be implemented. That is not covered by any OpenID Connect standard (though there is a standardization draft for the header itself), but it could be an opt-in extension: if you use the idempotency key header, you enable this behavior. An interesting property of idempotency keys is that they work at the API layer and would not require changes to the underlying Fosite/Hydra storage.
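A rough sketch of what such an API-layer extension could look like, purely hypothetical and not an existing Hydra or Fosite feature: an HTTP middleware that records the first response produced for a given Idempotency-Key header and replays it for retries.

```go
package idempotency

import (
	"net/http"
	"net/http/httptest"
	"sync"
)

// middleware records the first response produced for a given Idempotency-Key
// header and replays it for any retry carrying the same key. In-memory only;
// a real implementation would need expiry, persistence, and a check that the
// retried request actually matches the original.
func middleware(next http.Handler) http.Handler {
	var mu sync.Mutex
	cache := map[string]*httptest.ResponseRecorder{}

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			next.ServeHTTP(w, r) // no key: behave exactly as today
			return
		}

		mu.Lock()
		defer mu.Unlock()

		rec, seen := cache[key]
		if !seen {
			// First request with this key: run the real handler and record it.
			rec = httptest.NewRecorder()
			next.ServeHTTP(rec, r)
			cache[key] = rec
		}

		// First and repeated requests alike receive the recorded response.
		for name, values := range rec.Header() {
			for _, v := range values {
				w.Header().Add(name, v)
			}
		}
		w.WriteHeader(rec.Code)
		w.Write(rec.Body.Bytes())
	})
}
```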

aeneasr commented 2 years ago

Thank you for the idea - the concept looks really interesting. I'm not sure if it addresses the issue we have here though. The proposed standard prevents resending the same payload twice, as the idempotency key will only allow one request to pass.

We do have an idempotency key already! It's the refresh token! If you re-use the refresh token, the request fails.

The problem described here is that we want to be able to reuse the refresh token in certain scenarios. This might be the case in distributed applications, where it is difficult to get shared state due to eventual consistency. Another example is a race condition where we send the same refresh token (idempotency key) twice, which ends up rejecting the second request and, following OAuth2 best practices, invalidating the refresh tokens.

Hope this makes sense!
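To make the grace-period idea concrete, the decision it boils down to could look roughly like this (a hypothetical helper, not Hydra's or fosite's actual API):

```go
package rotation

import "time"

// refreshAllowed decides whether a refresh token may still be exchanged.
// firstUse is true if the token has never been rotated; usedAt is when it
// was first exchanged. Hypothetical helper for illustration only.
func refreshAllowed(firstUse bool, usedAt time.Time, grace time.Duration, now time.Time) bool {
	if firstUse {
		return true
	}
	// Already rotated: tolerate re-use only inside the grace window, so a
	// client whose response was lost on the network can retry safely.
	return grace > 0 && now.Sub(usedAt) <= grace
}
```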

yangm97 commented 2 years ago

It seems like Google expects the old refresh token to keep working until "they have sent a request using the new token".

From Google Smart Home Test Suite Refresh Token Validation Test (emphasis mine):

Error: Refresh token has been rotated. This is not forbidden, however we do not see much benefit that rotating the refresh token can provide but the potential problem it has. We also tried to refresh token with the old refresh token after it has been rotated. Refresh didn't work, this means partner invalidated the old refresh token right after the rotation. Partners shall only invalidate the old refresh token after seeing we use the new one to ensure we got it successfully.

After this test fails, the account link breaks. I can only assume Google threw away the new refresh token because of the test failure and that it would use the new token under normal circumstances. More on account linking.

Google's OAuth2 implementation allows up to 50 refresh tokens to be valid concurrently. That's a lot of tokens, and I'm not sure how they implement token reuse detection, but I'm afraid that having more than one valid refresh token at any given time may be required to avoid account link breakage.

aeneasr commented 2 years ago

Google expects the old refresh token to keep working until they have sent a request using the new token.

That only works if your infrastructure uses the introspection endpoint at Ory Hydra. That will often not be the case, especially when people use the JWT strategy. I think the grace period should sufficiently deal with this problem :)

mitar commented 2 years ago

But I think they have a point. I think the best combination would be:

  • A grace period during which the old refresh token keeps working, to cover lost responses.
  • If client uses the access token and sends it to you and you use introspection endpoint before one hour, you can safely invalidate the refresh token early.

So the grace period works for the worst case, while for the one common case (when you do use the introspection endpoint) we can increase security by invalidating early. I do not think we lose anything by adding logic to the introspection endpoint to invalidate old refresh tokens when it sees a call associated with the new one.
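A sketch of that early-invalidation hook; the types and method names here are invented for illustration and do not exist in Hydra or fosite:

```go
package introspection

import "context"

// Store and AccessToken are hypothetical stand-ins for whatever storage and
// token model the server uses; Hydra's and fosite's real types differ.
type Store interface {
	RevokeRefreshToken(ctx context.Context, signature string) error
}

type AccessToken struct {
	Signature                   string
	RotatedFromRefreshSignature string // empty if not issued via the refresh grant
}

// onIntrospect implements the early-invalidation idea: once the new access
// token shows up at the introspection endpoint, the client has evidently
// received the rotation response, so the old refresh token can be revoked
// without waiting for the grace period to expire.
func onIntrospect(ctx context.Context, store Store, token AccessToken) error {
	if token.RotatedFromRefreshSignature == "" {
		return nil // not the result of a rotation, nothing to do
	}
	return store.RevokeRefreshToken(ctx, token.RotatedFromRefreshSignature)
}
```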

mitar commented 2 years ago

The proposed standard prevents resending the same payload twice, as the idempotency key will only allow one request to pass.

I think it is the opposite: it allows multiple requests to receive the same successful response, but the request is acted upon only once; every other time it just returns the recorded response of the first successful request.

We do have an idempotency key already! It's the refresh token! If you re-use the refresh token, the request fails.

This is exactly the opposite of what idempotency keys provide. The behavior should be: the first request with a given key is processed normally and its response is recorded; any later request carrying the same key gets that recorded response back instead of an error.

yangm97 commented 2 years ago
  • If client uses the access token and sends it to you and you use introspection endpoint before one hour, you can safely invalidate the refresh token early.

I think that could be a problem for clients running on "eventually consistent" databases. For instance, the Google Identity documentation specifically states:

Immediate and fully consistent shared state both within, and between, your and Google's token handling systems cannot be guaranteed. [...]

  • Support multiple, concurrently valid access and refresh tokens. [...]

So the Google Smart Home Test Suite Refresh Token Validation Test recommendation of invalidating the old token as soon as you notice a request using the new token is actually in contradiction with this 🤷‍♂️

Speaking of that test, I can see that the refresh token grace period feature from #2827 makes it pass, which is awesome. But still, if token reuse detection does kick in for whatever reason and there is no alternative refresh token, I suppose the issue with the account link breaking would still occur.

Anyway, I believe I'm drifting further away from the subject of this issue, so I've opened a discussion if anyone is interested in continuing this conversation: https://github.com/ory/hydra/discussions/2946

tommyasai commented 2 years ago

@yangm97 May I ask how you passed the Account Linking test in Google Actions Test Suite? I saw your comment here and I am in the same situation. https://github.com/ory/hydra/issues/1831#issuecomment-990001973

In my understanding, we need to wait for #2827 to be released, but do you have any workaround in mind?

yangm97 commented 2 years ago

@tommyasai I built a docker image from the source code of #2827 (HEAD was e992907a205ee89898ed9d500d212777a278159f at that time).

If you just need to test something quickly and don't mind running a stale version of that PR (nor running binaries from strangers on the internet 😅), then sure, I could push that docker image somewhere public.

If you intend to run this in production and are not comfortable enough with the codebase to compile a version yourself, I would advise against running off-tree. As a general rule, you're almost guaranteed to get undocumented breaking changes, which may be as easy to fix as running the down migration(s) from previous iterations and then applying the newer ones, or, in the worst case, you might need to patch your way out of a situation nobody else has been in before. Scary.

PS: you can also find me on Telegram, same username as here. While it is good to know people in the same boat as us, I think we should avoid hijacking the thread heh

tommyasai commented 2 years ago

@yangm97 I appreciate your comment. I need to pass the test quickly to get the release approval, so if possible I'd like you to push the image. 🙏

We still have some time until the actual release to users, so as a workaround for production we are also thinking about implementing a proxy to cover the functionality.

StarAurryon commented 2 years ago

Just a quick question: is "invalidating the old token as soon as you notice a request using the new token" (the Google Smart Home Test Suite recommendation quoted above) the currently planned behaviour?

I'm using Hydra in production on a 3-site cluster with JWTs and an access_token lifespan of 5m so that access can easily be revoked in case of compromise. In one week of usage of our Drive Sync Client with 30 beta testers, 2 clients have already lost their new tokens due to crashes at the wrong moment or the OS killing the app at shutdown.

aeneasr commented 2 years ago

Yes, but it's very difficult to implement securely.

github-actions[bot] commented 1 year ago

Hello contributors!

I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit!

Throughout its lifetime, Ory has received over 10,000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.

Unfortunately, burnout has become a topic of concern amongst open-source projects.

It can lead to severe personal and health issues as well as opening catastrophic attack vectors.

The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.

If this issue was marked as stale erroneously you can exempt it by adding the backlog label, assigning someone, or setting a milestone for it.

Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!

Thank you 🙏✌️

RazerM commented 1 year ago

This issue is not stale

StarAurryon commented 1 year ago

I was waiting for input on #2827 regarding code reuse, to know whether I can adapt the code to Hydra v2.

JohannesHuster commented 2 months ago

@aeneasr Would more or different work from the community be helpful here at the moment? And do you think this issue will be a priority in the near future? Not meant to put any pressure on you; we're just discussing whether to wait on this or change quite a few things on our end to solve our problems, and we would be grateful for a quick update 🙏

We don't have any Go programmers on the team, but are happy to help where we can (documentation, test cases?, ...)