ory / hydra


Duplicate key value violates unique constraint #1719

Closed: mpanne closed this issue 4 years ago

mpanne commented 4 years ago

Describe the bug

For our project we are using Hydra and a PostgreSQL database. Sometimes, when trying to get an access token by calling the /oauth2/token endpoint, Hydra returns a 500 Internal Server Error and logs the following error message:

level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error

Hydra seems to try to add an access token request to the database with a request ID that already exists. What we could observe is that every time this happened, multiple (2-3) token requests had been received within a very short time period (<100 ms).

Have you seen this at some point, or do you have any hint as to what the root cause could be? We are using an older version (v1.0.0), but I couldn't find anything regarding this in the changelog.

Reproducing the bug

We could not reproduce the bug, not even when sending ~500 token requests (almost) simultaneously.

Environment

aeneasr commented 4 years ago

Never seen this before. Could you set LOG_LEVEL=trace and print out the trace? That would help a lot! Please also include the logs surrounding the error (so we can see what requests are happening, etc.).

It would also be interesting to see if this happens with the latest version.

mpanne commented 4 years ago

We will set the log level as requested and provide additional logs as soon as it happens again, thank you! We are also about to update to a newer Hydra version soon.

mpanne commented 4 years ago

Is log level trace even supported? As stated here, only debug is supported. And even when setting the log level to debug, we cannot observe any debug logs.

aeneasr commented 4 years ago

Yeah, that's a docs issue. trace should work, but you can also use debug if you want. The error messages, if any occur, should be enhanced with stack traces.

mpanne commented 4 years ago

It happened again today. Below are the logs, including debug output. The request_id is randomly generated and added to the request headers by our nginx ingress controller.

time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06
time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error
time="2020-02-10T09:42:19Z" level=debug msg="Stack trace:
github.com/ory/fosite/handler/oauth2.(*RefreshTokenGrantHandler).PopulateTokenEndpointResponse
    /go/pkg/mod/github.com/ory/fosite@v0.30.0/handler/oauth2/flow_refresh.go:164
github.com/ory/fosite.(*Fosite).NewAccessResponse
    /go/pkg/mod/github.com/ory/fosite@v0.30.0/access_response_writer.go:36
github.com/ory/hydra/oauth2.(*Handler).TokenHandler
    /go/src/github.com/ory/hydra/oauth2/handler.go:585
net/http.HandlerFunc.ServeHTTP
    /usr/local/go/src/net/http/server.go:2007
github.com/julienschmidt/httprouter.(*Router).Handler.func1
    /go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/params_go17.go:26
github.com/julienschmidt/httprouter.(*Router).ServeHTTP
    /go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/router.go:334
github.com/urfave/negroni.Wrap.func1
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:46
github.com/urfave/negroni.HandlerFunc.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
net/http.HandlerFunc.ServeHTTP
    /usr/local/go/src/net/http/server.go:2007
github.com/ory/hydra/x.RejectInsecureRequests.func1
    /go/src/github.com/ory/hydra/x/tls_termination.go:55
github.com/urfave/negroni.HandlerFunc.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/x/metricsx.(*Service).ServeHTTP
    /go/pkg/mod/github.com/ory/x@v0.0.73/metricsx/middleware.go:261
github.com/urfave/negroni.middleware.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/hydra/metrics/prometheus.(*MetricsManager).ServeHTTP
    /go/src/github.com/ory/hydra/metrics/prometheus/middleware.go:26
github.com/urfave/negroni.middleware.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/meatballhat/negroni-logrus.(*Middleware).ServeHTTP
    /go/pkg/mod/github.com/meatballhat/negroni-logrus@v0.0.0-20170801195057-31067281800f/middleware.go:136
github.com/urfave/negroni.middleware.ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/urfave/negroni.(*Negroni).ServeHTTP
    /go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:96
net/http.serverHandler.ServeHTTP
    /usr/local/go/src/net/http/server.go:2802
net/http.(*conn).serve
    /usr/local/go/src/net/http/server.go:1890
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=110682774 method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8 status=500 text_status="Internal Server Error" took=110.682774ms
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=146429073 method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06 status=200 text_status=OK took=146.429073ms

aeneasr commented 4 years ago

This is still 1.0.0, right?

aeneasr commented 4 years ago

It appears that this is caused by two processes trying to refresh the access token at the same time; because the transaction is not yet committed, one of them fails. It is actually intended that this results in an error condition (although this error isn't a very helpful symptom of it), because a refresh token is supposed to be used only once. Otherwise one refresh token could be used to create two (three, four, five...) tokens, and so on.

aeneasr commented 4 years ago

I've looked around some other projects. The general guideline is to have clients refresh tokens only once. This can be tricky, though, when there is no shared state between the components. It appears that some providers implement a grace period (e.g. 60 seconds) during which the old refresh token returns the same response. I don't think that would solve your problem, though, because the requests are apparently only fractions of a second apart - so fast that the first database transaction hasn't even completed yet.

I'd suggest you add a random offset (e.g. -10 ± rnd(0,5) s) to your clients' refresh schedule to prevent synchronization issues. Collisions can still occur, of course, but they should be much less common.
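For illustration, here is a minimal sketch of such a client-side offset in Go; the scheduling helper and the placeholder refresh call are hypothetical and not part of any Hydra client library:

```go
package main

import (
	"math/rand"
	"time"
)

// scheduleRefresh picks when a hypothetical client should refresh its token:
// roughly 10 seconds before expiry, shifted by a random ±5 second jitter so
// that multiple client instances don't all refresh at exactly the same moment.
func scheduleRefresh(expiresAt time.Time) time.Time {
	jitter := time.Duration(rand.Int63n(int64(10*time.Second))) - 5*time.Second
	return expiresAt.Add(-10*time.Second + jitter)
}

func main() {
	expiresAt := time.Now().Add(1 * time.Hour)
	time.Sleep(time.Until(scheduleRefresh(expiresAt)))
	// A real client would now POST to /oauth2/token with
	// grant_type=refresh_token exactly once and store the new token pair.
}
```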

mpanne commented 4 years ago

Yes, it's still 1.0.0. And yes, we are aware that the refresh token is only valid for a single refresh request. Unfortunately, we do not own the clients requesting/refreshing the tokens, so we cannot implement the random offset on the client side. We will try to clarify why the same refresh token is used twice almost simultaneously, though. Maybe we can find a solution together with our customer. Thank you!

aeneasr commented 4 years ago

No problem, I'm pretty sure that's what's causing it :)

mpanne commented 4 years ago

Yes, that sounds reasonable. Could using a newer version fix this issue?

aeneasr commented 4 years ago

I don't think so. I've also looked into this a bit more. We can't reliably echo the same response twice, because that would mean we'd have to cache the full, valid access token somewhere (e.g. the database), which we have explicitly said we won't do.

What you could obviously try is reducing the latency between Hydra and the DB - 130ms seems like a lot to me.

aaslamin commented 4 years ago

What I am stumped by is how two requests with the same request ID were proxied to your Hydra nodes.

That should definitely not happen. What does your NGINX config look like for your Hydra routes? If I remember correctly, some old versions of NGINX would even retry non-idempotent upstream requests by default, although I doubt you are on one of those versions.

With that said, I noticed you are not injecting UUID v4s, which would make it very unlikely to generate the same ID twice. Finally, double-check your LB config to make sure a client cannot add a request ID header to their own request.

aeneasr commented 4 years ago

@aaslamin great to see you here again :)

I checked the code; I think the problem is happening because we use the original request ID to track refresh/access tokens over time. So what's happening is this (see the sketch after the list):

  1. refresh token is fetched from store (request id is extracted)
  2. old tokens are removed
  3. new tokens are minted with extracted request id
  4. new tokens are inserted into store
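
To make the race easier to see, here is a rough, purely illustrative sketch of that flow in Go; the storage interface below is invented for this comment and is not fosite's actual API. Two concurrent calls can both pass step 1 with the same request ID before either transaction commits, so the second insert in step 4 trips the unique index:

```go
package sketch

import "context"

// tokenStore is a hypothetical storage interface, for illustration only.
type tokenStore interface {
	GetRefreshTokenSession(ctx context.Context, signature string) (requestID string, err error)
	RevokeTokens(ctx context.Context, requestID string) error
	CreateAccessTokenSession(ctx context.Context, requestID, signature string) error
}

func refresh(ctx context.Context, s tokenStore, oldSignature, newSignature string) error {
	// 1. refresh token is fetched from the store and the request ID extracted.
	requestID, err := s.GetRefreshTokenSession(ctx, oldSignature)
	if err != nil {
		return err // a "losing" concurrent caller should ideally fail here with "not found"
	}
	// 2. old tokens are removed.
	if err := s.RevokeTokens(ctx, requestID); err != nil {
		return err
	}
	// 3. new tokens are minted with the extracted request ID (not shown).
	// 4. new tokens are inserted; if a concurrent transaction already inserted
	//    rows for this request ID, hydra_oauth2_access_request_id_idx fires.
	return s.CreateAccessTokenSession(ctx, requestID, newSignature)
}
```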

What I'm curious about is that the transaction apparently isn't helping here. Instead of a conflict, it should give a 404 (refresh token not found). It might be because the first SELECT does not lock the table (which is, performance-wise, a good thing lmao), so we get into a state where the 429 can happen.

I'm not sure if there's really a way for us to fix this. If we generated a new request ID, you would end up with two tokens. The only thing we could do is figure out how to return a 404 instead of the 429, but it would definitely touch some internals, potentially even individually for every database.

aeneasr commented 4 years ago

This is where the reused request ID causes the 429:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L159-L160

This is the "fetch from store" I was referring to:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L135-L140

Actually wondering now if a rollback could cause the old refresh token to come back into the store...

aeneasr commented 4 years ago

Looks like this can be solved by changing the IsolationLevel: https://golang.org/pkg/database/sql/#IsolationLevel

Which can be set for transactions: https://golang.org/pkg/database/sql/#TxOptions

So basically, by picking another isolation level, this wouldn't return a 429 but instead a 404 (or rather a 401/403 as the appropriate error type).
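
For reference, this is roughly what that would look like with database/sql; the DSN and the commented-out queries are placeholders, not Hydra's actual code:

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/hydra?sslmode=disable") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}

	// Begin the transaction with Repeatable Read instead of Postgres' default
	// Read Committed, so a concurrent refresh of the same token fails with a
	// serialization error instead of a duplicate-key violation on insert.
	tx, err := db.BeginTx(context.Background(), &sql.TxOptions{
		Isolation: sql.LevelRepeatableRead,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()

	// ... SELECT the refresh token, DELETE the old sessions, INSERT the new ones ...

	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```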

aeneasr commented 4 years ago

Yeah, I think that by setting the isolation level to LevelRepeatableRead, the 429 should go away.

aeneasr commented 4 years ago

Last comment for now - marking this as a bug :)

aaslamin commented 4 years ago

@aaslamin great to see you here again :)

🍻 cheers!

The only thing we could do is figure out how to return 404 instead of 429

You mean a 409 (Conflict), right? It's currently returning a 5xx error.

Yeah I think by setting the isolation level to LevelRepeatableRead the 429 should go away

I think you are right 🙏

• The default isolation level in MySQL with InnoDB as the storage engine (the most common one, I believe) is Repeatable Read.
• Postgres has only one storage engine (as opposed to nine in the MySQL 🌎...), and its default isolation level is Read Committed.

The behavioral difference in this isolation mode is significant:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message: ERROR: could not serialize access due to concurrent update

Source: https://www.postgresql.org/docs/12/transaction-iso.html
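
Assuming Hydra stays on lib/pq, the resulting "could not serialize access" failure could then be detected and mapped to a client error instead of a 500. A sketch of what that check could look like (the helper itself is hypothetical):

```go
package sketch

import (
	"errors"
	"net/http"

	"github.com/lib/pq"
)

// statusForStorageError is a hypothetical helper that maps a Postgres error
// to an HTTP status. SQLSTATE 40001 is "serialization_failure", which is what
// Repeatable Read raises on a concurrent update of the same refresh token;
// 23505 is "unique_violation", the current duplicate-key symptom.
func statusForStorageError(err error) int {
	var pqErr *pq.Error
	if errors.As(err, &pqErr) {
		switch pqErr.Code {
		case "40001", "23505":
			return http.StatusConflict
		}
	}
	return http.StatusInternalServerError
}
```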


For a fix, I would love to see:

This brings up another point. Ideally, any first-class supported database in Hydra should have full support for the Repeatable Read isolation level - perhaps this could be a mandatory requirement if someone in the future wants to add another implementation of the fosite storage interfaces for a new DB. More accurately, any database added to Hydra that wishes to implement fosite's transactional interface should support this isolation level.

aeneasr commented 4 years ago

Yeah, 409! :)

I think Repeatable Read is an acceptable trade-off for refreshing tokens, since that operation doesn't happen often for a specific row (which, if I understood correctly, will be locked rather than the entire table).

I think the test case makes sense. One refresh token should be enough, though. You can probably fire n goroutines that try to refresh that token concurrently, and you should definitely see the error (rough sketch below).
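
A rough sketch of such a test, assuming a locally running Hydra on its default public port and a pre-seeded client and refresh token; none of these values come from Hydra's actual test suite:

```go
package sketch

import (
	"net/http"
	"net/url"
	"sync"
	"testing"
)

// TestConcurrentRefresh fires n concurrent refresh requests with the same
// refresh token. Exactly one should succeed; the others should get a 4xx,
// never a 500.
func TestConcurrentRefresh(t *testing.T) {
	const n = 10
	form := url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {"seeded-refresh-token"}, // hypothetical pre-seeded token
		"client_id":     {"test-client"},          // hypothetical client
		"client_secret": {"test-secret"},
	}

	var wg sync.WaitGroup
	statuses := make(chan int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.PostForm("http://127.0.0.1:4444/oauth2/token", form)
			if err != nil {
				t.Error(err)
				return
			}
			resp.Body.Close()
			statuses <- resp.StatusCode
		}()
	}
	wg.Wait()
	close(statuses)

	succeeded := 0
	for code := range statuses {
		switch {
		case code == http.StatusOK:
			succeeded++
		case code >= 500:
			t.Errorf("got %d; the losing requests should fail with a 4xx", code)
		}
	}
	if succeeded != 1 {
		t.Errorf("expected exactly one successful refresh, got %d", succeeded)
	}
}
```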

Regarding isolation - this is really only required for operations where a write depends on a previous read. I'm not sure whether setting RepeatableRead for every transaction is performant. But then again, the performance impact is probably minimal, while the consistency is much better. Maybe MaybeBeginTx should always request RepeatableRead isolation.

aaslamin commented 4 years ago

Maybe MaybeBeginTx should always request RepeatableRead isolation.

That's an interesting approach, although fosite currently has no notion of what an isolation level is. Maybe it should? :grinning:

Perhaps, before BeginTX is called, fosite could inject a recommended/desired isolation level under a unique key into the context that is passed through to the storage layer. We could even use the types defined in the database/sql package as the context values, although that may make things too tightly coupled 🤷‍♂

Finally, as a best practice, a fosite-compliant authorization service should check for this key and configure its TxOptions accordingly prior to creating the transaction.

This way we get flexibility in isolation levels for the various flows instead of going all out with RepeatableRead, which, as you pointed out, could impact performance. A rough sketch of that idea is below.
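
A minimal sketch of that idea, assuming an unexported context key published by fosite and read by the storage layer; none of these identifiers exist in fosite today:

```go
package sketch

import (
	"context"
	"database/sql"
)

// isolationKey is a hypothetical, package-private context key under which
// fosite would publish the isolation level it recommends for the current flow.
type isolationKey struct{}

// WithIsolation is what fosite could call before handing the context to the store.
func WithIsolation(ctx context.Context, level sql.IsolationLevel) context.Context {
	return context.WithValue(ctx, isolationKey{}, level)
}

// txOptionsFromContext is what a storage implementation (e.g. Hydra's SQL store)
// could use when calling db.BeginTx, falling back to the driver default.
func txOptionsFromContext(ctx context.Context) *sql.TxOptions {
	if level, ok := ctx.Value(isolationKey{}).(sql.IsolationLevel); ok {
		return &sql.TxOptions{Isolation: level}
	}
	return &sql.TxOptions{}
}
```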

aeneasr commented 4 years ago

That's an interesting approach, although currently fosite doesn't have understanding of what an isolation level is. Maybe it should? 😀

Yeah, I think that's OK. There's actually so much implicit knowledge involved in the storage implementation already that, to be honest, I gave up on imposing that in fosite.

I think setting the appropriate isolation level in BeginTx in Hydra would be a good solution, plus a failing test to verify the implementation.