travisghansen / external-auth-server

easy auth for reverse proxies

EAS stops working every two weeks #155

Closed juissi-t closed 2 years ago

juissi-t commented 2 years ago

We installed external-auth-server a couple of months back to authorize requests using Firebase tokens. The service seems to work fine when it is first started, but after roughly two weeks, verification starts to fail.

external-auth-server-6f57d4ddc6-xtblv external-auth-server {"level":"info","message":"starting verify pipeline","service":"external-auth-server","timestamp":"2022-08-16T21:37:30.592Z"}
external-auth-server-6f57d4ddc6-xtblv external-auth-server {"level":"info","message":"starting verify for plugin: firebase_jwt","service":"external-auth-server","timestamp":"2022-08-16T21:37:30.592Z"}
external-auth-server-6f57d4ddc6-xtblv external-auth-server {"level":"info","message":"end verify pipeline with status: 401","service":"external-auth-server","timestamp":"2022-08-16T21:37:30.593Z"}

Requests might not happen every day; the previous request was five days earlier, and that one succeeded.

I increased the log verbosity to verbose to see if anything shows up there, but because I don't have a better way to trigger the issue than to wait, it might take two weeks to find out whether that reveals anything. In the meantime, are there any ideas on how I could debug the problem?

travisghansen commented 2 years ago

That seems like pretty odd timing, doesn't it?

Do you have a copy of the actual response to the client?

A 401 generally means the user was not identified, as opposed to the token being expired or similar, which would result in a 403. That leads me to believe something quite early in the process is failing.

juissi-t commented 2 years ago

The developer got this (URL sanitized):

www-authenticate: Bearer realm="https://api.xyz.com/v1/jobs/testing-xyz", error="JsonWebTokenError", error_description="secret or public key must be provided"

After restarting the pod, the same Firebase token works fine.

juissi-t commented 2 years ago

I wonder if there could be some issue with fetching the Firebase public keys after the cache TTL has been exceeded? We are using the in-process memory cache.

After restarting with verbose logging, it looks like the public keys are loaded successfully during the first request.

external-auth-server-7f4bfb5794-sfrn6 external-auth-server {"level":"verbose","message":"firebase-jwt: fetching public signing keys","service":"external-auth-server","timestamp":"2022-08-17T17:46:33.043Z"}

On subsequent requests, cached keys are used.

external-auth-server-7f4bfb5794-sfrn6 external-auth-server {"level":"verbose","message":"firebase-jwt: using cached public signing keys","service":"external-auth-server","timestamp":"2022-08-17T17:46:56.305Z"}

I will monitor the instance to see if the cached keys start to fail at some point.

travisghansen commented 2 years ago

Yes, that seems like the likely candidate, but if the key is no longer in the cache it should be transparently re-fetched.

juissi-t commented 2 years ago

Verbose log for a request which fails in the same fashion:

2022-08-30T18:28:17.775616090Z {"level":"info","message":"starting verify pipeline","service":"external-auth-server","timestamp":"2022-08-30T18:28:17.775Z"}
2022-08-30T18:28:17.778562352Z {"level":"info","message":"starting verify for plugin: firebase_jwt","service":"external-auth-server","timestamp":"2022-08-30T18:28:17.778Z"}
2022-08-30T18:28:17.778703266Z {"level":"verbose","message":"parent request info: {\"uri\":\"https://xyz.asdf.com/v1/jobs/testing-qwerty-l6badvqyxeu6\",\"parsedUri\":{\"scheme\":\"https\",\"host\":\"xyz.asdf.com\",\"path\":\"/v1/jobs/testing-qwerty-l6badvqyxeu6\",\"reference\":\"absolute\"},\"parsedQuery\":{},\"method\":\"GET\"}","service":"external-auth-server","timestamp":"2022-08-30T18:28:17.778Z"}
2022-08-30T18:28:17.779063545Z {"level":"verbose","message":"firebase-jwt: using cached public signing keys","service":"external-auth-server","timestamp":"2022-08-30T18:28:17.778Z"}
2022-08-30T18:28:17.779786442Z {"level":"info","message":"end verify pipeline with status: 401","service":"external-auth-server","timestamp":"2022-08-30T18:28:17.779Z"}

HTTP response:

< HTTP/2 401
< date: Tue, 30 Aug 2022 18:28:17 GMT
< content-type: text/html
< content-length: 172
< www-authenticate: Bearer realm="https://xyz.asdf.com/v1/jobs/testing-qwerty-l6badvqyxeu6", error="JsonWebTokenError", error_description="secret or public key must be provided"
...

A request using the exact same token works after EAS has been restarted.

Logs also say that it's fetching the public keys only once during the pod lifetime:

2022-08-17T17:46:33.043203133Z {"level":"verbose","message":"firebase-jwt: fetching public signing keys","service":"external-auth-server","timestamp":"2022-08-17T17:46:33.043Z"}

After that, it only ever uses the cached public keys. I checked Google's keys and their lifetime is about 16 days. Could there be some sort of issue in the key cache, where it doesn't invalidate the old keys or fetch new ones once the old ones have expired?
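
For reference, the advertised key lifetime can be checked directly by fetching the signing certs and looking at the Cache-Control header and the key IDs (kids). A rough Node sketch, assuming the standard securetoken x509 endpoint (which may or may not be exactly what EAS fetches):

// Rough sketch: print Google's advertised cache lifetime and the available kids.
// Assumes the standard securetoken x509 endpoint; EAS may fetch a different URL.
const https = require("https");

const url = "https://www.googleapis.com/robot/v1/metadata/x509/securetoken@system.gserviceaccount.com";

https.get(url, (res) => {
  let body = "";
  res.on("data", (chunk) => (body += chunk));
  res.on("end", () => {
    console.log("cache-control:", res.headers["cache-control"]); // e.g. public, max-age=...
    console.log("kids:", Object.keys(JSON.parse(body)));         // key IDs currently published
  });
});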

travisghansen commented 2 years ago

Yeah, it does indeed say it's using the cached data, but then the check ends up with no data, so something is definitely off. I checked a few days ago and the logic seemed sane based on what I understand the caching mechanism's return values to be, but clearly something is wrong :(

ryanandonian commented 2 years ago

I think I might have discovered why. It looks like the cache_ttl value is being ignored when calling store.set: the header detection and coercion logic is sound, but the resulting TTL is then simply abandoned.

https://github.com/travisghansen/external-auth-server/blob/e4646e151ca9062e0ace68b748fcd8d7c9c7471c/src/plugin/firebase/index.js#L140-L142

This seems like it would set the cached item with no TTL.
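
Paraphrased, the shape of the bug would be roughly this (not the actual EAS source; it assumes a cache-manager-style store whose set() takes a { ttl } option in seconds, and the cache key name is a placeholder):

// Paraphrased sketch, not the actual EAS code. store is assumed to be a
// cache-manager-style cache whose set() accepts { ttl } in seconds;
// "firebase_jwt:public_keys" is just a placeholder key name.
const cache_ttl = 18671; // derived from Google's Cache-Control: max-age header

// Buggy call: no TTL is passed, so the in-process memory cache keeps the
// stale keys forever.
await store.set("firebase_jwt:public_keys", keys);

// Fixed call: the derived TTL is honored, so the keys expire and are
// re-fetched once Google rotates them.
await store.set("firebase_jwt:public_keys", keys, { ttl: cache_ttl });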

travisghansen commented 2 years ago

Do the keys actually rotate that frequently with firebase?

Also, how frequently do you get a new token? Is the same token being used the whole time, or does it get refreshed externally, and if so, at what frequency?

ryanandonian commented 2 years ago

According to Google's docs, the public key pairs live for <2 weeks before being rotated.

Google-managed key pairs are automatically rotated and used for signing for a maximum of two weeks. The rotation process is probabilistic; usage of the new key will gradually ramp up and down over the key's lifetime.

Our own app's JWTs live for only 1 hour, but we usually use Firebase's built-in APIs to obtain them (the caller originates in a Google Cloud Function), so they're likely refreshed each time they get close to expiration.

travisghansen commented 2 years ago

Ah, then indeed that would explain it! Do you want to test with the patch I sent, or do you just want me to snap a new release and give it a go?

ryanandonian commented 2 years ago

Since it usually doesn't manifest for at least ~10-12 days, a release would probably be easiest for us to validate, given we can deploy a published version relatively easily, and then wait :)

travisghansen commented 2 years ago

I just pushed an update to the next tag (it's mutable so make sure your servers pull the most recent content). Give that a try and we'll see how it goes.

ryanandonian commented 2 years ago

We deployed the next tag this morning and it appears to have successfully resolved the cache-not-being-refreshed issue (short logs below, with some important annotations marked with #).

2022-08-31T17:10:38.606103647Z: starting verify pipeline
2022-08-31T17:10:38.618415946Z: starting verify for plugin: firebase_jwt
# at this moment the `max-age` in the Google API response header was just under 5.2 hours (max-age=18671)
2022-08-31T17:10:38.622404357Z: firebase-jwt: fetching public signing keys
2022-08-31T17:11:02.839835564Z: starting verify pipeline
2022-08-31T17:11:02.840631614Z: starting verify for plugin: firebase_jwt
# successful cache hit
2022-08-31T17:11:02.841431074Z: firebase-jwt: using cached public signing keys: {..<XXXXX>...}
2022-08-31T17:13:16.581926810Z: starting verify pipeline
2022-08-31T17:13:16.582974306Z: starting verify for plugin: firebase_jwt
# successful cache hit
2022-08-31T17:13:16.583657704Z: firebase-jwt: using cached public signing keys: {..<XXXXX>...}
2022-08-31T22:48:16.739490351Z: starting verify pipeline
2022-08-31T22:48:16.742564637Z: starting verify for plugin: firebase_jwt
# successful stale cache detected, re-fetching
2022-08-31T22:48:16.743650854Z: firebase-jwt: fetching public signing keys
2022-08-31T22:48:34.519818637Z: starting verify pipeline
2022-08-31T22:48:34.520214247Z: starting verify for plugin: firebase_jwt
# successful cache hit
2022-08-31T22:48:34.520614187Z: firebase-jwt: using cached public signing keys: {..<XXXXX>...}

While it will ultimately take up to the "maximum of two weeks" for these keys to be rotated out and prove the fix beyond a reasonable doubt, I'm comfortable claiming that this was the underlying cause of our issue.

I believe this is a good deep-dive summary of the issue: the kid gets rotated out over time, so when the verification process tried to grab the public key, the expected kid was no longer in the cached response from Google's API (new JWTs are signed by new kids, which weren't available in the stale googleapis JSON response). That would explain why we were getting the "secret or public key must be provided" error state, and not a different error like "expired key" or "expired token". Digging into jsonwebtoken, that error message only shows up when it isn't given a signing key, which matches what I found.
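
As an illustrative sketch of that failure mode (not the actual EAS code; verifyWithCachedKeys and cachedKeys are placeholder names):

const jwt = require("jsonwebtoken");

// Illustrative only; cachedKeys is assumed to be the cached JSON map of
// kid -> PEM cert as returned by Google's API.
function verifyWithCachedKeys(token, cachedKeys) {
  // Read the kid from the token header to pick the matching public key.
  const { header } = jwt.decode(token, { complete: true });

  // If that kid has been rotated out since the keys were cached,
  // the lookup comes back undefined...
  const publicKey = cachedKeys[header.kid];

  // ...and jsonwebtoken then throws
  // JsonWebTokenError: secret or public key must be provided
  return jwt.verify(token, publicKey, { algorithms: ["RS256"] });
}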

travisghansen commented 2 years ago

Yeah, that sounds sane. It brings up a follow-up question: can we cache them at all? I'm unclear on exactly how the key rotation process is designed by Firebase, so are there scenarios where a new key starts signing tokens after we've retrieved the current set and before the cache expires?

ryanandonian commented 2 years ago

I gave Google's documentation a re-read while investigating the kid stuff: they guarantee a retrieved key remains valid for at least 24 hours, and they recommend caching it for "at most 24 hours".

If you download and cache the public key, we recommend caching it for at most 24 hours to ensure that you always have the current key. Regardless of when you retrieve the public key, it will be valid for at least 24 hours after you retrieve it.

Since this TTL is derived from their own cache headers, I believe it is safe to assume they won't send us bad (e.g., >24h) max-age response headers. It could make sense to add some restrictions to the get-TTL function, but that seems like overkill, and it would need maintaining if they change that paradigm down the line (e.g., if they start sending max-age=3600 by default and update their docs to reflect it, the check would become sub-optimal).
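
For illustration, the TTL derivation plus that optional clamp could look roughly like this (getCacheTtlSeconds is a placeholder name, not the actual EAS function):

// Rough sketch only; not the actual EAS get-TTL code.
function getCacheTtlSeconds(headers) {
  // e.g. cache-control: public, max-age=18671
  const cacheControl = headers["cache-control"] || "";
  const match = cacheControl.match(/max-age=(\d+)/);
  return match ? parseInt(match[1], 10) : 0;
}

// The optional restriction discussed above: never cache longer than the
// documented 24h guarantee.
// const ttl = Math.min(getCacheTtlSeconds(response.headers), 24 * 60 * 60);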

travisghansen commented 2 years ago

Ok, I'll do a little more digging to see what I can find out as well. I'm still a little cloudy on how the 24 hours relates to the 2-week rotation, but hopefully I can figure something sane out.

Do you know the maximum validity they allow for a token? How long do the tokens you are currently issuing last?

ryanandonian commented 2 years ago

I'm not quite sure why they use both 24h and "<=2 weeks". My guess is it's some Goldilocks zone of "pretty secure" and "not a pain to rotate", but only some PM at Firebase would know the answer for certain.

According to their docs, the maximum custom signed token lifetime allowed is 3600 seconds (ours use that maximum), and their own SDK generates tokens that expire after 1 hour.

travisghansen commented 2 years ago

Ah ok, well that seems like it would likely be safe to cache until max-age then. I'll snap a new release and then we can sit and wait :D

travisghansen commented 2 years ago

https://github.com/travisghansen/external-auth-server/pull/157

v0.12.4 tag built.

ryanandonian commented 2 years ago

FYI, we've been running that first next tag you built for the last 5 days and keep seeing the positive case (cache busting as appropriate, all is good). So far all signs point to this being a valid fix. When they rotate the key IDs that should also be handled, since it'll no longer be caching stale public keys. Do you want to close my PR #156 for recordkeeping (since you merged the fix in #157), or should I?

travisghansen commented 2 years ago

Yeah, go ahead and close it down, and we can close this issue now as well.

Thanks for the report and patience getting it fixed! Not to mention the code to fix it!

ryanandonian commented 2 years ago

Since @juissi-t was the one who created this Issue, I can't mark it as closed, but it should be good to close as Fixed by #156 & #157 if you don't want to wait for them to get back from vacation :)