mozilla / fxa-oauth-server

OAuth server for Firefox Accounts
48 stars 40 forks source link

Need a different strategy for purging expired oauth tokens #571

Closed jrgm closed 6 years ago

jrgm commented 6 years ago

The problem is that the current bin/purge-expired-tokens.js is 1) very slow, and 2) causes some number of failed oauth and profile api calls on timeouts.

So, perhaps something like a) selecting ~100000 rows in a date range, filitering out the pocket ids in memory, and then batching out DELETE FROM tokens where token in (...) and expires < now(), or b) dumping out all candidate tokens to disk (or queue) and batching from that list.

vladikoff commented 6 years ago

from mtg: for devops on tuesday

rfk commented 6 years ago

The problem is that the current bin/purge-expired-tokens.js is 1) very slow

Is it doing a full table scan or something?

rfk commented 6 years ago

Perhaps we should be pruning in a specific time window NOW() - window < expires < now() in a similar manner to what the auth-server does for sessionTokens:

https://github.com/mozilla/fxa-auth-db-mysql/blob/5c6622c41987ef62d2b40d60d569af5cb73571df/lib/db/schema/patch-068-069.sql#L26

jrgm commented 6 years ago

So, with EXPLAIN SELECT token FROM tokens WHERE clientId NOT IN (...) AND expiresAt < NOW(), it is a table scan.

But I had modified the DELETE to do expiresAt < '2016-12-01'.

With EXPLAIN SELECT token FROM tokens WHERE clientId NOT IN (...) AND expiresAt < '2016-12-01' LIMIT N, the expiresAt index is used with Using index condition; Using where.

If I drop the NOT IN, EXPLAIN SELECT token FROM tokens WHERE expiresAt < '2016-12-01' LIMIT N uses the expiresAt index with Using where; Using index.

But here's the thing. For 50K rows, the former query takes ~15 minutes; the latter query take 0.2 seconds. o_O.

When I run the DELETE there's an initial long phase where no writing happens; then an equally long phase when writing starts. The initial phase doesn't induce errors to oauth and profile (I think on timeouts). The second phase is where errors happen. While polling mysql -=e 'show full processlist', during the write phase I can see other DELETE's waiting to access a lock. It appears that sometimes they are acquired without a problem, but in other cases, the HTTP request will time out.

jrgm commented 6 years ago

Batching explicit `DELETE FROM tokens WHERE token IN (...) AND expiresAt < 'YYYY-mm-dd' would use the PRIMARY key, and avoid any gap-locking or next-key locking that leads to contention with other DELETE queries.

jrgm commented 6 years ago

And also what I said before: filtering out the clientIds in memory would be way faster than SQL filtering (but might have a problem of pulling up the oldest Pocket token ids every time, leading to no rows to delete after filtering them out).

jrgm commented 6 years ago

I have a branch that implements this. I just need to finalize tests.