Closed jrgm closed 6 years ago
from mtg: for devops on tuesday
The problem is that the current bin/purge-expired-tokens.js is 1) very slow
Is it doing a full table scan or something?
Perhaps we should be pruning in a specific time window NOW() - window < expires < now()
in a similar manner to what the auth-server does for sessionTokens:
So, with EXPLAIN SELECT token FROM tokens WHERE clientId NOT IN (...) AND expiresAt < NOW()
, it is a table scan.
But I had modified the DELETE to do expiresAt < '2016-12-01'
.
With EXPLAIN SELECT token FROM tokens WHERE clientId NOT IN (...) AND expiresAt < '2016-12-01' LIMIT N
, the expiresAt index is used with Using index condition; Using where
.
If I drop the NOT IN, EXPLAIN SELECT token FROM tokens WHERE expiresAt < '2016-12-01' LIMIT N
uses the expiresAt index with Using where; Using index
.
But here's the thing. For 50K rows, the former query takes ~15 minutes; the latter query take 0.2 seconds. o_O.
When I run the DELETE there's an initial long phase where no writing happens; then an equally long phase when writing starts. The initial phase doesn't induce errors to oauth and profile (I think on timeouts). The second phase is where errors happen. While polling mysql -=e 'show full processlist'
, during the write phase I can see other DELETE's waiting to access a lock. It appears that sometimes they are acquired without a problem, but in other cases, the HTTP request will time out.
Batching explicit `DELETE FROM tokens WHERE token IN (...) AND expiresAt < 'YYYY-mm-dd' would use the PRIMARY key, and avoid any gap-locking or next-key locking that leads to contention with other DELETE queries.
And also what I said before: filtering out the clientIds in memory would be way faster than SQL filtering (but might have a problem of pulling up the oldest Pocket token ids every time, leading to no rows to delete after filtering them out).
I have a branch that implements this. I just need to finalize tests.
The problem is that the current bin/purge-expired-tokens.js is 1) very slow, and 2) causes some number of failed oauth and profile api calls on timeouts.
So, perhaps something like a) selecting ~100000 rows in a date range, filitering out the pocket ids in memory, and then batching out DELETE FROM tokens where token in (...) and expires < now(), or b) dumping out all candidate tokens to disk (or queue) and batching from that list.