Open reivilibre opened 1 year ago
> Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.
Can we piggyback off the sharding logic used for event persisters? (Is that sharded by room id?)
To some extent, but that means having a definitive list of workers which are nominated for the job. That's very possibly fine! (But just noting a consideration.)
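To make the trade-off concrete, here is a minimal sketch of sharding re-sync work by room ID over a fixed list of nominated workers. It is only an illustration: the function name `pick_resync_worker`, the worker names, and the choice of SHA-256 are assumptions, not necessarily how the event persister sharding is implemented.

```python
import hashlib
from typing import Sequence


def pick_resync_worker(room_id: str, workers: Sequence[str]) -> str:
    """Deterministically map a room ID to one of the nominated workers.

    Hypothetical helper: every process that evaluates this with the same
    `workers` list (same names, same order) gets the same answer, so exactly
    one worker considers itself responsible for a given room.
    """
    if not workers:
        raise ValueError("at least one worker must be nominated")
    # Hash the room ID so that rooms spread roughly evenly across workers.
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    index = int.from_bytes(digest, byteorder="big") % len(workers)
    return workers[index]


# Hypothetical worker names, standing in for the configured list.
nominated = ["resync_worker1", "resync_worker2", "resync_worker3"]
owner = pick_resync_worker("!abc:example.org", nominated)
```

The mapping survives a restart because it depends only on the room ID and the configured list. The flip side is the consideration above: if a nominated worker is taken out of service without the list being updated, its rooms are never re-synced.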
We can use the cross-worker locking stuff that we implemented for handling inbound federation:
I think sharding the partial join stuff isn't something we need to worry about now TBH. We have a bunch of much busier streams that aren't sharded?
An enhancement of: #12994 (worker-mode support for Faster Remote Room Joins).
Instead of relying on the master to perform the re-syncing of the rooms, we should allow other workers to be involved. Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again. We should also be mindful that, in a hypothetical deployment, workers can be taken out of service. A room shouldn't be locked to one worker forever, because if that worker goes away the re-sync would never progress.
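One way to satisfy both requirements (exactly one worker at a time, but never stuck on a dead worker) is a lease that expires unless it is renewed. The sketch below is an in-memory illustration only: `ResyncLeaseTable`, `LEASE_SECONDS`, and the method names are hypothetical, and a real version would need the check-and-write to be a single atomic operation on storage shared by all workers.

```python
from dataclasses import dataclass
from typing import Dict

LEASE_SECONDS = 120.0  # hypothetical lease duration


@dataclass
class Lease:
    worker: str
    expires_at: float


class ResyncLeaseTable:
    """In-memory stand-in for a shared table of per-room re-sync leases."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, room_id: str, worker: str, now: float) -> bool:
        """Take or renew the lease for `room_id`; return True on success.

        Fails only while a *different* worker holds an unexpired lease.
        """
        current = self._leases.get(room_id)
        if (
            current is not None
            and current.worker != worker
            and current.expires_at > now
        ):
            return False
        # Free, expired, or already ours: (re)start the lease from `now`.
        self._leases[room_id] = Lease(worker, now + LEASE_SECONDS)
        return True

    def release(self, room_id: str, worker: str) -> None:
        """Drop the lease once the re-sync finishes, if we still hold it."""
        current = self._leases.get(room_id)
        if current is not None and current.worker == worker:
            del self._leases[room_id]


table = ResyncLeaseTable()
room = "!abc:example.org"
table.try_acquire(room, "worker1", now=0.0)    # worker1 starts the re-sync
table.try_acquire(room, "worker2", now=10.0)   # refused: lease still live
table.try_acquire(room, "worker2", now=500.0)  # worker1 stopped renewing, so worker2 takes over
```

The holder renews by calling `try_acquire` again before the lease runs out. If it crashes or is taken out of service, the lease lapses and any other worker can pick the room up, so the room is never pinned to one worker forever.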