Open verymilan opened 1 year ago
I think this issue is mostly that "message retention is expensive and must be done on the main process"?
It seems like there are a few issues here though:
I'm unsure of the best approach to solve this, or if all 3 are somewhat needed.
We want to give our users a message retention promise, but it's currently not possible on large servers with databases in the hundreds of gigabytes.
Message retention purging being expensive.
I'm not sure this is solvable. Purging involves touching potentially lots of rows.
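To make the cost concrete, here is a minimal sketch of deleting old rows in small batches so that no single transaction touches an unbounded number of rows. It is not Synapse's purge code: the `events` table, the `ts` column, and the `purge_in_batches` function are hypothetical names for illustration, and it uses SQLite only so the example is self-contained (the server in this report runs PostgreSQL). Batching does not reduce the total work, it only bounds how much is done per step.

```python
import sqlite3


def purge_in_batches(conn, cutoff_ts, batch_size=1000):
    """Delete rows older than cutoff_ts in short transactions.

    Hypothetical helper: returns the total number of rows deleted.
    """
    total = 0
    while True:
        # One short transaction per batch; `with conn` commits on success.
        with conn:
            cur = conn.execute(
                "DELETE FROM events WHERE rowid IN ("
                "SELECT rowid FROM events WHERE ts < ? LIMIT ?)",
                (cutoff_ts, batch_size),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        # A real job could pause here to let other work proceed.


# Toy data: 5000 rows with timestamps 0..4999 (made-up values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f"$e{i}", i) for i in range(5000)],
)
conn.commit()

deleted = purge_in_batches(conn, cutoff_ts=4000, batch_size=500)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(deleted, remaining)
```

Even in this form the job still has to visit every expired row, so on a very large database it stays expensive; the batching only keeps each individual step short.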
It seems to me there are a couple possibilities:
Move the purge_history function mentioned in the above issue to a place where workers can access it, thereby moving the purge job off the main process
But since I'm not as familiar with Synapse's internals as the devs, I can't say which is the most attractive. This issue is very painful for us, though, and it's difficult to explain to stakeholders that despite this feature being well documented, it in fact breaks the server and has been like that for quite a while.
This recent commit removing the experimental warning from retention is not a good idea IMO. While it's fabulous there is no longer risk of corruption bugs, as this issue establishes, there is a serious bug in the retention feature as it exists: your server will essentially stop working (if you have a large number of events that need purging).
fb664 Remove warnings from the docs about using message retention.
Description
Presumably due to https://github.com/matrix-org/synapse/pull/13632, the master process is unable to handle replication requests from workers because of the load from purge jobs. It happily logs updates on the purge job states while clients can no longer connect.
Steps to reproduce
Homeserver
tchncs.de
Synapse Version
1.94.0
Installation Method
pip (from PyPI)
Database
PostgreSQL
Workers
Multiple workers
Platform
Debian GNU/Linux 12 (bookworm), dedicated
Configuration
draupnir module, presence, retention
Relevant log output