replication timeouts due to message retention purge jobs

verymilan commented 1 year ago

Description

Assumingly due to https://github.com/matrix-org/synapse/pull/13632, the master process is unable to handle replication requests by workers due to the load from purge jobs. It is happily logging updates on the purge job states while clients can't connect anymore.

Steps to reproduce

enable message retention
maybe be a big instance idk
wait for the scheduled job to execute

Homeserver

tchncs.de

Synapse Version

1.94.0

Installation Method

pip (from PyPI)

Database

PostgreSQL

Workers

Multiple workers

Platform

Debian GNU/Linux 12 (bookworm), dedicated

Configuration

draupnir module, presence, retention

Relevant log output

synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to  POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]

clokep commented 1 year ago

I think this issue is mostly that "message retention is expensive and must be done on the main process"?

clokep commented 1 year ago

It seems like there's a few issues here though:

Message retention purging being expensive.
Message retention being handled by the main process.
Replication falling behind due to message retention using all CPU.

I'm unsure of the best approach to solve this, or if all 3 are somewhat needed.

abeluck commented 11 months ago

We want to give our users a message retention promise, but its currently not possible on large servers with databases in the hundreds gigabytes.

Message retention purging being expensive.

I'm not sure this is solvable. Purging involves touching potentially lots of rows.

It seems to me there are a couple possibilities:

Leave purge jobs on the primary process, but cap their run time or do some sort of coroutine style yielding to allow replication traffic to get cpu time. (There is prior art in limiting background jobs to a certain duration)
Move the purge job to a worker. I know this was tried before but was switched back to the primary process for what seems like lack of dev time, rather than a technical reason (though obviously communication happens outside Github issues)
- This would require moving the purge_history function mentioned in the above issue to a place where workers can access it, thereby moving the purge job off the main process

But since I'm not as familiar with synapse's internals as the devs I can't say which is the most attractive. But this issue is very painful for us and its difficult to explain to stakeholders that despite this feature being well documented, it in fact breaks the server and has been like that for quite a while.

abeluck commented 11 months ago

This recent commit removing the experimental warning from retention is not a good idea IMO. While it's fabulous there is no longer risk of corruption bugs, as this issue establishes, there is a serious bug in the retention feature as it exists: your server will essentially stop working (if you have a large number of events that need purging).

fb664 Remove warnings from the docs about using message retention.

matrix-org / synapse