matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

synapse performs inordinate amount of IO on DELETE FROM on the device_inbox table #16621

Closed Valodim closed 11 months ago

Valodim commented 11 months ago

Description

Some users have been complaining about slowness on my homeserver lately. Looking into it for a bit, I noticed that postgres is spending a large amount of time in DELETE FROM statements on the device_inbox table, and performs a huge amount of IO on these DELETE FROM statements:

image

This behavior is currently consistent: every minute or so, a bunch of postgres workers run these DELETE FROM statements, reading gigabytes of data.

This has caused consistent large amount of load since roughly the 12th, so two days after the v1.94.0 release. I don't precisely remember when I performed that update, typically I do them relatively quickly though.

image

This is a VM running only postgres, synapse, and a few related services. No major changes in config were made around the 12th, the last thing that happened was addition of mjolnir on august 25th. Synapse has had constant load of these DELETE FROM queries since then.

Steps to reproduce

sorry, no idea how to reproduce. if I can give any more insight I'll try to provide it, up to a call to inspect the server if that helps.

Homeserver

matrix.stratum0.org

Synapse Version

1.95.1

Installation Method

Debian packages from packages.matrix.org

Database

postgres, single server, never ported

Workers

Single process

Platform

Debian in a VM

Configuration

Presence is on :shrug:

Relevant log output

matrix=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
               nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
          FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
                                    query                                    | calls  |    total_time    |  rows   |     hit_percent     
-----------------------------------------------------------------------------+--------+------------------+---------+---------------------
 DELETE FROM device_inbox WHERE ctid IN (                                   +|   4227 | 125140420.808502 | 4213183 | 83.0263107880089348
                   SELECT ctid FROM device_inbox                            +|        |                  |         | 
                   WHERE user_id = $1 AND device_id = $2 AND stream_id <= $3+|        |                  |         | 
                   LIMIT $4                                                 +|        |                  |         | 
                 )                                                           |        |                  |         | 
 DELETE FROM device_inbox WHERE ctid IN (                                   +|   3414 | 105166717.010538 |   45377 | 69.3587470823784226
                   SELECT ctid FROM device_inbox                            +|        |                  |         | 
                   WHERE user_id = $1 AND device_id = $2 AND stream_id <= $3+|        |                  |         | 
                                                                            +|        |                  |         | 
                 )                                                           |        |                  |         | 
 SELECT COUNT(*) FROM e2e_room_keys WHERE user_id = $1 AND version = $2      |   2043 | 649328.821114998 |    2043 | 96.4265751586556058
 SELECT DISTINCT substring(state_key FROM $1)                               +| 231678 | 320314.261538005 | 6226101 | 97.9818494936092724
                 FROM current_state_events                                  +|        |                  |         | 
                 WHERE                                                      +|        |                  |         | 
                     type = $2                                              +|        |                  |         | 
                     AND membership = $3                                    +|        |                  |         | 
                     AND room_id = $4                                        |        |                  |         | 
 select count(*) from device_inbox                                           |     10 |    304487.276555 |      10 |  8.8124870074439053
(5 rows)

There are also errors like this in the log, relating to this statement:

Oct 20 22:24:44 metiadoc matrix-synapse[769]: 2023-10-20 22:24:44,127 - synapse.util.task_scheduler - 362 - ERROR - task-delete_device_messages-368-oyCvahTgzGOLEzSl- schedul>
Oct 20 22:24:44 metiadoc matrix-synapse[769]: Traceback (most recent call last):
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/util/task_scheduler.py", line 359, in wrapper
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     (status, result, error) = await function(task)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/device.py", line 406, in _delete_device_messages
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     res = await self.store.delete_messages_for_device(
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return await func(*args, **kwargs)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/databases/main/deviceinbox.py", line 493, in dele>
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     count = await self.db_pool.runInteraction(
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 934, in runInteraction
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return await delay_cancellation(_runInteraction())
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/internet/defer.py", line 1993, in _inlineCallbacks
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     result = context.run(
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenera>
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return g.throw(self.type, self.value, self.tb)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 900, in _runInteraction
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     result = await self.runWithConnection(
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 1029, in runWithConnection
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return await make_deferred_yieldable(
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     result = inContext.theWork()  # type: ignore[attr-defined]
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return self.currentContext().callWithContext(ctx, func, *args, **kw)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return func(*args, **kw)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 282, in _runWithConnection
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     result = func(conn, *args, **kw)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 1022, in inner_func
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return func(db_conn, *args, **kwargs)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 762, in new_transaction
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     r = func(cursor, *args, **kwargs)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/databases/main/deviceinbox.py", line 490, in dele>
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     txn.execute(sql, (user_id, device_id, up_to_stream_id))
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 421, in execute
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     self._do_execute(self.txn.execute, sql, parameters)
Oct 20 22:24:44 metiadoc matrix-synapse[769]:   File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/database.py", line 473, in _do_execute
Oct 20 22:24:44 metiadoc matrix-synapse[769]:     return func(sql, *args, **kwargs)
Oct 20 22:24:44 metiadoc matrix-synapse[769]: psycopg2.errors.QueryCanceled: canceling statement due to statement timeout


### Anything else that would be useful to know?

_No response_
Valodim commented 11 months ago

Ah, one more thing: The device_inbox table currently has ~4M values (which seems huge for ~130 active users, btw). And yesterday it was ~8.5M. Due to the timeouts happening, I increased the postgres connection limit. possibly because of this the delete statements started working and entries were actually deleted.

I'll keep investigating that query. It seems that postgres just isn't using the index to run it :thinking:

DMRobertson commented 11 months ago

I noticed that postgres is spending a large amount of time in DELETE FROM statements on the device_inbox table

https://github.com/matrix-org/synapse/pull/16492 may help with this. That's in 1.96.0rc1.

Valodim commented 11 months ago

ah, the issue described here goes back to https://github.com/matrix-org/synapse/issues/16479#issuecomment-1760363608. This issue caused many terabytes of useless IO for my relatively small instance and severely impacted responsiveness for some users, so glad it got sorted out quick.

nothing to see here, then. closing issue :)