matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Very slow deletes from `device_inbox` during `/sync` requests #16479

Closed by richvdh 1 year ago

richvdh commented 1 year ago

Description

Synchrotron workers are maxing out db connections, resulting in multi-minute scheduling time:

[image: graph of multi-minute db connection scheduling times]

Inspection shows that these are operations of the shape

                DELETE FROM device_inbox WHERE ctid IN (
                  SELECT ctid FROM device_inbox
                  WHERE user_id = '@<user_id>' AND device_id = 'LTHNOYWSMJ' AND stream_id <= 7526430
                  LIMIT 100
                )

Some queries have LIMIT 1000 rather than LIMIT 100.
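The batched-delete shape above can be sketched end to end. The following is a minimal, hypothetical reproduction, using SQLite's rowid as a stand-in for Postgres's ctid; the table layout, user/device IDs, and stream values are assumptions for illustration, not Synapse's actual schema or code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_inbox (user_id TEXT, device_id TEXT, stream_id INTEGER)"
)
# 250 pending to-device messages for one device (assumed values).
conn.executemany(
    "INSERT INTO device_inbox VALUES (?, ?, ?)",
    [("@alice:example.org", "LTHNOYWSMJ", i) for i in range(250)],
)

def delete_batch(limit=100):
    """Delete up to `limit` matching rows; return the number deleted."""
    cur = conn.execute(
        """
        DELETE FROM device_inbox WHERE rowid IN (
            SELECT rowid FROM device_inbox
            WHERE user_id = ? AND device_id = ? AND stream_id <= ?
            LIMIT ?
        )
        """,
        ("@alice:example.org", "LTHNOYWSMJ", 7526430, limit),
    )
    return cur.rowcount

# Loop until a batch comes back short, i.e. all matching rows are gone.
while delete_batch() == 100:
    pass
```

On SQLite this is unproblematic because rowid lookups are cheap; the issue described here is that Postgres 11's planner turns the equivalent ctid anti-join into a sequential scan of the whole table.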

Steps to reproduce

Homeserver

EMS host

Synapse Version

1.94.0

Installation Method

Docker (matrixdotorg/synapse)

Database

postgres 11

Workers

Multiple workers

Platform

k8s

Configuration

No response

Relevant log output

...

Anything else that would be useful to know?

No response

richvdh commented 1 year ago

An EXPLAIN on the queries in question shows that they are sequentially scanning the table to do the delete:

Delete on device_inbox  (cost=8.60..2559182.13 rows=1 width=36)
  ->  Nested Loop  (cost=8.60..2559182.13 rows=1 width=36)
        Join Filter: (device_inbox.ctid = "ANY_subquery".ctid)
        ->  Unique  (cost=8.60..8.61 rows=1 width=36)
              ->  Sort  (cost=8.60..8.61 rows=1 width=36)
                    Sort Key: "ANY_subquery".ctid
                    ->  Subquery Scan on "ANY_subquery"  (cost=0.56..8.59 rows=1 width=36)
                          ->  Limit  (cost=0.56..8.58 rows=1 width=6)
                                ->  Index Scan using device_inbox_user_stream_id on device_inbox device_inbox_1  (cost=0.56..8.58 rows=1 width=6)
                                      Index Cond: ((user_id = '<user_id>'::text) AND (device_id = 'SAJAWKWGHO'::text) AND (stream_id <= 7535635))
        ->  Seq Scan on device_inbox  (cost=0.00..2376344.90 rows=14626290 width=6)

Apparently the query optimiser isn't smart enough to know that it would do better searching for each of the 100 ctids individually than scanning the whole table.

richvdh commented 1 year ago

This code changed recently in https://github.com/matrix-org/synapse/pull/16240 and again in #16312.

A couple of thoughts:

- For the cleanup when a device is deleted, we should probably pick a stream_id to delete up to for each batch, rather than messing with ctid.
- I'm surprised it was necessary to change the code in the SyncHandler at all. We know that it will be deleting at most 100 rows, because they are exactly the rows that were served to the client on the previous /sync. AFAICT there is no need to LIMIT the delete, and therefore no need to mess about with ctid and no need for the fallback-to-async-task.

MatMaul commented 1 year ago

> Apparently the query optimiser isn't smart enough to know that it would do better searching for each of the 100 ctids individually than scanning the whole table.

Sadness... https://postgrespro.com/list/thread-id/2446912

> For the cleanup when a device is deleted, we should probably pick a stream_id to delete up to for each batch, rather than messing with ctid.

Something along these lines?

DELETE FROM device_inbox WHERE user_id = ? AND device_id = ? AND stream_id <= (
  SELECT MAX(stream_id) FROM device_inbox
  WHERE user_id = '@<user_id>' AND device_id = 'LTHNOYWSMJ' AND stream_id <= 7526430
  LIMIT 100
)

> I'm surprised it was necessary to change the code in the SyncHandler at all. We know that it will be deleting at most 100 rows, because they are exactly the rows that were served to the client on the previous /sync. AFAICT there is no need to LIMIT the delete, and therefore no need to mess about with ctid and no need for the fallback-to-async-task.

:facepalm: Indeed, I just didn't think about it.

kegsay commented 1 year ago

The query planner woes sounds similar to https://github.com/matrix-org/sliding-sync/commit/fa6746796c0391ea269485b3167195cc386b43a1 except in that case we wanted a seq scan.

richvdh commented 1 year ago

> > Apparently the query optimiser isn't smart enough to know that it would do better searching for each of the 100 ctids individually than scanning the whole table.
>
> Sadness... https://postgrespro.com/list/thread-id/2446912
>
> > For the cleanup when a device is deleted, we should probably pick a stream_id to delete up to for each batch, rather than messing with ctid.
>
> Something along these lines?
>
>     DELETE FROM device_inbox WHERE user_id = ? AND device_id = ? AND stream_id <= (
>       SELECT MAX(stream_id) FROM device_inbox
>       WHERE user_id = '@<user_id>' AND device_id = 'LTHNOYWSMJ' AND stream_id <= 7526430
>       LIMIT 100
>     )

Yes. Except you need an ORDER BY stream_id in the inner query.

And I don't think you need stream_id <= 7526430 for device deletions (we need to delete all messages for the given device, irrespective of stream_id). It's harmless, but redundant.
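Incorporating that correction, one way to write the batch is to ORDER BY stream_id inside a subquery, take its MAX, and delete up to that bound. A sketch follows, with SQLite standing in for Postgres; the table layout and values are assumptions for illustration, not Synapse's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_inbox (user_id TEXT, device_id TEXT, stream_id INTEGER)"
)
# 250 pending messages for one device (assumed values).
conn.executemany(
    "INSERT INTO device_inbox VALUES (?, ?, ?)",
    [("@alice:example.org", "LTHNOYWSMJ", i) for i in range(250)],
)

# Delete one batch: find the 100th-smallest stream_id for the device,
# then delete everything up to it -- no ctid involved.
cur = conn.execute(
    """
    DELETE FROM device_inbox
    WHERE user_id = ? AND device_id = ? AND stream_id <= (
        SELECT MAX(stream_id) FROM (
            SELECT stream_id FROM device_inbox
            WHERE user_id = ? AND device_id = ?
            ORDER BY stream_id
            LIMIT 100
        )
    )
    """,
    ("@alice:example.org", "LTHNOYWSMJ", "@alice:example.org", "LTHNOYWSMJ"),
)
# One batch removes the 100 oldest messages; the rest stay for later batches.
```

The bound is computed entirely from the `(user_id, device_id, stream_id)` index, so Postgres can satisfy both the subquery and the outer delete with index scans rather than a sequential scan.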

MatMaul commented 1 year ago

> And I don't think you need stream_id <= 7526430 for device deletions (we need to delete all messages for the given device, irrespective of stream_id). It's harmless, but redundant.

Someone can create a device with the same ID, and without the check it would then delete the messages aimed at the new device.

richvdh commented 1 year ago

well, that would be a stupid thing to do :)

But fair enough. Stick a comment in, though - that's quite subtle!
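The subtlety can be shown concretely: with the stream_id bound, deleting the old device's messages leaves messages for a re-created device with the same ID untouched. A SQLite sketch with an assumed layout and made-up stream positions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_inbox (user_id TEXT, device_id TEXT, stream_id INTEGER)"
)
# Messages for the original device (stream_ids 1-3); the device is then
# deleted and re-created with the same ID, and a new message arrives at 10.
conn.executemany(
    "INSERT INTO device_inbox VALUES (?, ?, ?)",
    [("@u:hs", "DEV", s) for s in (1, 2, 3, 10)],
)

# Bounded delete: only remove messages up to the stream position at which
# the original device was deleted (3 here, an assumed value).
conn.execute(
    "DELETE FROM device_inbox WHERE user_id = ? AND device_id = ? AND stream_id <= ?",
    ("@u:hs", "DEV", 3),
)
remaining = conn.execute("SELECT stream_id FROM device_inbox").fetchall()
# Without the bound, the new device's message (stream_id 10) would be lost too.
```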

MatMaul commented 1 year ago

I don't get the same plan with my deploy on pg15:

 Delete on device_inbox  (cost=1801.45..1962.75 rows=0 width=0)
   ->  Nested Loop  (cost=1801.45..1962.75 rows=40 width=36)
         ->  HashAggregate  (cost=1801.45..1801.85 rows=40 width=36)
               Group Key: "ANY_subquery".ctid
               ->  Subquery Scan on "ANY_subquery"  (cost=1641.08..1801.35 rows=40 width=36)
                     ->  Limit  (cost=1641.08..1800.95 rows=40 width=6)
                           ->  Bitmap Heap Scan on device_inbox device_inbox_1  (cost=1641.08..1800.95 rows=40 width=6)
                                 Recheck Cond: ((user_id = '@mxid'::text) AND (device_id = 'DEVICEID'::text))
                                 ->  Bitmap Index Scan on device_inbox_user_stream_id  (cost=0.00..1641.07 rows=40 width=0)
                                       Index Cond: ((user_id = '@mxid'::text) AND (device_id = 'DEVICEID'::text))
         ->  Tid Scan on device_inbox  (cost=0.00..4.01 rows=1 width=6)
               TID Cond: (ctid = "ANY_subquery".ctid)
(12 rows)

With #16492 , I get:

 Delete on device_inbox  (cost=2178.99..2231.08 rows=0 width=0)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=1646.15..1646.16 rows=1 width=8)
           ->  Limit  (cost=0.67..1645.65 rows=40 width=8)
                 ->  Index Only Scan using device_inbox_user_stream_id on device_inbox device_inbox_1  (cost=0.67..1645.65 rows=40 width=8)
                       Index Cond: ((user_id = '@mat:tout.im'::text) AND (device_id = 'CCUJZYMJLP'::text))
   ->  Bitmap Heap Scan on device_inbox  (cost=532.83..584.92 rows=13 width=6)
         Recheck Cond: ((user_id = '@mat:tout.im'::text) AND (device_id = 'CCUJZYMJLP'::text) AND (stream_id <= $0))
         ->  Bitmap Index Scan on device_inbox_user_stream_id  (cost=0.00..532.83 rows=13 width=0)
               Index Cond: ((user_id = '@mat:tout.im'::text) AND (device_id = 'CCUJZYMJLP'::text) AND (stream_id <= $0))
(10 rows)

So it's probably a tad slower on more modern PG, but at least we will (?) not get a seqscan on older ones.

@richvdh could you do a comparison with your test DB to check the new query is ok on PG11?

Also note that my new query deletes 10 rows and not 12; I need to check why.

MatMaul commented 1 year ago

The indicated row count is the number of lines in the plan output, not the number of deleted records :sob:.

I have executed the two inner SELECTs and they do return the same number of rows for my test case (125 rows).

clokep commented 1 year ago

I think this was fixed by #16491?

richvdh commented 1 year ago

#16491 will only have fixed half the problem.

To-device message deletions happen in two places: 1) after they are successfully sent to the target device via /sync; 2) when the target device is deleted. Both were using delete_messages_for_device_txn which uses a very inefficient query plan on Postgres 11.

#16491 changes the /sync case to no longer use delete_messages_for_device_txn and therefore fixes case 1, but it does not change the query plan in that function, so it does not fix case 2.

richvdh commented 1 year ago

oh, in fairness, the subject of this issue is just about the /sync requests, so yes, I guess it is fixed. Sorry.

For completeness, the device-deletion side looks to be covered in #16492.