matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Separate datastore for devices? #12906

Open Fizzadar opened 2 years ago

Fizzadar commented 2 years ago

Description:

We are currently experimenting with different ways to scale out the Synapse database, in particular whether tables could be divided amongst separate database instances, much like the state tables/datastore class.

Based on my analysis, it should be possible to extract the following device/e2e-related stores into a separate datastore instance:

I picked these because they're fairly small overall, have low inter-dependency, and represent a high percentage of database IO on our instance (currently a single database holding all tables).

Note: the one interdependency this misses is the `populate_monthly_active_users` call in `client_ips.py`, which could become `self.hs.get_datastores().device.populate_monthly_active_users(user_id)`.
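To make the proposed call-site change concrete, here is a minimal, self-contained sketch (not Synapse code) of the pattern: a hypothetical `device` datastore hung off `hs.get_datastores()` alongside the existing ones, with a toy in-memory stand-in for the monthly-active-users upsert.

```python
class DeviceStore:
    """Hypothetical separate datastore holding the device/e2e tables."""

    def __init__(self) -> None:
        # In Synapse this would be rows in monthly_active_users; a set stands in here.
        self.monthly_active: set[str] = set()

    def populate_monthly_active_users(self, user_id: str) -> None:
        self.monthly_active.add(user_id)


class Datastores:
    """Container mirroring the shape of Synapse's datastore collection."""

    def __init__(self) -> None:
        self.device = DeviceStore()  # the new, proposed datastore


class HomeServer:
    def __init__(self) -> None:
        self._datastores = Datastores()

    def get_datastores(self) -> Datastores:
        return self._datastores


hs = HomeServer()
# The call shape proposed in the issue:
hs.get_datastores().device.populate_monthly_active_users("@alice:example.com")
print("@alice:example.com" in hs.get_datastores().device.monthly_active)  # True
```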

Is there any appetite for this? We can commit engineering time to implement this if so. Also keen to discuss any other groups of stores that may be suitable candidates.

dklimpel commented 2 years ago

Related to #11491

squahtx commented 2 years ago

AIUI we'd lose the ability to have foreign keys between the device tables and the rest of the database, which is unfortunate (not that we seem to have any).

Do you know if the database IO for those tables is mostly reads or writes? If they're mostly reads I'd be in favour of adding support for read replicas.
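For illustration, read-replica support at its simplest amounts to routing: writes go to the primary, reads rotate across replicas. This toy sketch (connection objects are just stand-in strings, and the `SELECT`-prefix check is a deliberate oversimplification) shows the idea:

```python
import itertools


class ReplicaRouter:
    """Toy read/write router: writes hit the primary, reads round-robin
    across replicas. Real code would inspect transactions, not SQL prefixes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def conn_for(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.conn_for("SELECT * FROM e2e_room_keys"))    # replica-1
print(router.conn_for("INSERT INTO devices VALUES (1)")) # primary
print(router.conn_for("select count(*) FROM user_ips"))  # replica-2
```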

Fizzadar commented 2 years ago

It’s a pretty even mix of reads and writes. Read replicas would certainly be very helpful (across all of Synapse) but I suspect they would require significant changes, both in code and operationally, to be supported.

The reason I suggested splitting out devices is because it seems like a logically separate group of tables (like state groups) that needn’t have any in-db joins even in the future.

Full context: as part of this same investigation I was considering how eventually the synapse events/rooms tables could be sharded by room id, which would in theory provide near infinite scale. Separating non room related tables is kind of a first step towards that.
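The room-id sharding idea above boils down to a deterministic room-to-shard mapping that every worker can compute independently. A minimal sketch (hash-mod placement is an assumption for illustration, not a concrete Synapse design, and a real deployment would also need a rebalancing story):

```python
import hashlib


def shard_for_room(room_id: str, n_shards: int) -> int:
    """Map a room ID to a shard deterministically. Hash-based, so every
    worker agrees without coordination; plain modulo hashing, however,
    reshuffles almost every room when n_shards changes."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


shard = shard_for_room("!abc123:example.com", 16)
assert 0 <= shard < 16
# The same room always lands on the same shard:
assert shard == shard_for_room("!abc123:example.com", 16)
```

Consistent hashing or an explicit room-to-shard directory table would avoid the resharding churn, at the cost of extra machinery.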

erikjohnston commented 2 years ago

Hi @Fizzadar, this is something we discussed a bit as a team. We think it's totally feasible, we just have a couple of reservations:

  1. Splitting these out into a separate DB is not without cost, e.g. lack of foreign keys, etc. It also potentially ties us down somewhat for future features, as we won't e.g. be able to join across tables on different databases. So we just want to make sure that the benefit outweighs the cost.
  2. We're surprised that you're seeing those tables use so much IO, and we'd love to understand what's going on before we commit to splitting those tables out. There's no point doing the work if it turns out that there is a bug somewhere, or an inefficient query.

So, to move forwards here, could you share more about what you're seeing? Ideally we'd like to know exactly which queries are using the IO, but we're not sure how granular your data is.

Fizzadar commented 2 years ago

Thanks for looking into this @erikjohnston! I pulled some DB stats on the highest read tables, combining both table + index blocks together to get the following rates (prom query here also):

```
(sum by (relname) (rate(pg_statio_all_tables_heap_blks_read{app="synapse-postgres-exporter"} [7d])))
  + (sum by (relname) (rate(pg_statio_all_tables_idx_blks_read{app="synapse-postgres-exporter"} [7d])))
```
| table | total | data | index |
| --- | --- | --- | --- |
| e2e_room_keys | 1270 | 8.50 | 1261 |
| room_memberships | 2433 | 1229 | 1204 |
| events | 1591 | 649 | 942 |
| event_auth_chains | 1135 | 465 | 670 |
| user_ips | 614 | 0.0322 | 614 |
| state_groups_state | 1064 | 623 | 440 |
| event_auth_chain_links | 778 | 469 | 309 |
| event_json | 388 | 133 | 255 |
| event_edges | 206 | 15.9 | 191 |
| state_group_edges | 681 | 517 | 164 |

This aligns with other charts indicating that the e2e_room_keys table is pretty heavy, so splitting it out would be the biggest gain in read performance for us here. A large number of these reads were count queries whose results were unused in all the clients, so we've disabled them, perhaps temporarily (this commit & this commit).


In regards to the general issue here though: our aim is ultimately to shard the events & event_json tables, plus whatever is needed to facilitate that. I think splitting out the device storage could be a small iteration towards that goal, as a logical group of independent tables/APIs. It may make sense to bring additional tables/storage classes in as well.

Fizzadar commented 2 years ago

So this morning I added a new index on room_id to the e2e_room_keys table, which has all but stopped the reads there. I still think separating the datastores is useful for other reasons, but the IO issue we started with is fixed by the index (will make a PR for that when I have the time).
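For reference, the index would look something like the following. This is a hedged sketch: the index name and exact definition are assumptions, not necessarily what the eventual PR will use, and `CONCURRENTLY` is just the usual way to avoid blocking writes on a live Postgres instance.

```sql
-- Assumed name/definition; Synapse would normally create this via a
-- registered background index update rather than a raw statement.
CREATE INDEX CONCURRENTLY IF NOT EXISTS e2e_room_keys_room_id
    ON e2e_room_keys (room_id);
```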