matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Separate datastore for devices? #12906

Open Fizzadar opened 2 years ago

Fizzadar commented 2 years ago

Description:

We are currently experimenting with different ways to scale out the Synapse database, in particular whether tables could be divided amongst separate database instances, much like the state tables/datastore class.

Based on my analysis, it should be possible to extract the following device/e2e-related stores into a separate datastore instance:

I picked these because they're fairly small overall, have low inter-dependency, and represent a high percentage of database IO on our instance (currently a single database holding all tables).

Note: the one interdependency this misses is the `populate_monthly_active_users` call in `client_ips.py`, which could become `self.hs.get_datastores().device.populate_monthly_active_users(user_id)`.
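To make the proposed call-site change concrete, here is a minimal, self-contained sketch (not Synapse code) of the pattern: a hypothetical `device` datastore hung off `hs.get_datastores()` alongside the existing ones, with a toy in-memory stand-in for the monthly-active-users upsert.

```python
class DeviceStore:
    """Hypothetical separate datastore holding the device/e2e tables."""

    def __init__(self) -> None:
        # In Synapse this would be rows in monthly_active_users; a set stands in here.
        self.monthly_active: set[str] = set()

    def populate_monthly_active_users(self, user_id: str) -> None:
        self.monthly_active.add(user_id)


class Datastores:
    """Container mirroring the shape of Synapse's datastore collection."""

    def __init__(self) -> None:
        self.device = DeviceStore()  # the new, proposed datastore


class HomeServer:
    def __init__(self) -> None:
        self._datastores = Datastores()

    def get_datastores(self) -> Datastores:
        return self._datastores


hs = HomeServer()
# The call shape proposed in the issue:
hs.get_datastores().device.populate_monthly_active_users("@alice:example.com")
print("@alice:example.com" in hs.get_datastores().device.monthly_active)  # True
```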

Is there any appetite for this? We can commit engineering time to implement this if so. Also keen to discuss any other groups of stores that may be suitable candidates.

dklimpel commented 2 years ago

Related to #11491

squahtx commented 2 years ago

AIUI we'd lose the ability to have foreign keys between the device tables and the rest of the database, which is unfortunate (not that we seem to have any).

Do you know if the database IO for those tables is mostly reads or writes? If they're mostly reads I'd be in favour of adding support for read replicas.
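For illustration, read-replica support at its simplest amounts to routing: writes go to the primary, reads rotate across replicas. This toy sketch (connection objects are just stand-in strings, and the `SELECT`-prefix check is a deliberate oversimplification) shows the idea:

```python
import itertools


class ReplicaRouter:
    """Toy read/write router: writes hit the primary, reads round-robin
    across replicas. Real code would inspect transactions, not SQL prefixes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def conn_for(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.conn_for("SELECT * FROM e2e_room_keys"))    # replica-1
print(router.conn_for("INSERT INTO devices VALUES (1)")) # primary
print(router.conn_for("select count(*) FROM user_ips"))  # replica-2
```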

Fizzadar commented 2 years ago

It’s a pretty even mix of reads and writes. Read replicas would certainly be very helpful (across all of Synapse) but I suspect they would require significant changes, both in code and operationally, to be supported.

The reason I suggested splitting out devices is because it seems like a logically separate group of tables (like state groups) that needn’t have any in-db joins even in the future.

Full context: as part of this same investigation I was considering how eventually the synapse events/rooms tables could be sharded by room id, which would in theory provide near infinite scale. Separating non room related tables is kind of a first step towards that.
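The room-id sharding idea above boils down to a deterministic room-to-shard mapping that every worker can compute independently. A minimal sketch (hash-mod placement is an assumption for illustration, not a concrete Synapse design, and a real deployment would also need a rebalancing story):

```python
import hashlib


def shard_for_room(room_id: str, n_shards: int) -> int:
    """Map a room ID to a shard deterministically. Hash-based, so every
    worker agrees without coordination; plain modulo hashing, however,
    reshuffles almost every room when n_shards changes."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


shard = shard_for_room("!abc123:example.com", 16)
assert 0 <= shard < 16
# The same room always lands on the same shard:
assert shard == shard_for_room("!abc123:example.com", 16)
```

Consistent hashing or an explicit room-to-shard directory table would avoid the resharding churn, at the cost of extra machinery.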

erikjohnston commented 2 years ago

Hi @Fizzadar, this is something we discussed a bit as a team. We think it's totally feasible, we just have a couple of reservations:

  1. Splitting these out into a separate DB is not without cost, e.g. lack of foreign keys, etc. It also potentially ties us down somewhat for future features, as we won't e.g. be able to join across tables on different databases. So we just want to make sure that the benefit outweighs the cost.
  2. We're surprised that you're seeing those tables use so much IO, and we'd love to understand what's going on before we commit to splitting those tables out. There's no point doing the work if it turns out that there is a bug somewhere, or an inefficient query.

So, to move forwards here, could you share more about what you're seeing? Ideally we'd like to know exactly which queries are using the IO, but we're not sure how granular your data is.

Fizzadar commented 2 years ago

Thanks for looking into this @erikjohnston! I pulled some DB stats on the highest read tables, combining both table + index blocks together to get the following rates (prom query here also):

```
(sum by (relname) (rate(pg_statio_all_tables_heap_blks_read{app="synapse-postgres-exporter"} [7d])))
  + (sum by (relname) (rate(pg_statio_all_tables_idx_blks_read{app="synapse-postgres-exporter"} [7d])))
```
| table | total | data | index |
| --- | --- | --- | --- |
| e2e_room_keys | 1270 | 8.50 | 1261 |
| room_memberships | 2433 | 1229 | 1204 |
| events | 1591 | 649 | 942 |
| event_auth_chains | 1135 | 465 | 670 |
| user_ips | 614 | 0.0322 | 614 |
| state_groups_state | 1064 | 623 | 440 |
| event_auth_chain_links | 778 | 469 | 309 |
| event_json | 388 | 133 | 255 |
| event_edges | 206 | 15.9 | 191 |
| state_group_edges | 681 | 517 | 164 |

This aligns with other charts indicating that the e2e_room_keys table is pretty heavy, so splitting it out would be the biggest gain in read performance for us here. A large number of these reads were count queries whose results were unused in all the clients, so we've disabled them, perhaps temporarily (this commit & this commit).


In regards to the general issue here though: our aim is ultimately to shard the events & event_json tables, plus whatever is needed to facilitate that. I think splitting out the device storage could be a small iteration towards that goal, as a logical group of independent tables/APIs. It may make sense to bring additional tables/storage classes in as well.

Fizzadar commented 2 years ago

So this morning I added a new index on room_id to the e2e_room_keys table, which has all but stopped the reads there. I still think separating the datastores is useful for other reasons, but the IO issue we started with is fixed by the index (will make a PR for that when I have the time).
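For reference, the index would look something like the following. This is a hedged sketch: the index name and exact definition are assumptions, not necessarily what the eventual PR will use, and `CONCURRENTLY` is just the usual way to avoid blocking writes on a live Postgres instance.

```sql
-- Assumed name/definition; Synapse would normally create this via a
-- registered background index update rather than a raw statement.
CREATE INDEX CONCURRENTLY IF NOT EXISTS e2e_room_keys_room_id
    ON e2e_room_keys (room_id);
```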