matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.83k stars 2.13k forks source link

devices table has case-sensitive user_id which leads to not removing devices from user #16474

Closed Kokokokoka closed 11 months ago

Kokokokoka commented 1 year ago

Description

When I was searching for reasons of device_inbox bloat. I've found that there are many devices in device table which have last_seen null. those devices had wrong case in user_id f.e. @user:domain | master signing key but @ User:domain | Element Android... If one uses admin api to logout device last_seen is set to null but the device is not being removed from devices table. iff name is lowercase everything works fine. there seems to be a problem that case-insensitive names start being case sensitive and a problem that case-sensitive user_ids appeared in devices when they shouldn't. if i match user_id case from devices, delete devices api it returns: {"errcode":"M_NOT_FOUND","error":"Unknown user"}

Steps to reproduce

don't know how to reproduce this reliably as I don't know how new devices are being created in table and what leads to this behaviour. probably this bug could affect other tables too.

Homeserver

personal

Synapse Version

1.94

Installation Method

Debian packages from packages.matrix.org

Database

postgresql, single instanse

Workers

Multiple workers

Platform

debian linux 11

Configuration

i have device retention in logs and ldap auth

Relevant log output

no logs for now

Anything else that would be useful to know?

matrix-synapse-ldap3/oldstable,now 0.1.4+git20201015+a3c7a9f-1 all [installed]

clokep commented 1 year ago

See also #3599.

Kokokokoka commented 1 year ago

See also #3599. I started from this.

briefly:

  1. device_inbox size was around 300Gb,
  2. even though there was delete_stale_devices_after: 31d setting
  3. ./synapse/storage/databases/main/devices.py (get_devices_not_accessed since) on line 1418 states:
    
    SELECT user_id, device_id
                FROM devices WHERE last_seen < ? AND hidden = FALSE

4. the problem is, that it doesn't take in account devices which has NULL as last_seen.
5. one might ask how could one get last_seen NULL for a device?
6. the answer is, that: even though user_id in devices ought to be lowercase, there were user_id's which had different case in devices, even the users table is lowercase for this user (and it all worked somehow, and this looks bad)
7. if one uses admin api to delete stale devices, device_inbox is being cleaned out (300Gb->70Gb)
BUT: those devices are still in devices table, so they are not being deleted (because of different case in name, most likely)

As a result there are problems:
1. last_seen being NULL even it shouldn't be in devices
2. user_id in devices has different case even if it should have the same case (or even stronger it should have only lowercase user_id's) as in user table
3. devices not being deleted from devices table (if user_id has different case than in users table)
4. device_inbox not being cleaned out (by delete_stale_devices_after for users which has case in user_id in devices)
Kokokokoka commented 1 year ago

table sizes now:

  table_schema    |                   table_name                   | row_estimate  |   total    |   index    |   toast    |   table    |    total_size_share
--------------------+------------------------------------------------+---------------+------------+------------+------------+------------+------------------------
 public             | e2e_room_keys                                  | 2.8482464e+07 | 31 GB      | 4153 MB    | 8192 bytes | 27 GB      |     0.6107138870843075
 public             | event_json                                     |   2.46887e+06 | 7519 MB    | 160 MB     | 5111 MB    | 2248 MB    |    0.14399412532695674
 public             | event_search                                   |  1.217233e+06 | 4091 MB    | 244 MB     | 3597 MB    | 249 MB     |    0.07834378627345777
 public             | device_lists_changes_in_room                   |  7.960826e+06 | 2032 MB    | 1071 MB    | 8192 bytes | 961 MB     |   0.038907440085486786
 public             | device_inbox                                   |  1.523041e+06 | 1453 MB    | 168 MB     | 8192 bytes | 1286 MB    |   0.027835572593596928
 public             | events                                         |  2.474202e+06 | 1316 MB    | 776 MB     | 8192 bytes | 540 MB     |    0.02519612915542897
 public             | stream_ordering_to_exterm                      |  4.289306e+06 | 830 MB     | 351 MB     | 8192 bytes | 479 MB     |   0.015898154379537035
 public             | event_edges                                    |  2.680084e+06 | 803 MB     | 479 MB     | 8192 bytes | 324 MB     |   0.015372988772827124
 public             | event_to_state_groups                          |  2.251233e+06 | 343 MB     | 162 MB     | 8192 bytes | 181 MB     |   0.006569208298634001
 public             | users_who_share_private_rooms                  |  1.147815e+06 | 290 MB     | 153 MB     | 8192 bytes | 137 MB     |   0.005549399530732363
 public             | event_push_actions                             |        325251 | 182 MB     | 124 MB     | 8192 bytes | 58 MB      |  0.0034868901607904477
 public             | cache_invalidation_stream_by_instance          |        803852 | 167 MB     | 43 MB      | 8192 bytes | 124 MB     |  0.0031897451764810792
 public             | current_state_delta_stream                     |        844013 | 154 MB     | 14 MB      | 8192 bytes | 140 MB     |   0.002943321314871044
 public             | event_auth                                     |        821425 | 146 MB     | 22 MB      | 8192 bytes | 123 MB     |   0.002789811368294301
 public             | user_daily_visits                              |        538030 | 132 MB     | 27 MB      | 8192 bytes | 105 MB     |   0.002532614878854319
 public             | state_groups_state                             |        287455 | 58 MB      | 17 MB      | 8192 bytes | 42 MB      |  0.0011169120382021323
 public             | state_events                                   |        271970 | 55 MB      | 18 MB      | 8192 bytes | 38 MB      |  0.0010594580231090822
 public             | device_federation_outbox                       |         70731 | 54 MB      | 4400 kB    | 4904 kB    | 45 MB      |  0.0010364165691394735
 public             | room_memberships                               |        150085 | 54 MB      | 21 MB      | 8192 bytes | 32 MB      |  0.0010274393792811844
 public             | current_state_events                           |        176004 | 54 MB      | 29 MB      | 8192 bytes | 24 MB      |   0.001025793561140498
 public             | receipts_linearized                            |        115313 | 53 MB      | 31 MB      | 8192 bytes | 22 MB      |  0.0010207064868874675
 public             | event_auth_chains                              |        271969 | 50 MB      | 26 MB      | 8192 bytes | 24 MB      |  0.0009572676785555581
 public             | event_auth_chain_links                         |        511193 | 44 MB      | 15 MB      |            | 29 MB      |  0.0008517856977206616
 public             | receipts_graph                                 |        115313 | 40 MB      | 20 MB      | 8192 bytes | 20 MB      |  0.0007691955510244021
 public             | device_federation_inbox                        |        330010 | 37 MB      | 16 MB      | 8192 bytes | 21 MB      |   0.000712040775593295
 public             | event_push_summary                             |        124509 | 33 MB      | 15 MB      | 8192 bytes | 19 MB      |  0.0006367820006146382
 public             | state_group_edges                              |        309174 | 29 MB      | 16 MB      |            | 13 MB      |   0.000552097176284778
 public             | e2e_one_time_keys_json                         |         63116 | 25 MB      | 5696 kB    | 8192 bytes | 20 MB      |  0.0004855163515024674
 public             | account_data                                   |         54912 | 23 MB      | 8712 kB    | 1224 kB    | 14 MB      |  0.0004479617739286248

I was selecting stale devices by getting user_id, device_id pair from:

select *, sum(dev_cnt) over () as ttl_filt from (select distinct
        user_id,
        device_id,
        count(*) over (partition by user_id, device_id) as dev_cnt,
        count(*) over (partition by user_id) as user_cnt,
        count(*) over () as ttl
from
        device_inbox di
order by
        user_cnt desc,
        dev_cnt desc)x where dev_cnt > 5000;

after deleting those I have found that some of the messages in device_inbox are still being present. and there are user_id, device_id pairs which aren't present in devices.

can one clean manually device_inbox if user_id,device_id is not present in devices?

Kokokokoka commented 1 year ago

for now I want to manually convert user_id to lowercase in devices inbox, probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself| or changing the type of user_id column to citext (but, in all fairness this should be done by synapse team as a migration). sadly, I don't have enough time finding what causes adding user_id with different case in devices. and I'm afraid that this bug could interfere with other tables too, so it would be better if it would be possible finding the source of the problem and not fixing holes by patching db.

clokep commented 1 year ago

probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself

Note that usernames are case sensitive, except during login If, and only if, there are not ambiguous users. It is a bit unclear how these entries could have been added in the first place. Maybe a client issue is involved? But it isn't really clear.

Kokokokoka commented 1 year ago

probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself

Note that usernames are case sensitive, except during login If, and only if, there are not ambiguous users. It is a bit unclear how these entries could have been added in the first place. Maybe a client issue is involved? But it isn't really clear.

yep, there probably is some client issue as those users are lowercase in users table, but they have case in devices. ok. as i don't really know how to debug this further. I want to do this:

  1. use logout device on devices which has null in last access ts and which aren't hidden
  2. delete rows in device_inbox which are orphaned (no device in devices)
  3. delete rows in devices which has last_seen null and hidden is false

will this work? and by work i mean it wouldn't harm the inner synapse logic

reivilibre commented 11 months ago

(I notice from the description that LDAP auth is in use. LDAP auth has been sloppy with username cases in the past, so I'd reckon it's worth a look there to see if that might be part of the reason.)

erikjohnston commented 11 months ago

Ah, looks like the LDAP provider had a bug around case sensitivity: https://github.com/matrix-org/matrix-synapse-ldap3/pull/163 and https://github.com/matrix-org/matrix-synapse-ldap3/pull/168.

Hopefully that means therefore this issue is fixed (albeit the stale entries in the DB that haven't been cleaned up)?

Kokokokoka commented 11 months ago

Ah, looks like the LDAP provider had a bug around case sensitivity: matrix-org/matrix-synapse-ldap3#163 and matrix-org/matrix-synapse-ldap3#168.

Hopefully that means therefore this issue is fixed (albeit the stale entries in the DB that haven't been cleaned up)?

the question stands still: can one safely delete those devices and device_inbox. or is there something else that one should do beforehand|additionally? after this i'll check whether those user_id's appear in devices, or not.