Closed Kokokokoka closed 11 months ago
See also #3599.
See also #3599. I started from this.
briefly:
SELECT user_id, device_id
FROM devices WHERE last_seen < ? AND hidden = FALSE
4. the problem is, that it doesn't take in account devices which has NULL as last_seen.
5. one might ask how could one get last_seen NULL for a device?
6. the answer is, that: even though user_id in devices ought to be lowercase, there were user_id's which had different case in devices, even the users table is lowercase for this user (and it all worked somehow, and this looks bad)
7. if one uses admin api to delete stale devices, device_inbox is being cleaned out (300Gb->70Gb)
BUT: those devices are still in devices table, so they are not being deleted (because of different case in name, most likely)
As a result there are problems:
1. last_seen being NULL even it shouldn't be in devices
2. user_id in devices has different case even if it should have the same case (or even stronger it should have only lowercase user_id's) as in user table
3. devices not being deleted from devices table (if user_id has different case than in users table)
4. device_inbox not being cleaned out (by delete_stale_devices_after for users which has case in user_id in devices)
table sizes now:
table_schema | table_name | row_estimate | total | index | toast | table | total_size_share
--------------------+------------------------------------------------+---------------+------------+------------+------------+------------+------------------------
public | e2e_room_keys | 2.8482464e+07 | 31 GB | 4153 MB | 8192 bytes | 27 GB | 0.6107138870843075
public | event_json | 2.46887e+06 | 7519 MB | 160 MB | 5111 MB | 2248 MB | 0.14399412532695674
public | event_search | 1.217233e+06 | 4091 MB | 244 MB | 3597 MB | 249 MB | 0.07834378627345777
public | device_lists_changes_in_room | 7.960826e+06 | 2032 MB | 1071 MB | 8192 bytes | 961 MB | 0.038907440085486786
public | device_inbox | 1.523041e+06 | 1453 MB | 168 MB | 8192 bytes | 1286 MB | 0.027835572593596928
public | events | 2.474202e+06 | 1316 MB | 776 MB | 8192 bytes | 540 MB | 0.02519612915542897
public | stream_ordering_to_exterm | 4.289306e+06 | 830 MB | 351 MB | 8192 bytes | 479 MB | 0.015898154379537035
public | event_edges | 2.680084e+06 | 803 MB | 479 MB | 8192 bytes | 324 MB | 0.015372988772827124
public | event_to_state_groups | 2.251233e+06 | 343 MB | 162 MB | 8192 bytes | 181 MB | 0.006569208298634001
public | users_who_share_private_rooms | 1.147815e+06 | 290 MB | 153 MB | 8192 bytes | 137 MB | 0.005549399530732363
public | event_push_actions | 325251 | 182 MB | 124 MB | 8192 bytes | 58 MB | 0.0034868901607904477
public | cache_invalidation_stream_by_instance | 803852 | 167 MB | 43 MB | 8192 bytes | 124 MB | 0.0031897451764810792
public | current_state_delta_stream | 844013 | 154 MB | 14 MB | 8192 bytes | 140 MB | 0.002943321314871044
public | event_auth | 821425 | 146 MB | 22 MB | 8192 bytes | 123 MB | 0.002789811368294301
public | user_daily_visits | 538030 | 132 MB | 27 MB | 8192 bytes | 105 MB | 0.002532614878854319
public | state_groups_state | 287455 | 58 MB | 17 MB | 8192 bytes | 42 MB | 0.0011169120382021323
public | state_events | 271970 | 55 MB | 18 MB | 8192 bytes | 38 MB | 0.0010594580231090822
public | device_federation_outbox | 70731 | 54 MB | 4400 kB | 4904 kB | 45 MB | 0.0010364165691394735
public | room_memberships | 150085 | 54 MB | 21 MB | 8192 bytes | 32 MB | 0.0010274393792811844
public | current_state_events | 176004 | 54 MB | 29 MB | 8192 bytes | 24 MB | 0.001025793561140498
public | receipts_linearized | 115313 | 53 MB | 31 MB | 8192 bytes | 22 MB | 0.0010207064868874675
public | event_auth_chains | 271969 | 50 MB | 26 MB | 8192 bytes | 24 MB | 0.0009572676785555581
public | event_auth_chain_links | 511193 | 44 MB | 15 MB | | 29 MB | 0.0008517856977206616
public | receipts_graph | 115313 | 40 MB | 20 MB | 8192 bytes | 20 MB | 0.0007691955510244021
public | device_federation_inbox | 330010 | 37 MB | 16 MB | 8192 bytes | 21 MB | 0.000712040775593295
public | event_push_summary | 124509 | 33 MB | 15 MB | 8192 bytes | 19 MB | 0.0006367820006146382
public | state_group_edges | 309174 | 29 MB | 16 MB | | 13 MB | 0.000552097176284778
public | e2e_one_time_keys_json | 63116 | 25 MB | 5696 kB | 8192 bytes | 20 MB | 0.0004855163515024674
public | account_data | 54912 | 23 MB | 8712 kB | 1224 kB | 14 MB | 0.0004479617739286248
I was selecting stale devices by getting user_id, device_id pair from:
select *, sum(dev_cnt) over () as ttl_filt from (select distinct
user_id,
device_id,
count(*) over (partition by user_id, device_id) as dev_cnt,
count(*) over (partition by user_id) as user_cnt,
count(*) over () as ttl
from
device_inbox di
order by
user_cnt desc,
dev_cnt desc)x where dev_cnt > 5000;
after deleting those I have found that some of the messages in device_inbox are still being present. and there are user_id, device_id pairs which aren't present in devices.
can one clean manually device_inbox if user_id,device_id is not present in devices?
for now I want to manually convert user_id to lowercase in devices inbox, probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself| or changing the type of user_id column to citext (but, in all fairness this should be done by synapse team as a migration). sadly, I don't have enough time finding what causes adding user_id with different case in devices. and I'm afraid that this bug could interfere with other tables too, so it would be better if it would be possible finding the source of the problem and not fixing holes by patching db.
probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself
Note that usernames are case sensitive, except during login If, and only if, there are not ambiguous users. It is a bit unclear how these entries could have been added in the first place. Maybe a client issue is involved? But it isn't really clear.
probably one could add a trigger before insert which would do lowercase of user_id but it won't solve the bug itself
Note that usernames are case sensitive, except during login If, and only if, there are not ambiguous users. It is a bit unclear how these entries could have been added in the first place. Maybe a client issue is involved? But it isn't really clear.
yep, there probably is some client issue as those users are lowercase in users table, but they have case in devices. ok. as i don't really know how to debug this further. I want to do this:
will this work? and by work i mean it wouldn't harm the inner synapse logic
(I notice from the description that LDAP auth is in use. LDAP auth has been sloppy with username cases in the past, so I'd reckon it's worth a look there to see if that might be part of the reason.)
Ah, looks like the LDAP provider had a bug around case sensitivity: https://github.com/matrix-org/matrix-synapse-ldap3/pull/163 and https://github.com/matrix-org/matrix-synapse-ldap3/pull/168.
Hopefully that means therefore this issue is fixed (albeit the stale entries in the DB that haven't been cleaned up)?
Ah, looks like the LDAP provider had a bug around case sensitivity: matrix-org/matrix-synapse-ldap3#163 and matrix-org/matrix-synapse-ldap3#168.
Hopefully that means therefore this issue is fixed (albeit the stale entries in the DB that haven't been cleaned up)?
the question stands still: can one safely delete those devices and device_inbox. or is there something else that one should do beforehand|additionally? after this i'll check whether those user_id's appear in devices, or not.
Description
When I was searching for reasons of device_inbox bloat. I've found that there are many devices in device table which have last_seen null. those devices had wrong case in user_id f.e. @user:domain | master signing key but @ User:domain | Element Android... If one uses admin api to logout device last_seen is set to null but the device is not being removed from devices table. iff name is lowercase everything works fine. there seems to be a problem that case-insensitive names start being case sensitive and a problem that case-sensitive user_ids appeared in devices when they shouldn't. if i match user_id case from devices, delete devices api it returns: {"errcode":"M_NOT_FOUND","error":"Unknown user"}
Steps to reproduce
don't know how to reproduce this reliably as I don't know how new devices are being created in table and what leads to this behaviour. probably this bug could affect other tables too.
Homeserver
personal
Synapse Version
1.94
Installation Method
Debian packages from packages.matrix.org
Database
postgresql, single instanse
Workers
Multiple workers
Platform
debian linux 11
Configuration
i have device retention in logs and ldap auth
Relevant log output
Anything else that would be useful to know?
matrix-synapse-ldap3/oldstable,now 0.1.4+git20201015+a3c7a9f-1 all [installed]