matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.83k stars 2.12k forks source link

Public room list does not return results when searching for room title with uppercase umlaut #15419

Open subjugum opened 1 year ago

subjugum commented 1 year ago

Description

When searching for a room in the public room directory via Element, it will not appear when entering the room title. Manually scrolling down without entering any search terms will show the room, however. From my limited testing, this only seems to affect rooms with uppercase umlaut characters (ÄÖÜ). Room titles with said characters anywhere else will be found as expected, see examples below.

Steps to reproduce

In Element

Results are the same for requests sent to either main process and generic_workers. Examples with curl:

Room title is Apföl

# curl -s -H 'Authorization: Bearer xxx' -d '{ "filter": { "generic_search_term": "Apföl", "room_types": null } }' localhost:8008/_matrix/client/v3/publicRooms | jq '.chunk | .[]'
{
  "room_id": "!pRruVxBDfrjmlCdIyF:example.org",
  "name": "Apföl",
  "topic": "Thema\n",
  "canonical_alias": "#testraum:example.org",
  "num_joined_members": 3,
  "world_readable": false,
  "guest_can_join": true,
  "join_rule": "public"
}

Then change the room title to Öach

# curl -s -H 'Authorization: Bearer xxx' -d '{ "filter": { "generic_search_term": "Öach", "room_types": null } }' localhost:8008/_matrix/client/v3/publicRooms
{
  "chunk": [],
  "total_room_count_estimate": 30
}

However, removing the generic_search_term will give a hit

# curl -s -H 'Authorization: Bearer xxx' -d '{ "filter": { "generic_search_term": "", "room_types": null } }' localhost:8008/_matrix/client/v3/publicRooms | jq '.chunk | .[] | select(.name == "Öach")'
{
  "room_id": "!pRruVxBDfrjmlCdIyF:example.org",
  "name": "Öach",
  "topic": "Thema\n",
  "canonical_alias": "#testraum:example.org",
  "num_joined_members": 3,
  "world_readable": false,
  "guest_can_join": true,
  "join_rule": "public"
}

Homeserver

Local test install

Synapse Version

1.80.0

Installation Method

Debian packages from packages.matrix.org

Database

PostgresSQL (13.9-0+deb11u1, same host), fresh install

Workers

Multiple workers

Platform

VM, Debian 11

Configuration

No response

Relevant log output

# room title = Apföl, generic_search_term = "Apföl"
2023-04-11 18:35:04,955 - synapse.access.http.8008 - 415 - DEBUG - POST-121 - ::1 - 8008 - Received request: POST /_matrix/client/v3/publicRooms
2023-04-11 18:35:04,959 - synapse.storage.txn - 710 - DEBUG - POST-121 - [TXN START] {get_user_by_access_token-95}
2023-04-11 18:35:04,959 - synapse.storage.SQL - 426 - DEBUG - POST-121 - [SQL] {get_user_by_access_token-95} SELECT users.name as user_id, users.is_guest, users.shadow_banne
d, access_tokens.id as token_id, access_tokens.device_id, access_tokens.valid_until_ms, access_tokens.user_id as token_owner, access_tokens.used as token_used FROM users INN
ER JOIN access_tokens on users.name = COALESCE(puppets_user_id, access_tokens.user_id) WHERE token = ?
2023-04-11 18:35:04,959 - synapse.storage.SQL - 431 - DEBUG - POST-121 - [SQL values] {get_user_by_access_token-95} ('xxx',)
2023-04-11 18:35:04,960 - synapse.storage.SQL - 452 - DEBUG - POST-121 - [SQL time] {get_user_by_access_token-95} 0.000929 sec
2023-04-11 18:35:04,960 - synapse.storage.txn - 814 - DEBUG - POST-121 - [TXN END] {get_user_by_access_token-95} 0.001381 sec
2023-04-11 18:35:04,961 - synapse.storage.txn - 710 - DEBUG - POST-121 - [TXN START] {mark_access_token_as_used-96}
2023-04-11 18:35:04,961 - synapse.storage.SQL - 426 - DEBUG - POST-121 - [SQL] {mark_access_token_as_used-96} UPDATE access_tokens SET used = ? WHERE id = ?
2023-04-11 18:35:04,961 - synapse.storage.SQL - 431 - DEBUG - POST-121 - [SQL values] {mark_access_token_as_used-96} [True, 4539]
2023-04-11 18:35:04,962 - synapse.storage.SQL - 452 - DEBUG - POST-121 - [SQL time] {mark_access_token_as_used-96} 0.000772 sec
2023-04-11 18:35:04,962 - synapse.storage.txn - 814 - DEBUG - POST-121 - [TXN END] {mark_access_token_as_used-96} 0.001181 sec
2023-04-11 18:35:04,963 - synapse.handlers.room_list - 90 - INFO - POST-121 - Getting public room list: limit=100, since=None, search=True, network=ThirdPartyInstanceID(apps
ervice_id=None, network_id=None)
2023-04-11 18:35:04,963 - synapse.handlers.room_list - 101 - INFO - POST-121 - Bypassing cache as search request.
2023-04-11 18:35:04,963 - synapse.storage.txn - 710 - DEBUG - POST-121 - [TXN START] {get_largest_public_rooms-97}
2023-04-11 18:35:04,963 - synapse.storage.SQL - 426 - DEBUG - POST-121 - [SQL] {get_largest_public_rooms-97} SELECT room_id, name, topic, canonical_alias, joined_members, av
atar, history_visibility, guest_access, join_rules, room_type FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER 
JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) A
ND joined_members > 0 AND ( LOWER(name) LIKE ? OR LOWER(topic) LIKE ? OR LOWER(canonical_alias) LIKE ? ) ORDER BY joined_members DESC, room_id DESC LIMIT ?
2023-04-11 18:35:04,963 - synapse.storage.SQL - 431 - DEBUG - POST-121 - [SQL values] {get_largest_public_rooms-97} ['%apföl%', '%apföl%', '%apföl%', 101]
2023-04-11 18:35:04,964 - synapse.storage.txn - 710 - DEBUG - replication-USER_IP-55 - [TXN START] {get_user_by_id-98}
2023-04-11 18:35:04,964 - synapse.storage.SQL - 426 - DEBUG - replication-USER_IP-55 - [SQL] {get_user_by_id-98} SELECT name, password_hash, is_guest, admin, consent_version, consent_ts, consent_server_notice_sent, appservice_id, creation_ts, user_type, deactivated, COALESCE(shadow_banned, FALSE) AS shadow_banned, COALESCE(approved, TRUE) AS approved FROM users WHERE name = ?
2023-04-11 18:35:04,964 - synapse.storage.SQL - 431 - DEBUG - replication-USER_IP-55 - [SQL values] {get_user_by_id-98} ('@testmensch_bob:example.org',)
2023-04-11 18:35:04,965 - synapse.storage.SQL - 452 - DEBUG - replication-USER_IP-55 - [SQL time] {get_user_by_id-98} 0.000373 sec
2023-04-11 18:35:04,965 - synapse.storage.txn - 814 - DEBUG - replication-USER_IP-55 - [TXN END] {get_user_by_id-98} 0.000890 sec
2023-04-11 18:35:04,965 - synapse.storage.SQL - 452 - DEBUG - POST-121 - [SQL time] {get_largest_public_rooms-97} 0.002291 sec
2023-04-11 18:35:04,966 - synapse.storage.txn - 814 - DEBUG - POST-121 - [TXN END] {get_largest_public_rooms-97} 0.002620 sec
2023-04-11 18:35:04,966 - synapse.storage.txn - 710 - DEBUG - POST-121 - [TXN START] {count_public_rooms-99}
2023-04-11 18:35:04,966 - synapse.storage.SQL - 426 - DEBUG - POST-121 - [SQL] {count_public_rooms-99} SELECT COUNT(*) FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0
2023-04-11 18:35:04,966 - synapse.storage.SQL - 431 - DEBUG - POST-121 - [SQL values] {count_public_rooms-99} []
2023-04-11 18:35:04,968 - synapse.storage.SQL - 452 - DEBUG - POST-121 - [SQL time] {count_public_rooms-99} 0.002087 sec
2023-04-11 18:35:04,969 - synapse.storage.txn - 814 - DEBUG - POST-121 - [TXN END] {count_public_rooms-99} 0.002664 sec
2023-04-11 18:35:04,970 - synapse.access.http.8008 - 460 - INFO - POST-121 - ::1 - 8008 - {@testmensch_bob:example.org} Processed request: 0.015sec/0.000sec (0.003sec, 0.001sec) (0.004sec/0.008sec/4) 257B 200 "POST /_matrix/client/v3/publicRooms HTTP/1.1" "curl/7.74.0" [0 dbevts]

# room title = Öach, generic_search_term = "Öach"
2023-04-11 18:47:46,742 - synapse.access.http.8008 - 415 - DEBUG - POST-356 - ::1 - 8008 - Received request: POST /_matrix/client/v3/publicRooms
2023-04-11 18:47:46,743 - synapse.storage.txn - 710 - DEBUG - POST-356 - [TXN START] {get_user_by_access_token-181}
2023-04-11 18:47:46,743 - synapse.storage.SQL - 426 - DEBUG - POST-356 - [SQL] {get_user_by_access_token-181} SELECT users.name as user_id, users.is_guest, users.shadow_bann
ed, access_tokens.id as token_id, access_tokens.device_id, access_tokens.valid_until_ms, access_tokens.user_id as token_owner, access_tokens.used as token_used FROM users IN
NER JOIN access_tokens on users.name = COALESCE(puppets_user_id, access_tokens.user_id) WHERE token = ?
2023-04-11 18:47:46,743 - synapse.storage.SQL - 431 - DEBUG - POST-356 - [SQL values] {get_user_by_access_token-181} ('xxx',)
2023-04-11 18:47:46,745 - synapse.storage.SQL - 452 - DEBUG - POST-356 - [SQL time] {get_user_by_access_token-181} 0.001587 sec
2023-04-11 18:47:46,745 - synapse.storage.txn - 814 - DEBUG - POST-356 - [TXN END] {get_user_by_access_token-181} 0.002667 sec
2023-04-11 18:47:46,747 - synapse.storage.txn - 710 - DEBUG - POST-356 - [TXN START] {mark_access_token_as_used-182}
2023-04-11 18:47:46,747 - synapse.storage.SQL - 426 - DEBUG - POST-356 - [SQL] {mark_access_token_as_used-182} UPDATE access_tokens SET used = ? WHERE id = ?
2023-04-11 18:47:46,747 - synapse.storage.SQL - 431 - DEBUG - POST-356 - [SQL values] {mark_access_token_as_used-182} [True, 4539]
2023-04-11 18:47:46,748 - synapse.storage.SQL - 452 - DEBUG - POST-356 - [SQL time] {mark_access_token_as_used-182} 0.000796 sec
2023-04-11 18:47:46,748 - synapse.storage.txn - 814 - DEBUG - POST-356 - [TXN END] {mark_access_token_as_used-182} 0.001326 sec
2023-04-11 18:47:46,749 - synapse.handlers.room_list - 90 - INFO - POST-356 - Getting public room list: limit=100, since=None, search=True, network=ThirdPartyInstanceID(appservice_id=None, network_id=None)
2023-04-11 18:47:46,749 - synapse.handlers.room_list - 101 - INFO - POST-356 - Bypassing cache as search request.
2023-04-11 18:47:46,750 - synapse.storage.txn - 710 - DEBUG - POST-356 - [TXN START] {get_largest_public_rooms-183}
2023-04-11 18:47:46,750 - synapse.storage.SQL - 426 - DEBUG - POST-356 - [SQL] {get_largest_public_rooms-183} SELECT room_id, name, topic, canonical_alias, joined_members, avatar, history_visibility, guest_access, join_rules, room_type FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0 AND ( LOWER(name) LIKE ? OR LOWER(topic) LIKE ? OR LOWER(canonical_alias) LIKE ? ) ORDER BY joined_members DESC, room_id DESC LIMIT ?
2023-04-11 18:47:46,750 - synapse.storage.SQL - 431 - DEBUG - POST-356 - [SQL values] {get_largest_public_rooms-183} ['%öach%', '%öach%', '%öach%', 101]
2023-04-11 18:47:46,752 - synapse.storage.txn - 710 - DEBUG - replication-USER_IP-28 - [TXN START] {get_user_by_id-184}
2023-04-11 18:47:46,752 - synapse.storage.SQL - 426 - DEBUG - replication-USER_IP-28 - [SQL] {get_user_by_id-184} SELECT name, password_hash, is_guest, admin, consent_version, consent_ts, consent_server_notice_sent, appservice_id, creation_ts, user_type, deactivated, COALESCE(shadow_banned, FALSE) AS shadow_banned, COALESCE(approved, TRUE) AS approved FROM users WHERE name = ?
2023-04-11 18:47:46,752 - synapse.storage.SQL - 431 - DEBUG - replication-USER_IP-28 - [SQL values] {get_user_by_id-184} ('@testmensch_bob:example.org',)
2023-04-11 18:47:46,752 - synapse.storage.SQL - 452 - DEBUG - replication-USER_IP-28 - [SQL time] {get_user_by_id-184} 0.000376 sec
2023-04-11 18:47:46,753 - synapse.storage.txn - 814 - DEBUG - replication-USER_IP-28 - [TXN END] {get_user_by_id-184} 0.000987 sec
2023-04-11 18:47:46,756 - synapse.storage.SQL - 452 - DEBUG - POST-356 - [SQL time] {get_largest_public_rooms-183} 0.005281 sec
2023-04-11 18:47:46,756 - synapse.storage.txn - 814 - DEBUG - POST-356 - [TXN END] {get_largest_public_rooms-183} 0.005886 sec
2023-04-11 18:47:46,757 - synapse.storage.txn - 710 - DEBUG - POST-356 - [TXN START] {count_public_rooms-185}
2023-04-11 18:47:46,757 - synapse.storage.SQL - 426 - DEBUG - POST-356 - [SQL] {count_public_rooms-185} SELECT COUNT(*) FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0
2023-04-11 18:47:46,757 - synapse.storage.SQL - 431 - DEBUG - POST-356 - [SQL values] {count_public_rooms-185} []
2023-04-11 18:47:46,758 - synapse.storage.SQL - 452 - DEBUG - POST-356 - [SQL time] {count_public_rooms-185} 0.001768 sec
2023-04-11 18:47:46,759 - synapse.storage.txn - 814 - DEBUG - POST-356 - [TXN END] {count_public_rooms-185} 0.002080 sec
2023-04-11 18:47:46,760 - synapse.access.http.8008 - 460 - INFO - POST-356 - ::1 - 8008 - {@testmensch_bob:example.org} Processed request: 0.017sec/0.001sec (0.003sec, 0.001sec) (0.002sec/0.012sec/4) 43B 200 "POST /_matrix/client/v3/publicRooms HTTP/1.1" "curl/7.74.0" [0 dbevts]

# room title = Öach, generic_search_term = ""
2023-04-11 18:48:15,041 - synapse.access.http.8008 - 415 - DEBUG - POST-472 - ::1 - 8008 - Received request: POST /_matrix/client/v3/publicRooms
2023-04-11 18:48:15,042 - synapse.handlers.room_list - 90 - INFO - POST-472 - Getting public room list: limit=100, since=None, search=True, network=ThirdPartyInstanceID(appservice_id=None, network_id=None)
2023-04-11 18:48:15,042 - synapse.handlers.room_list - 101 - INFO - POST-472 - Bypassing cache as search request.
2023-04-11 18:48:15,043 - synapse.storage.txn - 710 - DEBUG - POST-472 - [TXN START] {get_largest_public_rooms-1b9}
2023-04-11 18:48:15,043 - synapse.storage.SQL - 426 - DEBUG - POST-472 - [SQL] {get_largest_public_rooms-1b9} SELECT room_id, name, topic, canonical_alias, joined_members, avatar, history_visibility, guest_access, join_rules, room_type FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0 ORDER BY joined_members DESC, room_id DESC LIMIT ?
2023-04-11 18:48:15,043 - synapse.storage.SQL - 431 - DEBUG - POST-472 - [SQL values] {get_largest_public_rooms-1b9} [101]
2023-04-11 18:48:15,045 - synapse.storage.SQL - 452 - DEBUG - POST-472 - [SQL time] {get_largest_public_rooms-1b9} 0.001752 sec
2023-04-11 18:48:15,045 - synapse.storage.txn - 814 - DEBUG - POST-472 - [TXN END] {get_largest_public_rooms-1b9} 0.002537 sec
2023-04-11 18:48:15,046 - synapse.storage.txn - 710 - DEBUG - POST-472 - [TXN START] {count_public_rooms-1ba}
2023-04-11 18:48:15,046 - synapse.storage.SQL - 426 - DEBUG - POST-472 - [SQL] {count_public_rooms-1ba} SELECT COUNT(*) FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0
2023-04-11 18:48:15,046 - synapse.storage.SQL - 431 - DEBUG - POST-472 - [SQL values] {count_public_rooms-1ba} []
2023-04-11 18:48:15,049 - synapse.storage.SQL - 452 - DEBUG - POST-472 - [SQL time] {count_public_rooms-1ba} 0.002460 sec
2023-04-11 18:48:15,049 - synapse.storage.txn - 814 - DEBUG - POST-472 - [TXN END] {count_public_rooms-1ba} 0.002995 sec
2023-04-11 18:48:15,051 - synapse.access.http.8008 - 460 - INFO - POST-472 - ::1 - 8008 - {@testmensch_bob:example.org} Processed request: 0.009sec/0.001sec (0.000sec, 0.001sec) (0.001sec/0.006sec/2) 7839B 200 "POST /_matrix/client/v3/publicRooms HTTP/1.1" "curl/7.74.0" [0 dbevts]

Anything else that would be useful to know?

No response

subjugum commented 1 year ago

This seems to be an issue with Postgres LOWER() somehow not converting umlauts into lowercase properly (Ä -> ä, Ü -> ü, Ö -> ö).

I. e. SELECT room_id, name, topic, canonical_alias, joined_members, avatar, history_visibility, guest_access, join_rules, room_type FROM ( SELECT room_id FROM rooms WHERE is_public ) published INNER JOIN room_stats_state USING (room_id) INNER JOIN room_stats_current USING (room_id) WHERE ( join_rules = 'public' OR join_rules = 'knock' OR join_rules = 'knock_restricted' OR history_visibility = 'world_readable' ) AND joined_members > 0 AND ( LOWER(name) LIKE '%Öach%') ORDER BY joined_members DESC, room_id DESC LIMIT 101; room_id name topic canonical_alias joined_members avatar history_visibility guest_access join_rules room_type
!pRruVxBDfrjmlCdIyF:example.org Öach Thema #testraum:example.org 3 shared can_join public

Will work as expected. Note the uppercase umlaut. Lowercasing the search arguments, as Synapse does (https://github.com/matrix-org/synapse/blob/v1.80.0/synapse/storage/databases/main/room.py#L447-L460), will not return anything. The database was created as per Synapse docs (https://matrix-org.github.io/synapse/latest/postgres.html#set-up-database) and I'm assuming that setting --locale=C is the culprit here. Would simply setting de_DE.utf8 fix the issue and would anything break within Synapse?

reivilibre commented 1 year ago

Would simply setting de_DE.utf8 fix the issue and would anything break within Synapse?

It may fix this particular thing, but I'm afraid that we heavily discourage non-C locales in Synapse because non-C locales change sorting order in different C standard library versions, effectively rendering your database corrupt if you don't take care to reindex your database as soon as you upgrade C standard library.

I think it'd be fair to say that searching is a pain point in Synapse currently. It's likely that room search should use the same database-provided full-text search mechanism as the user directory and room message search do, but these are still not without flaws.

subjugum commented 1 year ago

Reading https://www.postgresql.org/docs/current/locale.html it seems LC_COLLATE (which affects ordering) and LC_CTYPE (which affects string classification, such as upper-/lowercasing non-ASCII) can be set separately. Reading further at https://www.postgresql.org/docs/current/collation.html it can be set per column. Would that help here and would it still be affected by libc upgrades?

reivilibre commented 1 year ago

If LC_CTYPE doesn't affect ordering at all, then it sounds like that may be a possible workaround. I can't confirm this from experience though, I've never tried.

But I will note, from your first link:

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

I guess there may be traps you need to watch out for.

subjugum commented 1 year ago

You can actually do something like LOWER(name collate "en_US.utf8") LIKE '%öach%' (as per second link; en-x-icu works, too) and it will return the right result.

This obviously requires the en_US.utf8 locale to be installed or libicu, which seems to be everywhere anyway.