Open subjugum opened 1 year ago
This seems to be an issue with Postgres LOWER() somehow not converting umlauts into lowercase properly (Ä -> ä, Ü -> ü, Ö -> ö).
For example:
SELECT room_id, name, topic, canonical_alias, joined_members, avatar,
       history_visibility, guest_access, join_rules, room_type
FROM (SELECT room_id FROM rooms WHERE is_public) published
INNER JOIN room_stats_state USING (room_id)
INNER JOIN room_stats_current USING (room_id)
WHERE (
        join_rules = 'public'
        OR join_rules = 'knock'
        OR join_rules = 'knock_restricted'
        OR history_visibility = 'world_readable'
      )
  AND joined_members > 0
  AND (LOWER(name) LIKE '%Öach%')
ORDER BY joined_members DESC, room_id DESC
LIMIT 101;
room_id | name | topic | canonical_alias | joined_members | avatar | history_visibility | guest_access | join_rules | room_type |
---|---|---|---|---|---|---|---|---|---|
!pRruVxBDfrjmlCdIyF:example.org | Öach | Thema | #testraum:example.org | 3 | | shared | can_join | public | |
This returns the room as expected; note the uppercase umlaut in the search pattern. Lowercasing the search arguments, as Synapse does (https://github.com/matrix-org/synapse/blob/v1.80.0/synapse/storage/databases/main/room.py#L447-L460), returns nothing. The database was created as per the Synapse docs (https://matrix-org.github.io/synapse/latest/postgres.html#set-up-database), and I'm assuming that setting --locale=C is the culprit here. Would simply setting de_DE.utf8 fix the issue, and would anything break within Synapse?
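To make the suspected mismatch concrete, here is a minimal Python sketch. It assumes that LOWER() under LC_CTYPE=C folds only the ASCII letters A-Z, and that the search term is lowercased in Python before being sent to the database. The helper names `c_locale_lower` and `room_matches` are hypothetical and are not part of Synapse or PostgreSQL; LIKE wildcards inside the term are ignored for simplicity.

```python
import string

# Precomposed characters, written as escapes so the example does not depend
# on how this text happens to be encoded or normalized.
ROOM_NAME = "\u00d6ach"    # "Öach"
OTHER_ROOM = "Apf\u00f6l"  # "Apföl"

# Translation table that lowercases A-Z only.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def c_locale_lower(text: str) -> str:
    """Approximate LOWER() under LC_CTYPE=C (assumption: ASCII-only folding)."""
    return text.translate(_ASCII_LOWER)


def room_matches(name: str, search_term: str) -> bool:
    """Mimic `LOWER(name) LIKE '%term%'` with a Unicode-lowercased term."""
    pattern = search_term.lower()           # Python folds Ö to ö
    return pattern in c_locale_lower(name)  # the simulated database side does not


print(room_matches(ROOM_NAME, ROOM_NAME))    # False
print(room_matches(OTHER_ROOM, OTHER_ROOM))  # True
```

Under these assumptions the stored name keeps its uppercase Ö while the pattern contains ö, so the comparison fails only when the umlaut is uppercase in the room name.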
> Would simply setting de_DE.utf8 fix the issue and would anything break within Synapse?
It may fix this particular thing, but I'm afraid that we heavily discourage non-C locales in Synapse. Their sorting order can change between versions of the C standard library, which effectively corrupts your database unless you reindex it as soon as you upgrade the C standard library.
I think it'd be fair to say that searching is a pain point in Synapse currently. It's likely that room search should use the same database-provided full-text search mechanism as the user directory and room message search do, but these are still not without flaws.
Reading https://www.postgresql.org/docs/current/locale.html, it seems LC_COLLATE (which affects ordering) and LC_CTYPE (which affects character classification, such as upper-/lowercasing non-ASCII characters) can be set separately. Reading further at https://www.postgresql.org/docs/current/collation.html, the collation can also be set per column. Would that help here, and would it still be affected by libc upgrades?
If LC_CTYPE doesn't affect ordering at all, then it sounds like that may be a possible workaround. I can't confirm this from experience though; I've never tried.
But I will note, from your first link:
> The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
I guess there may be traps you need to watch out for.
You can actually do something like LOWER(name collate "en_US.utf8") LIKE '%öach%' (as per the second link; en-x-icu works, too) and it will return the right result.
This obviously requires either the en_US.utf8 locale to be installed or libicu, which seems to be available everywhere anyway.
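As a rough illustration of how that per-expression collation could be wired into a query builder, here is a Python sketch. `build_name_filter` is a hypothetical helper, not Synapse code; the `%s` placeholder assumes a psycopg2-style parameter format, and the default collation `en-x-icu` is taken from the comment above and only works if the PostgreSQL installation provides it.

```python
import re
from typing import List, Tuple

# Collation named in the comment above; availability depends on the locales
# and ICU support of the PostgreSQL installation (assumption).
DEFAULT_COLLATION = "en-x-icu"

# Conservative pattern for collation names we are willing to put into SQL.
_SAFE_COLLATION = re.compile(r"[A-Za-z0-9_.\-]+")


def build_name_filter(
    search_term: str, collation: str = DEFAULT_COLLATION
) -> Tuple[str, List[str]]:
    """Return a WHERE fragment and its arguments (hypothetical helper)."""
    if not _SAFE_COLLATION.fullmatch(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    # The collation is a quoted identifier and cannot be a bound parameter,
    # which is why it is validated and interpolated instead.
    clause = f'LOWER(name COLLATE "{collation}") LIKE %s'
    # The term is lowercased client-side and passed as a bound parameter.
    return clause, [f"%{search_term.lower()}%"]


clause, args = build_name_filter("\u00d6ach")  # "Öach"
print(clause)  # LOWER(name COLLATE "en-x-icu") LIKE %s
print(args)
```

Whether this interacts well with indexes or with the C-locale recommendation discussed earlier is not something the sketch addresses.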
Description
When searching for a room in the public room directory via Element, the room does not appear when its title is entered as the search term. Manually scrolling down without entering any search terms will show the room, however. From my limited testing, this only seems to affect rooms with uppercase umlaut characters (ÄÖÜ) in their titles. Room titles with said characters in lowercase anywhere else will be found as expected, see examples below.
Steps to reproduce
In Element
Results are the same for requests sent to either the main process or generic_workers. Examples with curl:
Room title is Apföl
Then change the room title to Öach
However, removing the generic_search_term will give a hit
Homeserver
Local test install
Synapse Version
1.80.0
Installation Method
Debian packages from packages.matrix.org
Database
PostgreSQL (13.9-0+deb11u1, same host), fresh install
Workers
Multiple workers
Platform
VM, Debian 11
Configuration
No response
Relevant log output
Anything else that would be useful to know?
No response