Open Aspire1Inspire2 opened 5 years ago
With the latest update, it still doesn't search Asian characters, only the first combination of characters in a sentence. It still treats whatever is between commas or periods as a single word. Asian text is not formed this way.
The issue still persists.
The issue still persists to this date.
Also, English inside Chinese sentences can't be searched.
E.g. 啦啦啦abc啦啦啦: you won't find "abc" in the search history.
Is it related to the FTS (full-text search) engine of SQLite?
Not sure what Signal uses as its backend database or how the search works; I have never read any of the source code.
But it seems to me that Signal only searches words delimited by spaces. Asian languages do not delimit words by spaces; word boundaries are worked out by the human brain.
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Yes, it is still relevant. I have the same feeling that the blocker is the current token-parsing logic, which does not handle languages that are not split by spaces.
It's still relevant and not fixed yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still need a fix here.
My guess is that SQLCipher removed ICU support back in 2016 and the built-in SQLite tokenizers (both porter and simple) do not support CJK. Besides, it seems that fts5 does not support CJK unless it is built with ICU.
Dropping my research here in case someone is looking. A possible roadmap to fixing this could be:
tokenize=icu support, or use a custom build with ICU support (as part of SQLCipher Android Refresh; other options are
fts5 tokenize=icu or fts4 tokenize=unicode61
BreakIteratorCompat in SearchDatabase.java
SignalDatabaseMigrations.kt
IMHO, search support for CJK (non-ASCII) characters is critical to Signal as a secure yet privacy-preserving alternative to other messaging platforms. Not sure why a fix has been stalled for so many years.
@signalapp full-text search is not working as users might expect for CJK (non-ASCII) characters. This significantly reduces the incentive for non-Westerners to move away from alternatives that are potentially less secure or that violate user privacy. Is there some resource constraint or technical reason this issue is marked as wont fix?
A small example should help:
Search for rights/权利/権利/권리 in each sentence. The search doesn't work for the Simplified Chinese and Japanese versions. Although the search does work for the Korean version, this can be a problem for Korean too, because Koreans may omit a lot of spaces when sending informal messages.
It seems that the search fails because the full-text search of the SQLite FTS5 table can't find a term in the middle of a text unless it is preceded by a delimiter such as a space. For example, if you search instead for 尊厳 in the Japanese text above, the search does work, because 尊厳 is preceded by the delimiter 、. Since Chinese and Japanese rarely use spaces, most of the time the search doesn't work for these languages.
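To make the failure mode concrete, here is a rough Python model of it. This is only a sketch under assumptions: it is not SQLite's tokenizer, the names tokenize and prefix_match are made up for illustration, and the Japanese sample sentence is one I picked myself. It treats every run of letters and digits as a single token and treats a query as a prefix match against those tokens.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into runs of letters/digits; any other character is a delimiter.

    Simplified stand-in for a delimiter-based FTS tokenizer (hypothetical helper).
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def prefix_match(text: str, term: str) -> bool:
    """Mimic a prefix query such as "term"*: some token must START with term."""
    return any(token.startswith(term) for token in tokenize(text))


# Sample sentence chosen for illustration only (an assumption, not from the thread).
sentence = "すべての人間は、生れながらにして自由であり、かつ、尊厳と権利とについて平等である。"

print(tokenize(sentence))
print(prefix_match(sentence, "尊厳"))  # True: 尊厳 directly follows the delimiter 、
print(prefix_match(sentence, "権利"))  # False: 権利 sits in the middle of a token
print(prefix_match("啦啦啦abc啦啦啦", "abc"))  # False: the whole string is one token
```

In this model the whole clause 尊厳と権利とについて平等である is one token, so only a query that matches its beginning can succeed.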
The following pieces of code seem to be responsible for this behavior:
Note that SQLite's MATCH operator is being used to perform the search.
The createFullTextSearchQuery function transforms search queries like 尊厳 権利 into "尊厳"* "権利"*, which in turn produces SQL queries like SELECT ... WHERE $MMS_FTS_TABLE_NAME MATCH '"尊厳"* "権利"*' AND ....
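For readers who don't want to dig through the Android code, the transformation described above can be sketched in a few lines of Python. This is not Signal's actual implementation, only a reconstruction of the behavior as described; the function name is a Python-style rendering of createFullTextSearchQuery, and the doubling of embedded double quotes is my own assumption about how such a query would be escaped.

```python
def create_full_text_search_query(user_input: str) -> str:
    """Turn a whitespace-separated search string into quoted prefix terms.

    Sketch only: each term becomes "term"*, and the terms are joined by spaces.
    """
    terms = user_input.split()
    # Assumption: embedded double quotes are escaped by doubling them.
    quoted = ['"' + term.replace('"', '""') + '"*' for term in terms]
    return " ".join(quoted)


print(create_full_text_search_query("尊厳 権利"))  # "尊厳"* "権利"*
```

Every term ends up as a prefix query, which is why a term can only match at the start of a token and never in the middle of one.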
The problem still exists on Android version 6.42.3. I have noticed that if the word is at the start of a sentence, right after a space, or right after a punctuation mark (full-width or half-width), the word can be found.
If the word is in the middle of a sentence without preceding spaces or punctuation marks, the word won't be returned as a matching result.
This confirms what @logico-philosophical wrote above on 9 Jan 2023.
This same issue also exists on iOS (version 6.54). On Signal Desktop for Mac (version 6.42.1), it is working without any problem, though.
Bug description
The new Signal search feature does not search Asian characters. It only searches by sentence.
Steps to reproduce
Actual result:
Expected result: Search should return results for any combination of Asian characters. If the user searches "word1 word2" in Asian characters, it is better to match any conversation where both word1 and word2 appear.
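The expected matching rule can be written down as a tiny Python predicate. This is only a sketch of the semantics being asked for, not a proposal for how the index should work (a real fix would need tokenizer support rather than a linear scan); the name matches_all_terms is made up for illustration.

```python
def matches_all_terms(message: str, query: str) -> bool:
    """Return True when every whitespace-separated query term occurs
    anywhere in the message, regardless of word boundaries."""
    text = message.casefold()
    terms = query.casefold().split()
    return bool(terms) and all(term in text for term in terms)


print(matches_all_terms("啦啦啦abc啦啦啦", "abc"))      # True
print(matches_all_terms("啦啦啦abc啦啦啦", "abc 啦啦"))  # True
print(matches_all_terms("啦啦啦abc啦啦啦", "abc xyz"))  # False
```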
Device info
I believe this issue is common to all Signal apps, because the search is built to find only Roman-alphabet words delimited by spaces, periods, commas, etc.
Link to debug log
NA, since it is an inherent limitation of the search feature implementation.