Does not search Chinese, Japanese characters

Aspire1Inspire2 commented 5 years ago

[x] I have searched open and closed issues for duplicates
[x] I am submitting a bug report for existing functionality that does not work as intended
[x] I have read https://github.com/signalapp/Signal-Android/wiki/Submitting-useful-bug-reports
[x] This isn't a feature request or a discussion topic

Bug description

The new Signal search feature cannot search for individual Asian characters. It only matches whole sentences.

Steps to reproduce

Have some complete sentences made of Asian characters.
Search for an individual character.
The search returns no results.

Actual result:

Searching for the whole sentence returns the matching result.
Searching for a single character or a combination of characters returns no results.

Expected result: Search should return results for any combination of Asian characters. If the user searches for "word1 word2" in Asian characters, it should match any conversation where both word1 and word2 appear.

Device info

I believe this issue applies to all Signal apps, because search is built to find only Roman-script words delimited by spaces, periods, commas, etc.

Link to debug log

N/A, since this is an inherent limitation of the search feature implementation.

Aspire1Inspire2 commented 5 years ago

With the latest update, it still doesn't search Asian characters, except for the first combination of characters in a sentence. It still treats whatever lies between commas or periods as a single word. Asian languages do not form words this way.

Aspire1Inspire2 commented 5 years ago

Issue still persists.

Aspire1Inspire2 commented 5 years ago

Issue still persists.

Aspire1Inspire2 commented 3 years ago

Issue still persists to this date.

MewX commented 3 years ago

Also, English inside Chinese sentences can't be searched.

E.g. given 啦啦啦abc啦啦啦, you won't find "abc" when searching the message history.

o3661606 commented 3 years ago

Is it related to the FTS (full-text search) engine of SQLite?

Aspire1Inspire2 commented 3 years ago

Is it related to the FTS (full-text search) engine of SQLite?

Not sure what Signal uses as its backend database or how the search works; I have never read the source code.

But it seems to me that Signal only searches for words delimited by spaces. Asian languages do not delimit words with spaces; the human brain does that.

stale[bot] commented 2 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

MewX commented 2 years ago

Yes, it is still relevant. I also have the feeling that the blocker is the current token parsing logic, which does not handle languages that are not split by spaces.

stale[bot] commented 2 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

nathancchu commented 2 years ago

It's still relevant and not fixed yet

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

MewX commented 2 years ago

Still need a fix here.

cschanaj commented 2 years ago

My guess is that SQLCipher removed ICU support back in 2016 and the built-in SQLite tokenizers (both porter and simple) do not support CJK. Besides, it seems that fts5 does not support CJK unless it is built with ICU.
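One way to check which features a given SQLite build was compiled with is PRAGMA compile_options. The sketch below runs it against the SQLite library bundled with Python, which is only a stand-in here: it says nothing about the SQLCipher build that Signal ships. The same PRAGMA should be available in any SQLite-based library, but that is an assumption worth verifying on the actual build.

```python
import sqlite3


def sqlite_compile_options():
    """Return the compile-time options of the SQLite library Python is linked against."""
    conn = sqlite3.connect(":memory:")
    try:
        # Each row holds one option name, reported without the SQLITE_ prefix.
        return [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()


options = sqlite_compile_options()
print("SQLite version:", sqlite3.sqlite_version)
print("FTS5 compiled in:", "ENABLE_FTS5" in options)
print("ICU compiled in:", "ENABLE_ICU" in options)
```

The printed values depend on the local build, so no expected output is given.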

Dropping my research here in case someone is looking. A possible roadmap to fixing this could be:

[ ] Confirm SQLCipher tokenize=icu support, or use a custom build with ICU support (as part of SQLCipher Android Refresh). Other options are:
- use fts5 tokenize=icu or fts4 tokenize=unicode61
- ~use the workaround described in this post~
[ ] Tokenize search query with BreakIteratorCompat in SearchDatabase.java
[ ] Create new search database with fts4 tokenize=icu (if fts5 does not support CJK)
[ ] Migrate existing virtual tables in SignalDatabaseMigrations.kt
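To illustrate what tokenizing both the stored text and the search query could buy, here is a rough sketch that swaps in a much simpler technique than ICU word breaking: every CJK character becomes its own token, and a query matches when its tokens appear consecutively in the message. The character ranges, the function names, and the matching rule are all assumptions made for this sketch; it is not BreakIteratorCompat and not how SQLite or Signal behave today.

```python
import re

# Rough CJK ranges (an assumption for this sketch): kana, CJK ideographs, Hangul syllables.
CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"
# One token per CJK character; other letters and digits are grouped into runs.
TOKEN = re.compile(f"[{CJK}]|[^\\W{CJK}]+")


def tokenize(text: str) -> list:
    """Split text into single CJK characters and runs of other word characters."""
    return [t.lower() for t in TOKEN.findall(text)]


def contains_phrase(text: str, query: str) -> bool:
    """True if the query's tokens occur consecutively among the text's tokens."""
    doc, q = tokenize(text), tokenize(query)
    if not q:
        return False
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))


print(tokenize("啦啦啦abc啦啦啦"))
print(contains_phrase("人人生而自由，在尊严和权利上一律平等。", "权利"))
print(contains_phrase("啦啦啦abc啦啦啦", "abc"))
```

The trade-off of per-character tokens is a larger index and matches that ignore word boundaries, which is why a real word breaker might still be preferable.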

IMHO, search support for CJK (non-ASCII) characters is critical to Signal as a secure yet privacy-preserving alternative to other messaging platforms. Not sure why a fix has been stalled for so many years.

cschanaj commented 2 years ago

@signalapp full-text search does not work as users might expect for CJK (non-ASCII) characters. This significantly weakens the incentive for non-Westerners to move away from alternatives that are potentially less secure or that violate user privacy. Is there some resource constraint or technical reason this issue is marked as wontfix?

logico-philosophical commented 1 year ago

A small example should help:

(English) All human beings are born free and equal in dignity and rights.
(Simplified Chinese) 人人生而自由，在尊严和权利上一律平等。
(Japanese) すべての人間は、生れながらにして自由であり、かつ、尊厳と権利とについて平等である。
(Korean) 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.

Search for rights/权利/権利/권리 in each sentence. The search doesn't work for the Simplified Chinese and Japanese versions. Although the search does work for the Korean version, this can be a problem for Korean too, because Koreans may omit a lot of spaces when sending informal messages.

It seems that the search fails because the full-text search functionality of the SQLite FTS5 table can't find something in the middle of a text unless it is preceded by a delimiter such as a space. For example, if you search instead for 尊厳 in the Japanese text above, the search does work, because 尊厳 is preceded by the delimiter 、. Since Chinese and Japanese rarely use spaces, most of the time the search doesn't work for these languages.
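The behaviour can be reproduced with a simplified model, assuming that text is split into tokens at spaces and punctuation and that a search term only matches the beginning of a token. The regex-based tokenizer and the function names below are assumptions for illustration; they approximate delimiter-based tokenization and are not SQLite's actual tokenizer.

```python
import re


def tokenize(text: str) -> list:
    """Split on anything that is not a letter, digit, or underscore."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def prefix_match(text: str, term: str) -> bool:
    """True if any token of the text starts with the term, like a prefix query."""
    term = term.lower()
    return any(tok.startswith(term) for tok in tokenize(text))


samples = [
    ("All human beings are born free and equal in dignity and rights.", "rights"),
    ("人人生而自由，在尊严和权利上一律平等。", "权利"),
    ("すべての人間は、生れながらにして自由であり、かつ、尊厳と権利とについて平等である。", "権利"),
    ("すべての人間は、生れながらにして自由であり、かつ、尊厳と権利とについて平等である。", "尊厳"),
    ("모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.", "권리"),
]

for text, term in samples:
    print(term, "found" if prefix_match(text, term) else "not found")
```

In this model the Chinese and Japanese sentences collapse into a handful of long tokens, so only terms that happen to start one of those tokens can be found.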

The following pieces of code seem to be responsible for this behavior:

https://github.com/signalapp/Signal-Android/blob/3869de414fa882bf4b53473f45e22cf6b3a2e3d8/app/src/main/java/org/thoughtcrime/securesms/database/SearchTable.kt#L55-L76

Note that SQLite's MATCH operator is being used to perform the search.

https://github.com/signalapp/Signal-Android/blob/3869de414fa882bf4b53473f45e22cf6b3a2e3d8/app/src/main/java/org/thoughtcrime/securesms/database/SearchTable.kt#L151-L165

The createFullTextSearchQuery function transforms space-separated search queries like 尊厳 権利 into "尊厳"* "権利"*, which in turn produces SQL queries like SELECT ... WHERE $MMS_FTS_TABLE_NAME MATCH '"尊厳"* "権利"*' AND ....
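A minimal sketch of that transformation, assuming the query is split on spaces, each term is wrapped in double quotes (with embedded quotes doubled), and a trailing * is appended for prefix matching. The function below is a hypothetical Python stand-in written for this thread, not the Kotlin code linked above.

```python
def create_full_text_search_query(query: str) -> str:
    """Quote each space-separated term and append * so it is matched as a prefix."""
    terms = [t.strip() for t in query.split(" ")]
    # Double any embedded quotes so the term stays a single quoted string.
    quoted = ['"' + t.replace('"', '""') + '"*' for t in terms if t]
    return " ".join(quoted)


print(create_full_text_search_query("尊厳 権利"))  # "尊厳"* "権利"*
```

Each resulting term is still a prefix match against whole tokens, so it only matches where a token begins with it.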

keikhcheung commented 8 months ago

The problem still exists on Android version 6.42.3. I have noticed that if the word is at the start of a sentence, right after a space, or right after a punctuation mark (full-width or half-width), the word can be found.

If the word is in the middle of a sentence without preceding spaces or punctuation marks, the word won't be returned as a matching result.

This confirms what @logico-philosophical wrote above on 9 Jan 2023.

This same issue also exists on iOS (version 6.54). On Signal Desktop for Mac (version 6.42.1), it is working without any problem, though.
