tdlib / td

Cross-platform library for building Telegram clients
https://core.telegram.org/tdlib
Boost Software License 1.0

Searching does not work properly for CJK ideographs #1004

Open nevikw39 opened 4 years ago

nevikw39 commented 4 years ago

Hello,

Telegram's search works poorly for Chinese, Japanese, and Korean (CJK) ideographs, which makes it hard to promote Telegram in Taiwan.

I tried to find the cause. I looked at MessagesDb.cpp and found that Telegram uses SQLite to store messages and the FTS5 module to build a search table.

And that is the point. FTS5 splits a string into tokens and puts them into a hash table. Suppose the text is "Telegram search". Only "Telegram" and "search" would match it, whereas "Tele" or "a" would return no result. Unfortunately, Chinese characters are all categorized as "Letter", so they are treated as token characters. Hence a whole Chinese text such as "我好想要中文搜尋", which consists of consecutive Chinese characters without any delimiter, is viewed as a single token. That is, none of "想要", "中文", or "搜尋" would match it.
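To illustrate the behaviour described above, here is a minimal sketch in Python. It is not FTS5 itself: `tokenize` and `matches` are hypothetical helpers that only model the rule stated in this comment (runs of Unicode letters and digits form one token, everything else is a delimiter), under the assumption that a plain query must equal a whole token.

```python
import unicodedata


def tokenize(text):
    """Split text into lowercased runs of letters/digits.

    Simplified model of the tokenizing rule described above,
    not the real FTS5 tokenizer.
    """
    tokens, current = [], []
    for ch in text:
        # Unicode general categories starting with "L" (letter) or "N" (number)
        # are treated as token characters; CJK ideographs are category "Lo".
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def matches(query, text):
    """A plain query matches only if it equals a whole token of the text."""
    return query.lower() in tokenize(text)


print(tokenize("Telegram search"))
print(tokenize("我好想要中文搜尋"))
print(matches("search", "Telegram search"), matches("中文", "我好想要中文搜尋"))
```

Under this model the English text yields two tokens, while the Chinese text yields exactly one, so a query for any part of it finds nothing.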

I have two ideas. The simple one is to insert an invisible separator such as '\a' between every pair of adjacent Chinese characters. The other is to implement a custom tokenizer.
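The first idea could be sketched roughly as follows, in Python for brevity. Everything here is an assumption for illustration: `separate_cjk`, `tokens`, and `phrase_match` are hypothetical helpers, the two code-point ranges cover only the CJK Unified Ideographs block and Extension A, and the query is assumed to be preprocessed the same way as the stored text and then matched as a phrase of consecutive tokens.

```python
import re

# Assumed ideograph ranges: Extension A and the main CJK Unified Ideographs block.
_CJK = r"[\u3400-\u4dbf\u4e00-\u9fff]"
# Zero-width position between two adjacent ideographs.
_BOUNDARY = re.compile(rf"(?<={_CJK})(?={_CJK})")


def separate_cjk(text, sep="\a"):
    """Insert sep between every pair of adjacent CJK ideographs."""
    return _BOUNDARY.sub(sep, text)


def tokens(text):
    """Tokenize after separation; each ideograph becomes its own token."""
    return re.findall(r"\w+", separate_cjk(text).lower())


def phrase_match(query, text):
    """True if the query tokens occur consecutively in the text tokens."""
    q, t = tokens(query), tokens(text)
    return bool(q) and any(
        t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1)
    )


print(tokens("我好想要中文搜尋"))
print(phrase_match("中文", "我好想要中文搜尋"))
print(phrase_match("Tele", "Telegram search"))
```

With the separator in place, each ideograph is indexed as its own token, so "中文" matches as a two-token phrase, while non-CJK text keeps its whole-word behaviour.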

Nonetheless, I can hardly understand how MessagesDb.cpp works. I don't actually know how Telegram performs search tasks or how _searchid is generated.

So, how can we solve this problem? I would like to contribute to Telegram.

Thanks.

levlam commented 4 years ago

You have found the client-side search, which is enabled only for secret chat messages. The best way to improve it is to contribute directly to SQLite's FTS extension. Search for messages in all other chats is done server-side, so there is no way to improve it on TDLib's side.

nevikw39 commented 4 years ago

OK, I see.

So, there is no way to look at Telegram's server-side code?

levlam commented 4 years ago

No.

kouhe3 commented 4 years ago

Telegram's search is based on "words", and words are delimited by punctuation or spaces. This is an English-oriented search method, which is very convenient for English. For example, "hello" cannot be found by searching for "he"; the full word "hello" must be used. This fits the English context: when I want to find messages containing "he", I don't want to see messages containing "hello". But this approach is not convenient for Chinese and other languages. Chinese is based on Chinese character

https://congcong0806.github.io/2019/11/04/TelegramSearch/

nathancchu commented 2 years ago

Any updates on this? Not being able to search CJK characters effectively is a huge pain when using Telegram.

tylvn commented 2 years ago

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or because it is technically hard to achieve?

githubhjs commented 1 year ago

> This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or because it is technically hard to achieve?

Actually, I have found lots of CJK users on Telegram. But the search issue is limiting the growth of that number.

devuterian commented 7 months ago

Still waiting for a fix... this is important.

tonytonyjan commented 3 months ago

I was eager to have this feature for a long time.

I have since switched to Discord with my friends.