Japanese, Chinese and other non-ASCII searching not quite working

RickCogley commented 8 years ago

Hi - on a synapse homeserver running on postgresql 9.4, and connecting via vector, I'm having trouble searching Japanese.

If I enter:

バッタと鈴虫

(grasshopper and cricket), I get a search hit when I search for バッタ but not when I search for 鈴虫.

It appears that the beginning of a post is OK, but anything in the middle of the post is not searchable. I tried this on my own homeserver and on the main matrix homeserver, with the same result. Kent on the main forum also reproduced it.

Is there a possibility to search using regex?
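
A rough illustration of the symptom above (a hit at the start of a message, no hit in the middle). It assumes, without confirmation from this thread, that the message is split into tokens on whitespace only and that a query matches when it is a prefix of a token. The function names are made up for the sketch.

```python
def tokenize_on_whitespace(text: str) -> list[str]:
    # Assumption: tokens are delimited by whitespace only, so an
    # unspaced Japanese sentence becomes one single long token.
    return text.split()


def prefix_search(query: str, message: str) -> bool:
    # Assumption: a query matches if it is a prefix of any token.
    return any(token.startswith(query) for token in tokenize_on_whitespace(message))


message = "バッタと鈴虫"
print(tokenize_on_whitespace(message))   # a single token: the whole message
print(prefix_search("バッタ", message))   # True: the query starts the only token
print(prefix_search("鈴虫", message))     # False: the query sits mid-token
```

Under these assumptions the behaviour matches what was reported: only text at the start of an unspaced run is findable.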

RickCogley commented 8 years ago

I don't know much about this, but this seems to be an extension people use to get full-text search on Japanese: http://pgbigm.osdn.jp/pg_bigm_en-1-1.html

RickCogley commented 8 years ago

Someone on the mattermost forum mentions how to do it for postgresql. https://github.com/mattermost/platform/issues/2159

erikjohnston commented 8 years ago

The idea is very much that the search API will accept a "locale" option in the future that handles searching more intelligently in different languages. Unfortunately, this really requires the server knowing which languages are going to be used up front so it can apply the correct indices.

> Is there a possibility to search using regex?

Alas not, as it wouldn't be possible to create any indices that would allow it.

RickCogley commented 8 years ago

Hi @erikjohnston, meanwhile, is there any way I can make it work for my situation, where I need to index both Japanese and English? --Rick

erikjohnston commented 8 years ago

I can't think of anything quick that will allow both Japanese and English. Supporting more than one language at a time will require a bit of dev work.

RickCogley commented 8 years ago

Japanese has English interspersed in many cases, so I wonder if just supporting Japanese would also get us the English support. Any good interim ideas on how to do just Japanese, then? --Rick

erikjohnston commented 8 years ago

Installing Japanese into postgres and then changing all instances of to_tsvector('english', ...) and to_tsquery('english', ...) in synapse/storage/{search, room}.py to point to the Japanese configuration should do it. (Though you may need to change any existing data in the event_search.vector column from 'english' to 'japanese' somehow)
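
A minimal sketch of what that change amounts to: making the text search configuration name a parameter instead of the hard-coded 'english'. This is not Synapse's actual code. The helper name and the `event_id` column are assumptions for illustration; `event_search.vector` is the column named above, and a 'japanese' configuration has to exist in the database for the query to run.

```python
def build_search_query(ts_config: str) -> str:
    """Return a parameterised full-text query for the given configuration.

    Hypothetical helper, not taken from synapse/storage/search.py.
    """
    # The configuration name is interpolated into the SQL text, so only
    # accept a plain identifier; the search term itself stays a bind parameter.
    if not ts_config.isidentifier():
        raise ValueError(f"unexpected text search configuration: {ts_config!r}")
    return (
        "SELECT event_id FROM event_search "
        f"WHERE vector @@ to_tsquery('{ts_config}', %s)"
    )


print(build_search_query("english"))
print(build_search_query("japanese"))
```

Rows already indexed with the 'english' configuration would still need their vectors regenerated, as noted above.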

RickCogley commented 8 years ago

Ok, so this would basically involve making a customized synapse, correct?

Also - Are there any advanced search operators?

RickCogley commented 8 years ago

Additionally, does sqlite have any inherent advantages in terms of this sort of problem?

RickCogley commented 8 years ago

@erikjohnston a couple more comments and points -

Thinking of matrix.org as a distributed system, where rooms exist across multiple servers, getting people who use CJK languages (which have no spaces between words to delimit search tokens) to adopt it will be a challenge if search does not "just work". I am also wondering whether this approach is practical, or whether it will mess things up when rooms containing various languages are federated to other servers that don't have the special Synapse or PostgreSQL setup.

My use case is largely inviting my employees or clients to the rooms on my homeserver and having them connect via vector or other clients to that homeserver, which would be set up with the appropriate settings and indexes.

For others doing more federation, it could be a challenge.

Just a thought.

RickCogley commented 8 years ago

Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that?

Sincerely, Rick

erikjohnston commented 8 years ago

> Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that?

Those instructions probably work, but I haven't actually ever tried installing additional languages myself; I just know it's possible :) If you do manage it, it would be great if you could share some of the details!

RickCogley commented 8 years ago

Ok, of course.

Just got a server to run this locally, since VPSs are generally a bit space constrained for an organization expecting a lot of usage / busy rooms. Most of the entries will be in Japanese so I need to make the search work.

dkastl commented 7 years ago

I have the same issue with Japanese full text search, and it's a serious problem. I have learned about PGroonga (http://pgroonga.github.io/), which looks like a good solution.

However, I'm not sure whether this would be a practical solution here, or where I would have to make changes to make it work. Ideally, PGroonga would be used whenever it is available.

proletarius101 commented 3 years ago

It shouldn't be very complex (you don't need separate solutions for C, J, and K). Possible solutions could be:

solr: https://wiki.harvard.edu/confluence/display/LibraryStaffDoc/CJK+Full+Text+Search
MeiliSearch: https://www.meilisearch.com/ Elasticsearch-style, whereas Elasticsearch itself doesn't have built-in CJK search support
manticoresearch: requires the least configuration and supports mysql and pgsql

There is also what mattermost does, in a more complicated way: https://docs.mattermost.com/install/i18n.html

Another way is to go with database-wise solutions:

mysql: ?
MariaDB: https://mariadb.com/kb/en/about-mroonga/
postgres: ?

BTW, I use CJK languages, so I'm able to help. P.S. This is still within the scope of Unicode for sure; it's just a non-Latin search problem.

BLumia commented 1 year ago

Can confirm we still have an issue searching CJK messages. I think we can remove "non-unicode" from the issue title, since it is Unicode; it's a non-Latin search problem.

luixxiul commented 1 year ago

The issue can be reproduced with Arabic, Hebrew (with symbols), and Hindi too.

panda2134 commented 1 year ago

This issue is still relevant. By the way, I don't think this is a minor issue, since the chat history searching functionality is essentially broken in every room using CJK languages. Many matrix rooms using CJK languages belong to the open-source community. Maybe I'm exaggerating, but having no way to search for historical messages makes matrix no better than mailing lists for CJK users, because even with a mailing list you can search with CJK characters (using grep).

panda2134 commented 1 year ago

I think implementing the full-text search with Postgres is not suitable for non-Latin languages. We should use things like Apache Solr instead.

However, if we insist on using Postgres for full-text search, then at least we can replace calls to to_tsvector with vectors pre-calculated in Python, using multilingual tokenizing libraries. Database migration might be required in this case.
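
A minimal sketch of the pre-calculation idea, using only the Python standard library. It swaps in plain character bigrams (the approach pg_bigm is named after) in place of a real multilingual tokenizing library, and the function names are hypothetical. How the resulting terms would be stored in Postgres is not shown.

```python
def char_bigrams(text: str) -> list[str]:
    # Overlapping two-character windows; no word boundaries are needed,
    # which is why this works for unspaced CJK text.
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def index_terms(message: str) -> set[str]:
    # Terms that would be computed in Python and stored for the message.
    return set(char_bigrams(message))


def may_match(query: str, message: str) -> bool:
    # Candidate filter: every bigram of the query must occur in the message.
    # This can produce false positives, so a real implementation would
    # re-check candidates with a substring test.
    return set(char_bigrams(query)) <= index_terms(message)


message = "バッタと鈴虫"
print(sorted(index_terms(message)))
print(may_match("バッタ", message))  # True
print(may_match("鈴虫", message))    # True: mid-message text is now findable
print(may_match("虫鈴", message))    # False
```

One limitation of this sketch: a single-character query never matches, because the stored terms are all two characters long.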

bkil commented 1 year ago

Could we elevate the occurrence from O-Occasional to a higher level, considering that it should be impacting the majority of the world population?

https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers

matrix-org / synapse

Japanese, Chinese and other non-ASCII searching not quite working #901