Open RickCogley opened 8 years ago
I don't know much about this, but this seems to be an extension people use to get full-text search for Japanese: http://pgbigm.osdn.jp/pg_bigm_en-1-1.html
Someone on the Mattermost forum mentions how to do it for PostgreSQL: https://github.com/mattermost/platform/issues/2159
The idea is very much that the search API will accept a "locale" option in the future that handles searching more intelligently in different languages. Unfortunately, this really requires the server knowing which languages are going to be used up front so it can apply the correct indices.
Is there a possibility to search using regex?
Alas not, as it wouldn't be possible to create any indices that would allow it.
hi @erikjohnston meanwhile, any way I can make it work for my situation - needing to index Japanese and English? --Rick
I can't think of anything quick that will allow both Japanese and English. Supporting more than one language at a time will require a bit of dev work.
Japanese has English interspersed in many cases, so I wonder if just supporting Japanese would also get us the English support. Any good interim ideas on how to do just Japanese, then? --Rick
Installing Japanese into postgres and then changing all instances of to_tsvector('english', ...) and to_tsquery('english', ...) in synapse/storage/{search, room}.py to point to the Japanese configuration should do it. (Though you may need to change any existing data in the event_search.vector column from 'english' to 'japanese' somehow.)
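A rough sketch of what that change could look like if the configuration name were a parameter instead of a hard-coded 'english'. This is not Synapse's actual code: build_search_query and SEARCH_CONFIG are hypothetical names, the event_id column is an assumption, and 'japanese' only works if a text search configuration with that name has actually been installed.

```python
# Hypothetical setting: name of the PostgreSQL text search configuration.
# Assumption: a configuration called "japanese" has been installed separately.
SEARCH_CONFIG = "japanese"


def build_search_query(config, term):
    """Return (sql, params) for a full-text lookup on event_search.vector.

    The configuration name is passed as a bound parameter rather than being
    hard-coded as 'english' inside the SQL string.
    """
    # Assumption: event_search has an event_id column next to vector.
    sql = "SELECT event_id FROM event_search WHERE vector @@ to_tsquery(%s, %s)"
    return sql, (config, term)


sql, params = build_search_query(SEARCH_CONFIG, "鈴虫")
print(sql)
print(params)
```

The pair can be handed to a DB-API cursor's execute(sql, params). Rows indexed earlier with the 'english' configuration would still need their vectors regenerated.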
Ok, so this would basically involve making a customized synapse, correct?
Also - Are there any advanced search operators?
Additionally, does sqlite have any inherent advantages in terms of this sort of problem?
@erikjohnston a couple more comments and points -
Thinking of matrix.org as a distributed system, where rooms exist across multiple servers, getting people who use CJK languages (no spaces between words to delimit search tokens) to adopt it will be a challenge if search does not "just work". I am wondering whether this is practical, or whether it will mess up other servers when rooms containing various languages are federated to servers that don't have the special synapse or postgresql setup.
My use case is largely inviting my employees or clients to the rooms on my homeserver, and getting them to connect via vector or other clients to that homeserver, which would be set up with the appropriate settings and indexes.
For others doing more federation, it could be a challenge.
Just a thought.
Hi @erikjohnston, when you say "install Japanese into postgres" are you talking about the steps I linked above to do that?
Sincerely, Rick
Those instructions probably work, but I haven't actually ever tried installing additional languages myself, I just know it's possible :) If you do manage it, it would be great if you could share some of the details!
Ok, of course.
Just got a server to run this locally, since VPSs are generally a bit space constrained for an organization expecting a lot of usage / busy rooms. Most of the entries will be in Japanese so I need to make the search work.
I have the same issue with Japanese full text search, and it's a serious problem. I have learned about PGroonga (http://pgroonga.github.io/), which looks like a good solution.
However, I'm not sure whether this would be a practical solution here, or where I would have to make changes to make it work. Ideally, PGroonga would be used whenever it is available.
It shouldn't be very complex (you don't need separate solutions for C, J, and K). A possible solution could be
And here is what Mattermost does, in a more complicated way: https://docs.mattermost.com/install/i18n.html
Another way is to go with database-level solutions:
BTW, I use CJK languages, so I'm able to help. P.P.S. It's still in the scope of Unicode for sure. It's just a non-Latin search problem.
Can confirm we still have an issue searching CJK messages. I think we can remove "non-unicode" from the issue title, since the text is Unicode; it's a non-Latin search problem.
The issue can be reproduced with Arabic, Hebrew (with symbols), and Hindi too.
This issue is still relevant. By the way, I don't think this is a minor issue, since the chat history searching functionality is essentially broken in every room using CJK languages. Many matrix rooms using CJK languages belong to the open-source community. Maybe I'm exaggerating, but having no way to search for historical messages makes matrix no better than mailing lists for CJK users, because even with a mailing list you can search with CJK characters (using grep).
I think implementing the full-text search with Postgres is not suitable for non-Latin languages. We should use things like Apache Solr instead.
However, if we insist on using Postgres for full-text search, then at least we can replace calls to to_tsvector with vectors pre-calculated in Python, using multilingual tokenizing libraries. Database migration might be required in this case.
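To make the pre-calculation idea concrete, here is a minimal sketch that uses only the standard library and splits CJK runs into overlapping two-character tokens (the same general idea as pg_bigm, not its implementation). A real setup would more likely use a proper multilingual tokenizer. The function names are hypothetical, and the Unicode-name check is a rough heuristic.

```python
import unicodedata
from itertools import groupby


def is_cjk(ch):
    """Rough heuristic: ideographs, kana, and Hangul, judged by Unicode name."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA", "HANGUL"))


def _kind(ch):
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def tokenize(text):
    """Split text into lowercased words plus overlapping CJK bigrams."""
    tokens = []
    for kind, group in groupby(text, key=_kind):
        run = "".join(group)
        if kind == "word":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                tokens.append(run)
            else:
                # Overlapping pairs, so any substring of 2+ characters is covered.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def matches(query, document):
    """Approximate match: every query token appears somewhere in the document."""
    return set(tokenize(query)) <= set(tokenize(document))


print(tokenize("バッタと鈴虫 Test"))
print(matches("鈴虫", "バッタと鈴虫"))
```

With tokens like these, a term from the middle of an unspaced phrase becomes findable. The match ignores token order, so it can return false positives that a positional check would filter out.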
Could we elevate the occurrence from O-Occasional to a higher level, considering that it should be impacting the majority of the world population?
Hi - on a synapse homeserver running on postgresql 9.4, and connecting via vector, I'm having trouble searching Japanese.
If I enter:
バッタと鈴虫
(grasshopper and cricket), I get a search hit when I search for バッタ, but not when I search for 鈴虫.
It appears that the beginning of a post is OK, but anything in the middle of the post is not searchable. I tried this on my own homeserver and on the main matrix homeserver, with the same result. Kent on the main forum also reproduced it.
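One plausible explanation, sketched below under two assumptions that have not been verified against Synapse or PostgreSQL here: the unspaced phrase is indexed as a single token, and the query is matched as a prefix of indexed tokens. The helper names are hypothetical.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a parser with no CJK segmentation might."""
    return text.lower().split()


def prefix_search(query, document):
    """True if every query token is a prefix of some document token."""
    doc_tokens = whitespace_tokens(document)
    return all(
        any(token.startswith(q) for token in doc_tokens)
        for q in whitespace_tokens(query)
    )


post = "バッタと鈴虫"
print(prefix_search("バッタ", post))  # True: the single token starts with バッタ
print(prefix_search("鈴虫", post))    # False: 鈴虫 only occurs mid-token
```

Under those assumptions the beginning of a post matches and the middle does not, which lines up with the behaviour described above.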