Open fengkx opened 4 years ago
hello @fengkx, I have written a tool to index Chinese entries and support full-text search. It extracts all entries from the database, segments title and document, converts the segmented text to tsvector, and saves it to document_vectors.
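As a rough sketch of that last step (the entries table and document_vectors column are the names mentioned in this thread; the pre-segmented, space-separated text and the A/B weighting of title versus document are my assumptions, not necessarily what the tool actually does):

```sql
-- Store externally segmented (space-separated) text as a tsvector.
-- The 'simple' configuration is enough here because word splitting has
-- already been done by the segmenter; the A/B weights are illustrative.
UPDATE entries
SET document_vectors =
      setweight(to_tsvector('simple', '黑猫 新闻 标题'), 'A') ||
      setweight(to_tsvector('simple', '这 是 分词 之后 的 正文'), 'B')
WHERE id = 42;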
Great job!
I am using zhparser to do full-text search, and here is my setup (docker-compose). By setting default_text_search_config = 'chinese' in postgresql.conf, the default search configuration can be overridden. I am settled with this solution, except that I'm not fully satisfied with zhparser's current segmentation results.
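For reference, a minimal sketch of such a zhparser setup (the configuration name 'chinese' and the token-type mapping follow zhparser's README; adapt to your own installation):

```sql
-- Requires the zhparser extension (built on SCWS) to be installed.
CREATE EXTENSION IF NOT EXISTS zhparser;

-- Create a text search configuration backed by the zhparser parser.
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);

-- Index the main token types: nouns, verbs, adjectives, idioms, etc.
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR n, v, a, i, e, l WITH simple;

-- Then, in postgresql.conf:
--   default_text_search_config = 'chinese'
```

With default_text_search_config set this way, to_tsvector/to_tsquery calls that omit the configuration argument pick up the Chinese configuration without changing miniflux itself.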
The first run may consume more than 1 GB of memory, so please ensure you have enough.
My instance is hosted on a VPS with little memory, so I can't give it a try for now. I read your code; it seems to use the default dictionary of sego. How do its segmentation results compare to zhparser's?
@fengkx I have not noticed any issues with the segmentation results. They seem pretty accurate.
I have tried using the solution above for some time and found tuning the segmentation and searching to be hard work.
Maybe, instead of adding built-in segmentation support for languages without spaces between words, we could abstract the search function as an interface and let users of miniflux implement their own search strategy?
(My plan is to try to index and search with meilisearch.)
I have been paying attention to the miniflux Chinese search problem. The solution I am using now is postgres + zhparser. I have compared plain postgres, postgres + zhparser, and pgroonga. When searching for the same keywords without modifying miniflux, postgres + zhparser returns the most entries, plain postgres returns some entries, and pgroonga cannot find any. I'm wondering whether miniflux could add support for other search engines as an extension. Note that I am not a professional web developer; I am just interested in researching various self-hosted applications.
Languages such as Chinese and Japanese have no spaces between words, so PostgreSQL's to_tsvector can't split them well by default. But we can set a custom TEXT SEARCH CONFIGURATION in PostgreSQL; by using an extension, the text can be split properly.
to_tsvector and to_tsquery both accept a configuration as their first parameter.
For now we don't provide any configuration, e.g. https://github.com/miniflux/miniflux/blob/master/storage/entry.go#L63
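To illustrate the difference (the 'chinese' configuration name is assumed to come from a zhparser-style setup as above; the query below is only a sketch, not the exact code at the link):

```sql
-- With the default parser, a run of CJK characters is typically kept as a
-- single token, so searching for one word inside it finds nothing.
SELECT to_tsvector('simple', '我想看黑猫新闻');

-- With a segmentation-aware configuration passed as the first argument,
-- the sentence is split into words and can be matched word by word.
SELECT to_tsvector('chinese', '我想看黑猫新闻');

-- A search query could then pass the same configuration explicitly:
SELECT id, title FROM entries
WHERE document_vectors @@ plainto_tsquery('chinese', '黑猫');
```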
Is it possible to determine which configuration to use by checking the Unicode range?
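One possible heuristic, purely as a sketch (this is not existing miniflux behaviour, and the 'chinese' configuration is assumed to exist as above): pick the configuration based on whether the query contains characters from the CJK Unified Ideographs range.

```sql
-- Choose a text search configuration based on the characters in the query.
-- '[一-龥]' covers U+4E00..U+9FA5 (the common CJK Unified Ideographs range);
-- a real implementation would also need to handle kana, hangul, etc.
SELECT CASE
         WHEN '黑猫新闻' ~ '[一-龥]' THEN 'chinese'::regconfig
         ELSE get_current_ts_config()
       END AS search_config;
```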