Open fengkx opened 4 years ago
hello @fengkx, I have written a tool to index Chinese entries and support full-text search. It extracts all entries from the database, segments title and document, converts the segmented text to tsvector, and saves it to document_vectors.
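As a rough sketch of that last step (the entries table and document_vectors column are the names mentioned in this thread; the pre-segmented, space-separated text and the A/B weighting of title versus document are my assumptions, not necessarily what the tool actually does):

```sql
-- Store externally segmented (space-separated) text as a tsvector.
-- The 'simple' configuration is enough here because word splitting has
-- already been done by the segmenter; the A/B weights are illustrative.
UPDATE entries
SET document_vectors =
      setweight(to_tsvector('simple', '黑猫 新闻 标题'), 'A') ||
      setweight(to_tsvector('simple', '这 是 分词 之后 的 正文'), 'B')
WHERE id = 42;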
Great job!
I am using zhparser to do full-text search, and here is my setup (docker-compose). By setting default_text_search_config = 'chinese' in postgresql.conf, the default search configuration can be overridden. I am settled with this solution, except that I'm not fully satisfied with zhparser's current segmentation results.
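For reference, a minimal sketch of such a zhparser setup (the configuration name 'chinese' and the token-type mapping follow zhparser's README; adapt to your own installation):

```sql
-- Requires the zhparser extension (built on SCWS) to be installed.
CREATE EXTENSION IF NOT EXISTS zhparser;

-- Create a text search configuration backed by the zhparser parser.
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);

-- Index the main token types: nouns, verbs, adjectives, idioms, etc.
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR n, v, a, i, e, l WITH simple;

-- Then, in postgresql.conf:
--   default_text_search_config = 'chinese'
```

With default_text_search_config set this way, to_tsvector/to_tsquery calls that omit the configuration argument pick up the Chinese configuration without changing miniflux itself.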
The first run may consume more than 1 GB of memory, so please ensure you have enough.
My instance is hosted on a VPS with little memory, so I can't give it a try for now. I read your code; it seems to use the default dictionary of sego. How do its segmentation results compare to zhparser's?
@fengkx I have not noticed any issues with the segmentation results. They seem pretty accurate.
I have tried using the solution above for some time and found tuning the segmentation and searching to be hard work.
Maybe, instead of adding built-in segmentation support for languages without spaces between words, we could abstract the search function as an interface and let users of miniflux implement their own search strategy?
(My plan is to try to index and search with meilisearch.)
I have been paying attention to the miniflux Chinese search problem. The solution I am using now is postgres + zhparser. I have compared plain postgres, postgres + zhparser, and pgroonga. When searching for the same keywords without modifying miniflux, postgres + zhparser returns the most entries, plain postgres returns some entries, and pgroonga cannot find any. I'm wondering whether miniflux could add support for other search engines as an extension. Note that I am not a professional web developer; I am just interested in researching various self-hosted applications.
Languages such as Chinese and Japanese have no spaces between words, so PostgreSQL's to_tsvector can't split them well by default. But we can set a custom TEXT SEARCH CONFIGURATION in PostgreSQL; by using an extension, the text can be split properly.
to_tsvector and to_tsquery both accept a configuration as their first parameter.
For now we don't provide any configuration, e.g. https://github.com/miniflux/miniflux/blob/master/storage/entry.go#L63
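To illustrate the difference (the 'chinese' configuration name is assumed to come from a zhparser-style setup as above; the query below is only a sketch, not the exact code at the link):

```sql
-- With the default parser, a run of CJK characters is typically kept as a
-- single token, so searching for one word inside it finds nothing.
SELECT to_tsvector('simple', '我想看黑猫新闻');

-- With a segmentation-aware configuration passed as the first argument,
-- the sentence is split into words and can be matched word by word.
SELECT to_tsvector('chinese', '我想看黑猫新闻');

-- A search query could then pass the same configuration explicitly:
SELECT id, title FROM entries
WHERE document_vectors @@ plainto_tsquery('chinese', '黑猫');
```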
Is it possible to determine which configuration to use by checking the Unicode range?
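One possible heuristic, purely as a sketch (this is not existing miniflux behaviour, and the 'chinese' configuration is assumed to exist as above): pick the configuration based on whether the query contains characters from the CJK Unified Ideographs range.

```sql
-- Choose a text search configuration based on the characters in the query.
-- '[一-龥]' covers U+4E00..U+9FA5 (the common CJK Unified Ideographs range);
-- a real implementation would also need to handle kana, hangul, etc.
SELECT CASE
         WHEN '黑猫新闻' ~ '[一-龥]' THEN 'chinese'::regconfig
         ELSE get_current_ts_config()
       END AS search_config;
```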