krotik / eliasdb

EliasDB a graph-based database.
Mozilla Public License 2.0

Support CJK on Full Text Search #21

Open 1l0 opened 4 years ago

1l0 commented 4 years ago

Words in CJK sentences are not separated by spaces. Currently eliasdb cannot search for a specific word inside a CJK sentence. It would be great to be able to do that.

krotik commented 4 years ago

Hey, I don't have any experience with CJK sentences. Do you have any suggestions on how eliasdb could support this? Maybe a config option for eliasdb.config.json which lets you define a list of "separator" characters?

beoran commented 4 years ago

If we look at the introduction of Ruby in Japanese here: https://www.ruby-lang.org/ja/, we see this:

オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。 エレガントな文法を持ち、自然に読み書きができます。

Neither spaces nor anything else is used to separate the words. We only have the comma 、 and the end-of-sentence mark 。. In CJK languages the reader has to find the word boundaries based on grammar or dictionaries, so defining a list of separator characters will not solve this. Rather, EliasDB should be extended to make it possible to look for non-delimited substrings, something which is generally useful.

beoran commented 4 years ago

Another solution is to use a CJK text segmentation library. I just found one for Go:

https://github.com/go-ego/gse

gedw99 commented 3 years ago

This requires stemming to do CJK

bleve has some of these. gse also looks good.