Open 1l0 opened 4 years ago
Hey, I don't have any experience with CJK sentences. Do you have any suggestions on how EliasDB could support this? Maybe a config option in eliasdb.config.json
that lets you define a list of "separator" characters?
If we look at the introduction of Ruby in Japanese here: https://www.ruby-lang.org/ja/, we see this:
オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。 エレガントな文法を持ち、自然に読み書きができます。
Neither spaces nor anything else is used to separate the words. We only have the comma 、 and the end-of-sentence mark 。. In CJK languages the reader has to find the word boundaries based on grammar or dictionaries, so defining a list of separator characters will not solve this. Rather, EliasDB should be extended to make it possible to search for non-delimited substrings, which is generally useful anyway.
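To make the point concrete, here is a minimal Go sketch comparing the two ideas on part of the Ruby sentence above. It is not EliasDB code: `splitOnSeparators`, `containsToken` and `containsSubstring` are hypothetical helpers written only for this illustration, and the separator list `" 、。"` is an assumed example value for such a config option.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSeparators splits text on a configurable set of separator runes.
// This models the "list of separator characters" idea (hypothetical helper).
func splitOnSeparators(text, separators string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return strings.ContainsRune(separators, r)
	})
}

// containsToken reports whether any token is exactly equal to word,
// which is how a lookup in a word-based index behaves.
func containsToken(tokens []string, word string) bool {
	for _, t := range tokens {
		if t == word {
			return true
		}
	}
	return false
}

// containsSubstring reports whether word occurs anywhere in text,
// regardless of delimiters.
func containsSubstring(text, word string) bool {
	return strings.Contains(text, word)
}

func main() {
	text := "オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。"

	// Assumed separator list: space, 、 and 。
	tokens := splitOnSeparators(text, " 、。")

	fmt.Println(len(tokens))                     // 2: the "words" are whole clauses
	fmt.Println(containsToken(tokens, "言語"))     // false: 言語 is buried inside a clause
	fmt.Println(containsSubstring(text, "言語")) // true: substring search finds it
}
```

With separators alone, the sentence splits into two long clauses, so an exact-token lookup for 言語 ("language") fails, while a plain substring search succeeds.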
Another solution is to use a CJK text segmentation library. I just found one for Go:
Doing this for CJK requires stemming.
bleve has some of these. Gae also looks good.
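For reference, one dictionary-free technique that search libraries commonly use for CJK text (as far as I know, bleve's CJK analysis works along these lines) is to index overlapping character bigrams instead of words. The sketch below only illustrates that idea; `bigrams` is a hypothetical helper and not part of bleve's or EliasDB's API.

```go
package main

import "fmt"

// bigrams returns the overlapping two-rune tokens of text.
// Input shorter than two runes is returned as a single token;
// empty input yields nil.
func bigrams(text string) []string {
	runes := []rune(text)
	if len(runes) == 0 {
		return nil
	}
	if len(runes) == 1 {
		return []string{text}
	}
	out := make([]string, 0, len(runes)-1)
	for i := 0; i+1 < len(runes); i++ {
		out = append(out, string(runes[i:i+2]))
	}
	return out
}

func main() {
	// Each adjacent pair of characters becomes an index token.
	fmt.Println(bigrams("動的な言語")) // [動的 的な な言 言語]
}
```

A query for 言語 would be tokenized the same way and matched against the indexed bigrams, at the cost of a larger index and some false positives such as な言.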
CJK sentences are not separated by spaces. Currently, eliasdb cannot search for a specific word within a CJK sentence. It would be great to be able to do that.