Unfortunately, since Chinese words aren't separated by spaces, Omnisearch's tokenization can't work on Chinese sentences.
The easiest way around it would be to use a dictionary to correctly tokenize sentences (maybe with this module). But that should be shipped as an optional dependency for the plugin, since the dictionary is quite heavy.
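A minimal sketch of what that could look like, loading the heavy tokenizer lazily so it stays optional. The module name `some-chinese-tokenizer` and its `cut` function are placeholders for illustration, not a confirmed dependency:

```ts
// Hedged sketch: load a dictionary-based tokenizer lazily, and only when the
// text actually contains CJK characters, so the heavy dictionary remains an
// optional dependency. "some-chinese-tokenizer" is a hypothetical package.
const CJK_RANGE = /[\u4e00-\u9fff]/;

async function tokenize(text: string): Promise<string[]> {
  if (CJK_RANGE.test(text)) {
    try {
      const { cut } = await import('some-chinese-tokenizer'); // hypothetical module
      return cut(text);
    } catch {
      // Optional dependency not installed; fall through to the default tokenizer.
    }
  }
  return text.split(/\s+/).filter(Boolean);
}
```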
Edit - if Chinese native speakers wish to chime in and provide input on how to solve this, they're welcome.
I think this would help: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
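For reference, a minimal sketch of word-level segmentation with `Intl.Segmenter`, assuming a runtime that implements it (modern Chromium or Node 16+); the sample sentence is taken from the example below:

```ts
// Built-in word segmentation for Chinese, with no extra dictionary package.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const text = '尝试用一个番茄钟内得到的知识输出成一个很小的文档或者思维导图';
const tokens = [...segmenter.segment(text)]
  .filter((s) => s.isWordLike) // drop punctuation and whitespace segments
  .map((s) => s.segment);
console.log(tokens.join(' | ')); // e.g. 尝试 | 用 | 一个 | 番茄 | 钟 | ...
```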
@Quorafind I just tried it quickly with your example. Could you tell me if this tokenization is correct?
这里 | 利用 | 的是 | 维基 | 百科 | 的 | 结构 | 化 | 知识 | , | 加上 | 后 | 两者 | 足够 | 有效 | 精确 | 的 | 知识 | 导 | 向 | 。 | 尝试 | 用 | 一个 | 番茄 | 钟 | 内 | 得到 | 的 | 知识 | 输出 | 成 | 一个 | 很小 | 的 | 文 | 档 | 或者 | 思维 | 导 | 图 | 。 | 然后 | 根据 | 里面 | 的 | 内容 | 利用 | 费 | 曼 | 技巧 | 进行 | 学习 | 。
It's mostly right; only a small part of it has minor problems. For example, 费曼 should be a single word rather than 费 | 曼, because it's a name (Feynman). But it works well enough, and I don't think it requires installing a big package.
Fixed with PR #37, thanks for your help.
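For anyone reading along, the following is only a hedged sketch of how such a fix could be wired up with feature detection; it is not necessarily what PR #37 actually does:

```ts
// Use Intl.Segmenter when the runtime provides it; otherwise fall back to
// simple whitespace splitting (the behavior that already works for Latin text).
function makeTokenizer(): (text: string) => string[] {
  if (typeof Intl !== 'undefined' && 'Segmenter' in Intl) {
    const seg = new Intl.Segmenter('zh', { granularity: 'word' });
    return (text) =>
      [...seg.segment(text)].filter((s) => s.isWordLike).map((s) => s.segment);
  }
  return (text) => text.split(/\s+/).filter(Boolean);
}
```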
Problem description:
1. When searching for any word in a non-English (especially CJK) sentence, only words at the beginning of the sentence can be retrieved. For example, when I wanted to search for 精确, nothing showed up.