scambier / obsidian-omnisearch

A search engine that "just works" for Obsidian. Supports OCR and PDF indexing.
GNU General Public License v3.0

Tokenization of Chinese sentences #33

Closed · Quorafind closed this issue 2 years ago

Quorafind commented 2 years ago

Problem description:

When searching for a word in a non-English sentence, especially a CJK-language sentence, only words at the beginning of the sentence can be retrieved. For example:

这里利用的是维基百科的结构化知识,加上后两者足够有效精确的知识导向。尝试用一个番茄钟内得到的知识输出成一个很小的文档或者思维导图。然后根据里面的内容利用费曼技巧进行学习。

When I wanted to search for 精确, nothing showed up.


scambier commented 2 years ago

Unfortunately, since Chinese words aren't separated by spaces, Omnisearch's tokenization can't work on Chinese sentences.

The easiest way around it would be to use a dictionary to correctly tokenize sentences (maybe with this module). But that should be an optional dependency for the plugin, since the dictionary is quite heavy.

Edit - if Chinese native speakers wish to chime in and provide input on how to solve this, they're welcome.
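
To make the failure mode concrete, here is an illustrative sketch of how whitespace-based tokenization behaves on the example above. This is not Omnisearch's actual tokenizer, and naiveTokenize is a hypothetical helper used only for illustration:

```ts
// Illustrative only: a naive tokenizer that splits on spaces and punctuation,
// in the spirit of what works for Latin-script text.
function naiveTokenize(text: string): string[] {
  return text.split(/[\s,.;:!?，。]+/u).filter((t) => t.length > 0)
}

// Latin text splits into useful tokens:
naiveTokenize('search engine for Obsidian')
// -> ['search', 'engine', 'for', 'Obsidian']

// A Chinese clause has no spaces, so it comes back as a single token.
// A query like 精确 can then only match as a prefix of that token,
// which is why only words at the start of a clause are found:
naiveTokenize('加上后两者足够有效精确的知识导向')
// -> ['加上后两者足够有效精确的知识导向']
```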

Quorafind commented 2 years ago

I think Intl.Segmenter would help: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
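
For reference, a minimal sketch of word-granularity segmentation with the built-in Intl.Segmenter, assuming the zh locale and an isWordLike filter; this is not necessarily the exact code later used in the plugin:

```ts
// Sketch: word-granularity segmentation with Intl.Segmenter.
// The 'zh' locale and the isWordLike filter are assumptions for illustration.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' })

function segmentWords(text: string): string[] {
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation-only segments
    .map((s) => s.segment)
}

segmentWords('加上后两者足够有效精确的知识导向')
// -> e.g. ['加上', '后', '两者', '足够', '有效', '精确', '的', '知识', '导', '向']
// Roughly matches the output quoted in the next comment; the exact split
// depends on the ICU data shipped with the JavaScript runtime.
```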

scambier commented 2 years ago

@Quorafind I just tried it quickly with your example; could you tell me if this tokenization is correct?

这里 | 利用 | 的是 | 维基 | 百科 | 的 | 结构 | 化 | 知识 | , | 加上 | 后 | 两者 | 足够 | 有效 | 精确 | 的 | 知识 | 导 | 向 | 。 | 尝试 | 用 | 一个 | 番茄 | 钟 | 内 | 得到 | 的 | 知识 | 输出 | 成 | 一个 | 很小 | 的 | 文 | 档 | 或者 | 思维 | 导 | 图 | 。 | 然后 | 根据 | 里面 | 的 | 内容 | 利用 | 费 | 曼 | 技巧 | 进行 | 学习 | 。

Quorafind commented 2 years ago

It's right; only a small part of it has minor problems. For example, 费曼 should be a single word rather than 费 | 曼, because it's a name. But it works well enough, and I think it avoids the need to install a big package.

scambier commented 2 years ago

Fixed with PR #37, thanks for your help.