infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for custom dictionaries.
Apache License 2.0
16.48k stars 3.27k forks

How can rare characters in classical Chinese texts be tokenized? Some Han characters in the Unicode extension blocks get filtered out #1068

Open gwisdomroof opened 1 month ago

gwisdomroof commented 1 month ago

Description

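When processing classical Chinese texts with the IK analyzer, I found that it automatically filters out some rare characters that belong to the Unicode extension blocks. How can this be solved?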

Steps to reproduce

For example, analyze the string "习𮊸𨻸𰄊𰶃".
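To see which characters in a sample lie outside the Basic Multilingual Plane (BMP), a small check like the one below may help. It is only a diagnostic sketch: the helper names `utf16_units` and `classify` are made up for illustration, and the thread does not confirm that surrogate handling is what causes the filtering. The relevant background is that Java strings are UTF-16, so any code point above U+FFFF occupies two `char` values there.

```python
def utf16_units(ch: str) -> int:
    """Return how many UTF-16 code units one code point needs (1 or 2)."""
    return len(ch.encode("utf-16-le")) // 2


def classify(text: str) -> list[tuple[str, str, int]]:
    """Return (character, 'U+XXXX' label, UTF-16 unit count) per code point."""
    return [(ch, f"U+{ord(ch):04X}", utf16_units(ch)) for ch in text]


if __name__ == "__main__":
    # Sample string taken from this issue.
    sample = "习𮊸𨻸𰄊𰶃"
    for ch, code_point, units in classify(sample):
        plane = "BMP" if units == 1 else "supplementary"
        print(ch, code_point, units, plane)
```

Characters reported as supplementary are the ones worth checking first in the analyzer output.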

Expected behavior

I expect all of these characters to be tokenized correctly.

Environment

Versions: Elasticsearch 7.17.9 (Docker)

yangzhongke commented 2 weeks ago

A new PR has resolved this problem; please update: https://github.com/infinilabs/analysis-ik/pull/1071 Please verify it and then close this issue.
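One way to verify the fix is to send the sample string to the Elasticsearch `_analyze` API and check that every character shows up in some token. The sketch below rests on assumptions not stated in this thread: a node reachable at `http://localhost:9200` without authentication, and the `ik_max_word` analyzer (the issue does not say which IK analyzer was used). The function names are hypothetical helpers, not part of the plugin.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_max_word",
                          base_url="http://localhost:9200"):
    """Build a POST request for the _analyze API (host and analyzer are assumptions)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(text, **kwargs):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


def missing_characters(text, tokens):
    """Return the characters of text that appear in none of the tokens."""
    covered = set("".join(tokens))
    return [ch for ch in text if ch not in covered]


if __name__ == "__main__":
    sample = "习𮊸𨻸𰄊𰶃"  # sample string from this issue
    tokens = analyze_tokens(sample)
    print("tokens:", tokens)
    print("dropped:", missing_characters(sample, tokens))
```

If the list of dropped characters is empty after upgrading, the sample from this issue is being tokenized in full.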