Solr 5.x with the HanLP tokenizer: synonym configuration and Simplified/Traditional Chinese conversion settings

vincentchien commented 8 years ago

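Hello, author! I recently set up a test environment with the following configuration: Solr 5.2.1 + HanLP (hanlp-solr-plugin-1.0.3.jar, hanlp-portable-1.2.9.jar) + Tomcat 7.0.68 + CentOS 6.7.

The hanlp.properties configuration appears to be correct: when I enter any passage of Chinese text in the Analysis screen of the Solr Admin GUI, Simplified Chinese input is segmented according to the dictionary. With Traditional Chinese input, however, only the characters shared by Simplified and Traditional are segmented into words, and every other Traditional character is split off as a single character. I assume this comes from the dictionary setup, and that adding a Traditional Chinese dictionary would make segmentation work.

Here is my problem: I configured the synonym dictionary, but when I run the analysis, no synonym results appear on screen. Is my configuration wrong? One more detail: your dictionary contains "云崖", and the synonym dictionary defines "雲崖=云崖", yet the analysis output only ever contains "云崖"!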

hankcs commented 8 years ago

Hello, please see https://github.com/hankcs/hanlp-solr-plugin#高级配置 and enable the precise Traditional Chinese segmentation mode.

vincentchien commented 8 years ago

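hankcs, hello!

Thank you for your patient guidance. One small question: the page you linked does not mention synonym configuration. Where can I find the settings for that? Also, the two parameters customDictionaryPath and stopWordDictionaryPath are already configured in hanlp.properties; does that mean I do not need to configure them again? Thank you for your patient help.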

Vincent

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/hankcs/hanlp-solr-plugin/issues/1#issuecomment-203672421

hankcs commented 8 years ago

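1. XML can override the configuration in properties; after all, Solr users are more used to XML.
2. Stop words configured in properties cannot affect the tokenizer inside Solr. For stop words I recommend Solr's own mechanism: following standard Solr practice, put a stop word filter under the analyzer: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
3. The same goes for synonyms.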

vincentchien commented 8 years ago

Hello!

Thank you for your patient reply. Should the stop word and synonym filters be configured as follows?

Synonyms: <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
Stop words: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

And should both "synonyms.txt" and "stopwords.txt" be placed under the conf path?

Vincent

vincentchien commented 8 years ago

Hello!

Sorry to bother you again!

My schema.xml is configured as follows:

It appears to be incorrect, because it produces the following error message:

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core core0: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.hankcs.lucene.HanLPAnalyzer. Schema file is /datadisk/solr52/core0/conf/schema.xml

Could you please give me some guidance? Thank you!

Vincent

hankcs commented 8 years ago

The tokenizer class is wrong; see the documentation on the project home page.

vincentchien commented 8 years ago

hankcs, good afternoon!

The corrected configuration is as follows:

The error message is as follows:

core0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core core0: Plugin init failure for [schema.xml] fieldType "text": Cannot load analyzer: com.hankcs.lucene.HanLPTokenizerFactory. Schema file is /datadisk/solr52/core0/conf/schema.xml

One more question while I am here: https://github.com/hankcs/hanlp-solr-plugin offers hanlp-solr-plugin.jar, and the download hanlp-solr-plugin-1.1.0.zip (https://github.com/hankcs/hanlp-solr-plugin/releases/download/v1.1.0/hanlp-solr-plugin-1.1.0.zip) from the releases page (https://github.com/hankcs/hanlp-solr-plugin/releases) contains both a 1.0.3 and a 1.1.0 jar. Is it enough to deploy only the 1.1.0 jar?

hankcs commented 8 years ago

<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
    </analyzer>
</fieldType>

vincentchien commented 8 years ago

hankcs, hello!

Did you send me a reply? The previous reply I received was blank, so I wanted to check with you. Thank you!

Vincent

hankcs commented 8 years ago

Look carefully at the XML: the class on the analyzer should not be HanLPTokenizerFactory. It is the class on the tokenizer that should be.

vincentchien commented 8 years ago

hankcs, hello!

Is this configuration right? I want to use the stop word and synonym features and am not sure whether it is correct.

<fieldType name="text" class="solr.TextField">

hankcs commented 8 years ago

That should be fine.

vincentchien commented 8 years ago

hankcs, hello!

The system reports:

core0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core core0: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'org.apache.lucene.analysis.core.StopFilterFactory'. Schema file is /datadisk/solr52/core0/conf/schema.xml

hankcs commented 8 years ago

Problems related to StopFilterFactory are outside the scope of the plugin. Please Google its usage, or ask for help on the Solr project home page.
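As a rough illustration of the filter-based approach discussed in this thread, the sketch below chains Solr's stop word and synonym filters after the HanLP tokenizer. The file names stopwords.txt and synonyms.txt come from earlier messages; the field type name, the filter order, and applying the same chain at index and query time are assumptions. The sketch omits enablePositionIncrements: to my understanding Lucene 5.0 removed that attribute from StopFilterFactory, which may explain the instantiation error reported above, though nothing in this thread confirms it.

```xml
<!-- Sketch only: the field type name and filter order are assumptions. -->
<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <!-- Both files are looked up in the core's conf directory. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
</fieldType>
```

After reloading the core, the Analysis screen of the Solr Admin GUI should list the tokenizer and each filter as separate stages, which makes it easy to see where a token is dropped or expanded.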

vincentchien commented 8 years ago

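Hankcs, hello!

My adjusted configuration is as follows:

I verified in the Solr Admin GUI that synonyms now take effect. Would it also work to handle stop words through the HanLP plugin instead?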

hankcs commented 8 years ago

Yes, that works. Pick either of the two approaches.

vincentchien commented 8 years ago

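hankcs, good afternoon!

I adjusted the configuration as follows:

I verified it on the Analysis screen of the Solr Admin GUI; a screenshot is below. The stop word and synonym filters both appear, so the configuration seems to work. One question: if I want to build my own dictionary, how do I decide the part of speech for each entry?

[image: inline image 1]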

hankcs commented 8 years ago

https://github.com/hankcs/HanLP#词典说明

vincentchien commented 8 years ago

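hankcs, hello!

What is the enableTraditionalChineseMode setting for? Also, if enableNormalization is set to true, will it convert every Traditional character in the input to the corresponding Simplified character, without converting between words? For example, "辦公室" would become "办公室", not "写字楼", unless a synonym is defined. Is that right? I would appreciate your guidance. Thank you!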

Vincent

hankcs commented 8 years ago

enableTraditionalChineseMode = https://github.com/hankcs/HanLP/blob/master/src/test/java/com/hankcs/demo/DemoTraditionalChineseSegment.java
enableNormalization = https://github.com/hankcs/HanLP/blob/master/src/test/java/com/hankcs/demo/DemoNormalization.java
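For reference, a minimal sketch of how one of these switches might be set in schema.xml, assuming (as the plugin's advanced configuration section suggests) that both are accepted as boolean attributes of HanLPTokenizerFactory. The field type name is illustrative. Judging from the linked demos, enableNormalization works on characters rather than words, so it should not replace one word with a different word; treat that reading as an assumption to verify on the Analysis screen.

```xml
<!-- Sketch only: attribute support is assumed from the plugin's documentation. -->
<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <!-- enableNormalization: normalize characters (for example Traditional to Simplified) before segmentation. -->
        <!-- To try precise Traditional Chinese segmentation instead, use enableTraditionalChineseMode="true". -->
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true" enableNormalization="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false" enableNormalization="true"/>
    </analyzer>
</fieldType>
```

Using the same setting in the index and query analyzers keeps both sides producing the same token forms, so a query in one script can match text indexed in the other.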

vincentchien commented 8 years ago

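Hankcs, hello!

I have another question. For the custom dictionary, I followed the documentation on the site, but it does not work: segmentation is only correct when I put my custom entries into CustomDictionary.txt. I am attaching a screenshot of my directory listing, along with the contents of my custom dictionary file ALKeywords00.txt and of hanlp.properties. Could you check whether my configuration is wrong? Thank you!

<fieldType name="text" class="solr.TextField">

Vincent

[image: inline image 1]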

經濟成就 n 1
成就 n 1
經濟 n 1
免洗杯 n 1
一次性杯子 n 1
免洗筷 n 1
一次性筷子 n 1

hankcs commented 8 years ago

customDictionaryPath must be a txt file.

vincentchien commented 8 years ago

hankcs, good morning!

You said "customDictionaryPath must be a txt file", but I do not quite understand what you mean. Could you explain? Thank you.

Vincent

hankcs commented 8 years ago

You gave me a folder; what I need is a file.

vincentchien commented 8 years ago

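hankcs, hello!

The customDictionaryPath location contains your original dictionary files. I sent you attachments earlier, but perhaps the mail system dropped them. No matter, I will send them again: my custom dictionary file ALKeywords00.txt together with my hanlp.properties. Please take another look. Thank you!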

Vincent

vincentchien commented 8 years ago

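hankcs, hello!

Did you receive the attachments I sent? Or did I misunderstand you and send the wrong files? If you need me to send anything else, please let me know. Thank you, and sorry for the trouble!

Vincent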

vincentchien commented 8 years ago

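Hankcs, hello!

Sorry to bother you again. About the custom dictionary question I asked a few days ago: did you receive the attachments I provided? If anything is missing, please tell me and I will supply it right away. Thank you!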

Vincent

hankcs commented 8 years ago

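Hello,

Can you tell the difference between a file and a folder?
Every configuration file states, explicitly or implicitly, that the customDictionaryPath parameter takes files! (Separate multiple dictionaries with spaces.)
Can you confirm whether your customDictionaryPath actually points to files?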
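To make the file-versus-folder point concrete, here is a sketch of a tokenizer whose customDictionaryPath lists individual dictionary files separated by spaces. The directory /path/to/dictionary/custom is a hypothetical placeholder, not a path from this thread; CustomDictionary.txt and ALKeywords00.txt are the file names mentioned earlier. It assumes the attribute is set on the tokenizer in schema.xml, which would override the value in hanlp.properties as noted above.

```xml
<!-- Sketch only: the paths are hypothetical placeholders and the field type name is illustrative. -->
<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <!-- Each entry names a dictionary file, not the folder that contains it. -->
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"
                   customDictionaryPath="/path/to/dictionary/custom/CustomDictionary.txt /path/to/dictionary/custom/ALKeywords00.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"
                   customDictionaryPath="/path/to/dictionary/custom/CustomDictionary.txt /path/to/dictionary/custom/ALKeywords00.txt"/>
    </analyzer>
</fieldType>
```

A quick check is to analyze a word that exists only in the custom file, such as "免洗筷" from ALKeywords00.txt; if it still splits into single characters, the dictionary was probably not loaded.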

vincentchien commented 8 years ago

Hankcs, hello!

Thank you for your answer. I apologize for all the trouble and appreciate your patience!

Vincent

2016-04-26 0:11 GMT+08:00 hankcs notifications@github.com:

Closed #1 https://github.com/hankcs/hanlp-solr-plugin/issues/1.

vincentchien commented 8 years ago

Hankcs, hello!

My configuration is below. It does point to the absolute paths of the dictionary files, because I copied it directly from the hanlp.properties configuration. The schema.xml configuration is as follows:

<fieldType name="text" class="solr.TextField">
<analyzer type="index" enableIndexMode="true" class="com.hankcs.lucene.HanLPAnalyzer" />

I do not think I wrote anything wrong, and my custom dictionary is attached. When I enter "免洗筷" in the Analysis screen of the Solr admin interface, it is segmented into "免", "洗", "筷", which means the dictionary was not picked up and the text fell back to single-character segmentation. I also tried "一次性筷子": because your built-in dictionary contains the two words "一次性" and "筷子", it is segmented into "一次性", "筷子". The only difficulty I have now is that my custom dictionary is not being picked up, and I do not know where else the mistake could be. What other information should I provide?

Vincent

vincentchien commented 8 years ago

Hankcs, good afternoon!

Thank you for your patient guidance. Following your hint, I redid the configuration, and the custom dictionary now loads correctly. Thank you very much!

Vincent

vincentchien commented 8 years ago

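Hankcs, hello!

I have another question. I put two Traditional Chinese words, "抗體" and "免疫體", into the dictionary. When I inspect the segmentation on Solr's Analysis screen, they come out as "抗", "體" and as "免疫", "體" respectively. Have you run into this before? Is there a way to fix it?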

Vincent

vincentchien commented 8 years ago

2016-04-26 18:30 GMT+08:00 簡文森 itgvincent@gmail.com:

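Hankcs, hello!

I recently tested loading a dictionary with 7.32 million entries in total, and it failed to load. Is there a limit on the number of entries in a custom dictionary for the HanLP tokenizer? If so, is there a way around it? I would appreciate any pointers. Thank you!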

Vincent

hankcs commented 8 years ago

Nothing to be alarmed about. These are new words; just add them to the custom dictionary.

vincentchien commented 8 years ago

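Hankcs, hello!

My apologies, I probably did not describe this clearly. I have already added the two words "抗體" and "免疫體" to the custom dictionary, but the segmentation still comes out as "抗", "體" and as "免疫", "體". I first thought the problem was my dictionary, so I also tried putting the two new words into CustomDictionary.txt, and the result was still "抗", "體" and "免疫", "體". Are some characters handled specially?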

Vincent

hankcs commented 8 years ago

You should add the Simplified form.

vincentchien commented 8 years ago

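hankcs, hello!

You are right, the Simplified words are segmented correctly. My project needs to accept both Traditional and Simplified words, so how should I adjust things so that Traditional words are segmented as well?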

Vincent

hankcs commented 8 years ago

Add the Simplified form "免疫体" to the dictionary and your Traditional form will come out too. It is that simple.

vincentchien commented 8 years ago

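hankcs, hello!

So your suggestion is to build the dictionary on Simplified words, map the Traditional words to the Simplified ones through synonyms, and then a user who enters a Simplified word will retrieve both the Simplified and the Traditional forms. Is that right?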

Vincent

vincentchien commented 8 years ago

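hankcs, hello!

Do you mean setting enableNormalization to true so that all Traditional text is converted to Simplified, combined with synonyms that map Traditional words to Simplified ones, so that a user who enters a Simplified word retrieves the documents containing either form? Is that correct?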

Vincent
