Tokenizer outputs spaces as tokens - Githubissues

mhfc007 commented 7 years ago

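I configured everything according to Readme.md.

However, the tokenizer outputs " " (a space) as a token, as well as "的" and punctuation marks. How can I filter these tokens out?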

hankcs commented 7 years ago

https://github.com/hankcs/HanLP/blob/master/src/test/java/com/hankcs/demo/DemoStopWord.java

xuxucode commented 6 years ago

After setting stopWordDictionaryPath to stopwords_hanlp.txt, only a single space gets filtered out. Two consecutive spaces show up as [2020]. The configuration is as follows:

  <analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" 
      enableIndexMode="true" 
      stopWordDictionaryPath="/var/solr/stopwords_hanlp.txt"
    />
  </analyzer>

The incorrect result is shown in the screenshot, with [2020] appearing in the middle. What character is "[2020]"?

[Screenshot: hanlp_stopwords]

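I also tried filtering the characters with a solr.StopFilterFactory filter, but the problem remains: neither [20] nor [2020] gets filtered out. The configuration is as follows: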

  <analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
      enableIndexMode="true"
    />
    <filter class="solr.StopFilterFactory" 
      ignoreCase="true" 
      words="/var/solr/stopwords_hanlp.txt"
    />
  </analyzer>

As a result, the "space" token ends up being the most frequently indexed term:

[Screenshot: hanlp_stopwords_term]

hankcs commented 6 years ago

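By definition, tokenization splits the original text into segments. It is not responsible for preprocessing.
The tokenizer must emit spaces as tokens, otherwise highlighting would be misaligned. The same rule applies to other characters such as tabs and newlines.
If you don't want a given segment to appear in the index, use the stop-word mechanism.
20 is the hexadecimal code of the space character (0x20). To filter it, type an actual space in the stop-word dictionary, not the characters "20".
These symbols are generally tagged with the part of speech w, so you can write your own code to filter them. Filtering by specific parts of speech may become a configuration option in the future, but the feature is so simple that there is not much motivation to implement it.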
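
The points above can be sketched in plain Java. This is only an illustration, not HanLP or Solr code: the class `TokenFilterSketch` and its helper methods are hypothetical names, tokens are modeled as plain strings rather than HanLP terms, and the punctuation check uses Unicode categories as a stand-in for the part-of-speech tag w. It also assumes that stop-word lookup is an exact string match, which would explain why a dictionary entry of one space does not remove a token made of two spaces (displayed as [2020]).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper class for illustration; not part of HanLP or Solr.
class TokenFilterSketch {

    /** Renders a token's UTF-8 bytes as hex, e.g. two spaces become "2020". */
    static String toHex(String token) {
        StringBuilder sb = new StringBuilder();
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** True if the token is empty or contains only whitespace (spaces, tabs, newlines). */
    static boolean isBlank(String token) {
        return token.chars().allMatch(Character::isWhitespace);
    }

    /** True if the token is non-empty and every code point is Unicode punctuation. */
    static boolean isPunctuation(String token) {
        if (token.isEmpty()) {
            return false;
        }
        return token.codePoints().allMatch(cp -> {
            switch (Character.getType(cp)) {
                case Character.CONNECTOR_PUNCTUATION:
                case Character.DASH_PUNCTUATION:
                case Character.START_PUNCTUATION:
                case Character.END_PUNCTUATION:
                case Character.INITIAL_QUOTE_PUNCTUATION:
                case Character.FINAL_QUOTE_PUNCTUATION:
                case Character.OTHER_PUNCTUATION:
                    return true;
                default:
                    return false;
            }
        });
    }

    /** Keeps only tokens that are not blank, not punctuation, and not stop words. */
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (isBlank(token) || isPunctuation(token) || stopWords.contains(token)) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // One space and two spaces are different strings, so an exact-match
        // stop-word set holding " " does not contain "  ".
        Set<String> stopWords = Set.of(" ", "的");
        System.out.println(toHex(" "));                // hex of a single space
        System.out.println(toHex("  "));               // hex of two spaces
        System.out.println(stopWords.contains("  "));  // exact-match lookup

        List<String> tokens = List.of("全文", " ", "  ", "的", "检索", "。");
        System.out.println(filter(tokens, stopWords));
    }
}
```

Checking for whitespace-only tokens, instead of listing every run length of spaces in a stop-word file, covers one space, two spaces, tabs, and newlines with a single rule. The filtering happens after tokenization, so the tokenizer can still emit whitespace tokens and keep highlight offsets aligned.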

hankcs / hanlp-lucene-plugin

Tokenizer outputs spaces as tokens #19