hankcs / hanlp-lucene-plugin

HanLP Chinese word segmentation plugin for Lucene, supporting Lucene-based systems including Solr
http://www.hankcs.com/nlp/segment/full-text-retrieval-solr-integrated-hanlp-chinese-word-segmentation.html
Apache License 2.0

Multi-valued text analysis: offsets are wrong #27

Closed · boliza closed 6 years ago

boliza commented 6 years ago

I made a rather ugly fix:

https://github.com/shikeio/elasticsearch-analysis-hanlp

kevindragon commented 6 years ago

Fixed in issue #30.

BTW @boliza: just override the end method and set the end offset there. Tracking totalOffset without writing it back to 0 in reset will cause problems; see issue #29.
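
Roughly, the standard Lucene pattern looks like this (a minimal sketch; OffsetAwareTokenizer, offsetAtt, and finalOffset are illustrative names, not necessarily the plugin's actual ones):

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Sketch of the end() override described above. finalOffset is assumed to
// track how many characters this tokenizer consumed from the current value;
// it is per-value state, so it is cleared again in reset().
public abstract class OffsetAwareTokenizer extends Tokenizer {
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    protected int finalOffset; // characters consumed from the current value

    @Override
    public void end() throws IOException {
        super.end(); // puts the attributes into their end-of-stream state
        offsetAtt.setOffset(finalOffset, finalOffset); // report the final offset
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        finalOffset = 0; // per-value state, unlike a cumulative totalOffset
    }
}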

boliza commented 6 years ago

@kevindragon Good, I'll give it a try. After I applied my fix, I did indeed find that highlighting was wrong. Good job :+1:

boliza commented 6 years ago

Testing in ES:

curl -XGET "http://127.0.0.1:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "hanlp-index",
  "text": [
    "中华人民共和国",
    "地大物博"
  ]
}'

Analysis result:

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "ns",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "nz",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "nz",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "n",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "nz",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "n",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "n",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "n",
            "position": 7
        },
        {
            "token": "地大物博",
            "start_offset": 1,
            "end_offset": 5,
            "type": "nz",
            "position": 9
        },
        {
            "token": "地大",
            "start_offset": 1,
            "end_offset": 3,
            "type": "nz",
            "position": 10
        }
    ]
}

Judging from the above, the offsets are still wrong.

hankcs commented 6 years ago

The offsets the plugin currently emits are correct:

[0:7 1] 中华人民共和国/ns
[0:2 1] 中华/nz
[1:3 1] 华人/n
[2:4 1] 人民/n
[4:7 1] 共和国/n
[4:6 1] 共和/n
[0:4 1] 地大物博/i

If there are still problems in Solr, feel free to keep reporting them.

boliza commented 6 years ago

This is the error I just got while indexing a document in ES:

[2018-01-10T17:39:01,517][DEBUG][o.e.a.b.TransportShardBulkAction] [test][3] failed to execute bulk item (index) BulkShardRequest [[test][3]] containing [index {[test][test][1], source[{
  "content":["中华人民共和国","地大物博"]
}]}]
java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1,endOffset=5,lastStartOffset=4 for field 'content'
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:767) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:481) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1717) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1462) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]

In ES, for text content like ["中华人民共和国","地大物博"], the offsets of 地大物博 need to be shifted by the offsets of the preceding values; in the example above, that means the start_offset of 地大物博 should be 8, otherwise the exception above is thrown. There is also a discussion of this here: https://github.com/elastic/elasticsearch/issues/27987
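
To make the arithmetic concrete, here is a toy, self-contained illustration (plain Java, not Lucene internals): each value's per-value offsets get shifted by the end offset accumulated so far plus an offset gap, which Lucene's Analyzer.getOffsetGap returns as 1 by default.

public class OffsetGapDemo {
    public static void main(String[] args) {
        String[] values = {"中华人民共和国", "地大物博"};
        int offsetGap = 1; // Lucene's Analyzer.getOffsetGap default
        int base = 0;      // cumulative offset of everything indexed so far
        for (String value : values) {
            // pretend the whole value is one token with per-value offsets [0, length)
            System.out.printf("%s start=%d end=%d%n", value, base, base + value.length());
            base += value.length() + offsetGap; // 7 + 1 = 8 before "地大物博"
        }
    }
}

Running it prints start=8, end=12 for 地大物博, exactly the offsets ES expects above.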

With the fix in #28 (the same as in my ES plugin), the analysis test indexes normally, and the result is:

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "ns",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "nz",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "nz",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "n",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "nz",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "n",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "n",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "n",
            "position": 7
        },
        {
            "token": "地大物博",
            "start_offset": 8,
            "end_offset": 12,
            "type": "nz",
            "position": 8
        },
        {
            "token": "地大",
            "start_offset": 8,
            "end_offset": 10,
            "type": "nz",
            "position": 9
        }
    ]
}

Could this be a difference between ES and Solr?

hankcs commented 6 years ago

I'm not too familiar with ES semantics. Does "content":["中华","地大物博"] represent two documents? Why would the offsets of these two documents need to be contiguous?

boliza commented 6 years ago

"content":["中华","地大物博"] 看作是一个 array 类型 。

@hankcs I've updated my earlier reply, please take a look.

hankcs commented 6 years ago

OK, understood. Then how is the plugin supposed to tell, at reset time, whether it is segmenting text inside an array-typed field or a non-array one? In the former case the offset must not be zeroed; in the latter it must be.

hankcs commented 6 years ago

I've pushed a new commit; please test it in ES. My guess is that ES does not call close when switching between the values of an array, so I moved the zeroing of totalOffset into close.
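
Roughly, the shape of the change is this (a simplified sketch; the actual commit differs in details, and CumulativeOffsetTokenizer is a stand-in name):

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;

public abstract class CumulativeOffsetTokenizer extends Tokenizer {
    protected int totalOffset; // grows across all values of a multi-valued field

    @Override
    public void reset() throws IOException {
        super.reset();
        // totalOffset is deliberately NOT cleared here: ES (presumably) calls
        // reset() between the values of an array field, and clearing it would
        // throw away the cumulative offset.
    }

    @Override
    public void close() throws IOException {
        super.close();
        totalOffset = 0; // close() runs after the whole field, so start fresh
    }
}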

boliza commented 6 years ago

OK. I'll send you the test results tomorrow.

boliza commented 6 years ago

@hankcs Sorry, I was ill for the past few days and couldn't verify this sooner. After testing, the offsets are correct; the analysis result is as follows:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "ns",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "nz",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "nz",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "n",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "nz",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "n",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "n",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "n",
      "position": 7
    },
    {
      "token": "地大物博",
      "start_offset": 8,
      "end_offset": 12,
      "type": "nz",
      "position": 8
    },
    {
      "token": "地大",
      "start_offset": 8,
      "end_offset": 10,
      "type": "nz",
      "position": 9
    }
  ]
}

hankcs commented 6 years ago

Oh, get well soon! This issue is now resolved.