vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Segmented And behaviour with weakAnd for CJK languages #30558

Open jobergum opened 4 months ago

jobergum commented 4 months ago

With Lucene linguistics, we have better support for CJK languages than with the default linguistics implementation, which is based on Apache OpenNLP. However, when testing I see some query rewriting that leads to recall issues related to segmented AND (SAND).

Take the query 山口県の各市町村の観光客数は、どのように変化してきましたか? ("How has the number of tourists in each city and town in Yamaguchi Prefecture changed over time?"), sent with this request:

search/?query=山口県の各市町村の観光客数は、どのように変化してきましたか?&language=ja&tracelevel=9&summary=my-debug-summary&yql=select%20*%20from%20sources%20text%20where%20userQuery()%20or%20id%20contains%20"doc-ja-104
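For anyone who wants to reproduce this, here is a minimal sketch of building the same request in Python. The parameter names and values come from the request above; the host and port in `ENDPOINT` and the helper name `build_search_url` are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local query endpoint; the issue only shows the "search/" path.
ENDPOINT = "http://localhost:8080/search/"


def build_search_url(user_query: str, doc_id: str) -> str:
    """Build the debug request, percent-encoding every parameter as UTF-8."""
    params = {
        "query": user_query,
        "language": "ja",
        "tracelevel": 9,
        "summary": "my-debug-summary",
        # The OR on id forces the known document into the result set.
        "yql": f'select * from sources text where userQuery() or id contains "{doc_id}"',
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_search_url(
    "山口県の各市町村の観光客数は、どのように変化してきましたか?", "doc-ja-104"
)
# Decode the URL again to confirm the parameters survive the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Encoding the parameters explicitly avoids any ambiguity about how the raw Japanese text and the quoted id are transmitted.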

It is parsed as follows:

YQL+ query parsed: [
select * from text where ((default contains ({origin: {original: "\u5C71\u53E3\u770C\u306E\u5404\u5E02\u753A\u6751\u306E\u89B3\u5149\u5BA2\u6570\u306F", offset: 0, length: 14}, andSegmenting: true}phrase("\u5C71\u53E3", "\u53E3\u770C", "\u770C\u306E", "\u306E\u5404", "\u5404\u5E02", "\u5E02\u753A", "\u753A\u6751", "\u6751\u306E", "\u306E\u89B3", "\u89B3\u5149", "\u5149\u5BA2", "\u5BA2\u6570", "\u6570\u306F")) AND default contains ({origin: {original: "\u3069\u306E\u3088\u3046\u306B\u5909\u5316\u3057\u3066\u304D\u307E\u3057\u305F\u304B", offset: 0, length: 14}, andSegmenting: true}phrase("\u3069\u306E", "\u306E\u3088", "\u3088\u3046", "\u3046\u306B", "\u306B\u5909", "\u5909\u5316", "\u5316\u3057", "\u3057\u3066", "\u3066\u304D", "\u304D\u307E", "\u307E\u3057", "\u3057\u305F", "\u305F\u304B"))) OR id contains ({origin: {original: "doc-ja-104", offset: 0, length: 10}}phrase("doc", "ja", "104")))
OR{
  AND{
    SAND[isFromQuery=true isFromUser=true locked=true rawWord="山口県の各市町村の観光客数は" stemmed=false]{
      WORD[connectedItem=0 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
        "山口"
      }
      WORD[%id=0 connectedItem=1 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
        "口県"
      }
      WORD[%id=1 connectedItem=2 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
        "県の"
      }
      WORD[%id=2 connectedItem=3 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
        "の各"
      }
      WORD[%id=3 connectedItem=4 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
        "各市"
      }
      WORD[%id=4 connectedItem=5 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
        "市町"
      }
      WORD[%id=5 connectedItem=6 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
        "町村"
      }
      WORD[%id=6 connectedItem=7 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
        "村の"
      }
      WORD[%id=7 connectedItem=8 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
        "の観"
      }
      WORD[%id=8 connectedItem=9 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
        "観光"
      }
      WORD[%id=9 connectedItem=10 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
        "光客"
      }
      WORD[%id=10 connectedItem=11 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
        "客数"
      }
      WORD[%id=11 connectedItem=12 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
        "数は"
      }
    }
    SAND[isFromQuery=true isFromUser=true locked=true rawWord="どのように変化してきましたか" stemmed=false]{
      WORD[%id=12 connectedItem=13 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
        "どの"
      }
      WORD[%id=13 connectedItem=14 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
        "のよ"
      }
      WORD[%id=14 connectedItem=15 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
        "よう"
      }
      WORD[%id=15 connectedItem=16 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
        "うに"
      }
      WORD[%id=16 connectedItem=17 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
        "に変"
      }
      WORD[%id=17 connectedItem=18 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
        "変化"
      }
      WORD[%id=18 connectedItem=19 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
        "化し"
      }
      WORD[%id=19 connectedItem=20 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
        "して"
      }
      WORD[%id=20 connectedItem=21 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
        "てき"
      }
      WORD[%id=21 connectedItem=22 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
        "きま"
      }
      WORD[%id=22 connectedItem=23 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
        "まし"
      }
      WORD[%id=23 connectedItem=24 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
        "した"
      }
      WORD[%id=24 fromSegmented=true index="" origin="(15 29)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
        "たか"
      }
    }
  }
  SPHRASE[explicit=false index="id" isFromQuery=true isFromUser=true locked=true rawWord="doc-ja-104" stemmed=false]{
    WORD[fromSegmented=false index="id" origin=null segmentIndex=0 stemmed=false words=true]{
      "doc"
    }
    WORD[fromSegmented=false index="id" origin=null segmentIndex=0 stemmed=false words=true]{
      "ja"
    }
    WORD[fromSegmented=false index="id" origin=null segmentIndex=0 stemmed=false words=true]{
      "104"
    }
  }
}

]"

The same query in a shorter format:

query=[OR (WEAKAND(100) (AND (SAND 山口 口県 県の の各 各市 市町 町村 村の の観 観光 光客 客数 数は) (SAND どの のよ よう うに に変 変化 化し して てき きま まし した たか))) (SAND id:doc id:ja id:104)] 
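The tokens in the trace look like overlapping character bigrams of each segment. The sketch below reproduces the token lists under that assumption; it is not the Lucene analyzer itself, and `overlapping_bigrams` is a hypothetical helper name.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Return every pair of adjacent characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The two segments on either side of the ideographic comma in the query.
first = overlapping_bigrams("山口県の各市町村の観光客数は")
second = overlapping_bigrams("どのように変化してきましたか")

print(len(first), " ".join(first))
print(len(second), " ".join(second))
```

Each segment is 14 characters long, so each yields 13 bigrams, which matches the 13 WORD items under each SAND in the trace.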

This retrieves 0 documents (apart from the one I force into the result with the OR on id). It happens even though the segmented tokens are correct and the document contains at least one of the tokens (山口).
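To illustrate why this hurts recall, here is a toy model, not Vespa's actual matching code. It assumes an AND of SANDs behaves as "every bigram of every segment must be present" and compares that with an OR-like relaxation where any single bigram is enough. The document token set is hypothetical.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Return every pair of adjacent characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


SEGMENTS = [
    overlapping_bigrams("山口県の各市町村の観光客数は"),
    overlapping_bigrams("どのように変化してきましたか"),
]


def and_of_sands_matches(doc_tokens: set[str]) -> bool:
    """Toy AND(SAND, SAND): every bigram of every segment must be present."""
    return all(all(tok in doc_tokens for tok in seg) for seg in SEGMENTS)


def any_token_matches(doc_tokens: set[str]) -> bool:
    """Toy OR-like relaxation: a single matching bigram is enough."""
    return any(tok in doc_tokens for seg in SEGMENTS for tok in seg)


# Hypothetical document that mentions Yamaguchi and tourism but not the full question.
doc_tokens = {"山口", "観光"}

print(and_of_sands_matches(doc_tokens))  # False
print(any_token_matches(doc_tokens))     # True
```

Under this simplification, a document has to contain all 26 bigrams to match, which a relevant document phrased differently from the question will rarely do.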

jobergum commented 4 months ago

If we rewrite 山口県の各市町村の観光客数は、どのように変化してきましたか? to 山口県の各市町村の観光客数は どのように変化してきましたか? by replacing the ideographic comma (、) with a regular space, we get a weakAnd with two SAND arguments, rather than a weakAnd wrapping an AND of two SAND operators.
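As a sketch, that rewrite could be applied on the client side before the query is sent. The function name is hypothetical, and this only covers the ideographic comma seen in this query, not other CJK punctuation.

```python
IDEOGRAPHIC_COMMA = "\u3001"  # the character 、


def replace_ideographic_comma(query: str) -> str:
    """Replace each ideographic comma with a regular ASCII space."""
    return query.replace(IDEOGRAPHIC_COMMA, " ")


original = "山口県の各市町村の観光客数は、どのように変化してきましたか?"
print(replace_ideographic_comma(original))
```

This is a workaround for testing the parse difference, not a fix for the underlying query rewriting.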

{
  WEAKAND[N=100]{
    SAND[isFromQuery=true isFromUser=true locked=true rawWord="山口県の各市町村の観光客数は" stemmed=false]{
      WORD[connectedItem=0 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
        "山口"
      }
      WORD[%id=0 connectedItem=1 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
        "口県"
      }
      WORD[%id=1 connectedItem=2 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
        "県の"
      }
      WORD[%id=2 connectedItem=3 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
        "の各"
      }
      WORD[%id=3 connectedItem=4 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
        "各市"
      }
      WORD[%id=4 connectedItem=5 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
        "市町"
      }
      WORD[%id=5 connectedItem=6 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
        "町村"
      }
      WORD[%id=6 connectedItem=7 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
        "村の"
      }
      WORD[%id=7 connectedItem=8 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
        "の観"
      }
      WORD[%id=8 connectedItem=9 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
        "観光"
      }
      WORD[%id=9 connectedItem=10 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
        "光客"
      }
      WORD[%id=10 connectedItem=11 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
        "客数"
      }
      WORD[%id=11 fromSegmented=true index="" origin="(0 14)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
        "数は"
      }
    }
    SAND[isFromQuery=true isFromUser=true locked=true rawWord="どのように変化してきましたか" stemmed=false]{
      WORD[connectedItem=12 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
        "どの"
      }
      WORD[%id=12 connectedItem=13 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
        "のよ"
      }
      WORD[%id=13 connectedItem=14 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
        "よう"
      }
      WORD[%id=14 connectedItem=15 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
        "うに"
      }
      WORD[%id=15 connectedItem=16 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
        "に変"
      }
      WORD[%id=16 connectedItem=17 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
        "変化"
      }
      WORD[%id=17 connectedItem=18 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
        "化し"
      }
      WORD[%id=18 connectedItem=19 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
        "して"
      }
      WORD[%id=19 connectedItem=20 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
        "てき"
      }
      WORD[%id=20 connectedItem=21 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
        "きま"
      }
      WORD[%id=21 connectedItem=22 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
        "まし"
      }
      WORD[%id=22 connectedItem=23 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
        "した"
      }
      WORD[%id=23 fromSegmented=true index="" origin="(15 29)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
        "たか"
      }
    }
  }