Open jobergum opened 4 months ago
If we rewrite 山口県の各市町村の観光客数は、どのように変化してきましたか?
to 山口県の各市町村の観光客数は どのように変化してきましたか?
by replacing 、 with a regular space, we get a weakAnd with two SAND arguments, instead of a weakAnd with an AND of two SAND operators:
{
WEAKAND[N=100]{
SAND[isFromQuery=true isFromUser=true locked=true rawWord="山口県の各市町村の観光客数は" stemmed=false]{
WORD[connectedItem=0 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
"山口"
}
WORD[%id=0 connectedItem=1 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
"口県"
}
WORD[%id=1 connectedItem=2 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
"県の"
}
WORD[%id=2 connectedItem=3 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
"の各"
}
WORD[%id=3 connectedItem=4 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
"各市"
}
WORD[%id=4 connectedItem=5 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
"市町"
}
WORD[%id=5 connectedItem=6 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
"町村"
}
WORD[%id=6 connectedItem=7 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
"村の"
}
WORD[%id=7 connectedItem=8 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
"の観"
}
WORD[%id=8 connectedItem=9 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
"観光"
}
WORD[%id=9 connectedItem=10 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
"光客"
}
WORD[%id=10 connectedItem=11 connectivity=1.0 fromSegmented=true index="" origin="(0 14)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
"客数"
}
WORD[%id=11 fromSegmented=true index="" origin="(0 14)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
"数は"
}
}
SAND[isFromQuery=true isFromUser=true locked=true rawWord="どのように変化してきましたか" stemmed=false]{
WORD[connectedItem=12 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=0 stemmed=false uniqueID=0 words=true]{
"どの"
}
WORD[%id=12 connectedItem=13 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=1 stemmed=false uniqueID=0 words=true]{
"のよ"
}
WORD[%id=13 connectedItem=14 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=2 stemmed=false uniqueID=0 words=true]{
"よう"
}
WORD[%id=14 connectedItem=15 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=3 stemmed=false uniqueID=0 words=true]{
"うに"
}
WORD[%id=15 connectedItem=16 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=4 stemmed=false uniqueID=0 words=true]{
"に変"
}
WORD[%id=16 connectedItem=17 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=5 stemmed=false uniqueID=0 words=true]{
"変化"
}
WORD[%id=17 connectedItem=18 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=6 stemmed=false uniqueID=0 words=true]{
"化し"
}
WORD[%id=18 connectedItem=19 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=7 stemmed=false uniqueID=0 words=true]{
"して"
}
WORD[%id=19 connectedItem=20 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=8 stemmed=false uniqueID=0 words=true]{
"てき"
}
WORD[%id=20 connectedItem=21 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=9 stemmed=false uniqueID=0 words=true]{
"きま"
}
WORD[%id=21 connectedItem=22 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=10 stemmed=false uniqueID=0 words=true]{
"まし"
}
WORD[%id=22 connectedItem=23 connectivity=1.0 fromSegmented=true index="" origin="(15 29)" segmentIndex=11 stemmed=false uniqueID=0 words=true]{
"した"
}
WORD[%id=23 fromSegmented=true index="" origin="(15 29)" segmentIndex=12 stemmed=false uniqueID=0 words=true]{
"たか"
}
}
}
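The WORD items in the tree above look like overlapping character bigrams of each rawWord. The following is a minimal Python sketch that reproduces those token lists. It is not the Lucene linguistics implementation; the helper name `char_bigrams` is hypothetical, and the bigram reading itself is an assumption based on the listed tokens.

```python
def char_bigrams(text: str) -> list[str]:
    """Return the overlapping two-character windows of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The two rawWord values from the SAND items in the query tree.
first = char_bigrams("山口県の各市町村の観光客数は")
second = char_bigrams("どのように変化してきましたか")

# Each rawWord has 14 characters, so each yields 13 bigrams,
# matching segmentIndex 0..12 in the tree.
print(len(first), first[:3], first[-1])
print(len(second), second[:3], second[-1])
```

Under this reading, each SAND carries 13 segments.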
With Lucene linguistics, we have better support for CJK languages than with the default linguistics implementation based on Apache OpenNLP. However, when testing, I see some query rewriting that leads to recall issues involving segmented AND (SAND) items.
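To illustrate why a SAND can hurt recall, here is a simplified Python model. It assumes that a SAND matches only when every one of its segments is present in the document, and it treats weakAnd as a plain OR over its children for recall purposes. Positions, ranking, and the weakAnd target hits (N=100) are ignored. The function names and the sample document text are hypothetical.

```python
def char_bigrams(text: str) -> list[str]:
    """Overlapping two-character windows (assumed tokenization)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def sand_matches(segments: list[str], doc_tokens: set[str]) -> bool:
    """Simplified SAND: every segment must occur in the document."""
    return all(segment in doc_tokens for segment in segments)


def weakand_of_sands(children: list[list[str]], doc_tokens: set[str]) -> bool:
    """Simplified weakAnd for recall: at least one child SAND must match."""
    return any(sand_matches(child, doc_tokens) for child in children)


def flat_or(children: list[list[str]], doc_tokens: set[str]) -> bool:
    """Alternative: OR over all individual segments."""
    return any(segment in doc_tokens for child in children for segment in child)


query_children = [
    char_bigrams("山口県の各市町村の観光客数は"),
    char_bigrams("どのように変化してきましたか"),
]

# Hypothetical document that shares only some tokens (such as 山口) with the query.
doc_tokens = set(char_bigrams("山口の観光"))

print(weakand_of_sands(query_children, doc_tokens))  # False: no SAND has all its segments
print(flat_or(query_children, doc_tokens))           # True: 山口 alone is enough
```

In this model, a document is recalled only if it contains all 13 segments of at least one SAND, which is far stricter than matching any single token.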
The query "How has the number of tourists in each city and town in Yamaguchi Prefecture changed over time?" (山口県の各市町村の観光客数は、どのように変化してきましたか?)
is parsed as follows:
Shorter format
This leads to 0 documents being retrieved (unless I use the trick shown here of retrieving with an OR). This happens even though the segmented tokens are correct and the document contains at least one of the tokens (山口):