vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

stem-annotation does not work for phrases - better documentation or support it #27659

Open kkraune opened 1 year ago

kkraune commented 1 year ago

Describe the bug: Stemming of single terms and phrases is inconsistent and confusing, and the behavior is not documented.

To Reproduce: Use the application from https://docs.vespa.ai/en/vespa-quick-start.html

This looks right - vespa query 'select * from music where album contains ({stem:false}"paneer")' tracelevel=3 language=en-US outputs

                            {
                                "message": "sc0.num0 search to dispatch: query=[album:paneer] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=2cf3dee6-68ad-4ecd-85b8-107639e3c133.1688644506104.55.default grouping=0 :  restrict=[music]"
                            },
                            {
                                "message": "Current state of query tree: WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true uniqueID=1 words=true]{\n  \"paneer\"\n}\n"
                            },

Trying a phrase: vespa query 'select * from music where album contains ({stem:false}"paneer butter masala")' tracelevel=3 language=en-US outputs

                            {
                                "message": "sc0.num0 search to dispatch: query=[album:'pan butter masala'] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=2cf3dee6-68ad-4ecd-85b8-107639e3c133.1688644647056.56.default grouping=0 :  restrict=[music]"
                            },
                            {
                                "message": "Current state of query tree: SPHRASE[explicit=false index=\"album\" isFromQuery=true isFromUser=true locked=true rawWord=\"paneer butter masala\" stemmed=true uniqueID=1]{\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"pan\"\n  }\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"butter\"\n  }\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"masala\"\n  }\n}\n"
                            },

In this case, the {stem:false} annotation has no effect: "paneer" is stemmed to "pan". I think this is because the stem annotation cannot be used on a phrase:

vespa query 'select * from music where album contains ({stem:false}phrase("paneer", "butter", "masala"))' tracelevel=3 language=en-US

                            {
                                "message": "sc0.num0 search to dispatch: query=[album:\"pan butter masala\"] timeout=9997ms offset=0 hits=10 groupingSessionCache=true sessionId=2cf3dee6-68ad-4ecd-85b8-107639e3c133.1688644746395.58.default grouping=0 :  restrict=[music]"
                            },
                            {
                                "message": "Current state of query tree: PHRASE[explicit=true index=\"album\" uniqueID=1]{\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"pan\"\n  }\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"butter\"\n  }\n  WORD[fromSegmented=false index=\"album\" origin=null segmentIndex=0 stemmed=true words=true]{\n    \"masala\"\n  }\n}\n"
                            },

Per the documentation, phrase takes no annotations, which is consistent with the behavior above (assuming SPHRASE and PHRASE behave the same).

So we must either document more clearly that stemming cannot be disabled for an implicit phrase, or change the phrase operators to support {stem:false}.
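To compare the query forms without reading each trace by eye, here is a minimal Python sketch that pulls the leaf terms out of a "Current state of query tree" message and reports which input terms were rewritten. It assumes the trace text format shown above (each term alone on its own quoted line, after JSON decoding). The helper names `terms_in_query_tree` and `rewritten_terms` are hypothetical and not part of Vespa, and the sample trace is abbreviated from the PHRASE output above.

```python
import re

# Matches a line of the query-tree dump that holds only a quoted term, e.g. '    "pan"'.
TERM_LINE = re.compile(r'^[ \t]*"([^"]*)"[ \t]*$', re.MULTILINE)


def terms_in_query_tree(message: str) -> list[str]:
    """Return the leaf terms found in a 'Current state of query tree' message."""
    return TERM_LINE.findall(message)


def rewritten_terms(original: list[str], message: str) -> dict[str, str]:
    """Map each original term to its form in the query tree, where the two differ.

    Assumption: the tree holds one leaf per original term, in the same order.
    """
    seen = terms_in_query_tree(message)
    return {o: s for o, s in zip(original, seen) if o != s}


# Abbreviated copy of the PHRASE trace above (WORD attributes shortened).
phrase_trace = (
    'Current state of query tree: PHRASE[explicit=true index="album" uniqueID=1]{\n'
    '  WORD[index="album" stemmed=true]{\n    "pan"\n  }\n'
    '  WORD[index="album" stemmed=true]{\n    "butter"\n  }\n'
    '  WORD[index="album" stemmed=true]{\n    "masala"\n  }\n'
    '}\n'
)

print(rewritten_terms(["paneer", "butter", "masala"], phrase_trace))
# prints {'paneer': 'pan'}
```

An empty result would mean every term reached the query tree unchanged, which is what {stem:false} is expected to give.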

Vespa version 8.188.15

kkraune commented 1 year ago

For the record, queries with a bag of words (not intended as a phrase) can be written as

vespa query 'select * from music where {stem:false}userInput(@q)' q="paneer butter masala" tracelevel=3

and the stem-annotation works as intended:

                            {
                                "message": "sc0.num0 search to dispatch: query=[WEAKAND(100) default:paneer default:butter default:masala] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=2cf3dee6-68ad-4ecd-85b8-107639e3c133.1688645538004.71.default grouping=0 :  restrict=[music]"
                            },
                            {
                                "message": "Current state of query tree: WEAKAND[N=100]{\n  WORD[fromSegmented=false index=\"default\" origin=\"(0 6)\" segmentIndex=0 stemmed=true uniqueID=1 words=true]{\n    \"paneer\"\n  }\n  WORD[fromSegmented=false index=\"default\" origin=\"(7 13)\" segmentIndex=0 stemmed=true uniqueID=2 words=true]{\n    \"butter\"\n  }\n  WORD[fromSegmented=false index=\"default\" origin=\"(14 20)\" segmentIndex=0 stemmed=true uniqueID=3 words=true]{\n    \"masala\"\n  }\n}\n"                            },