vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Calculate occurrence phrase in text. #31671

Closed tarashchuk closed 1 day ago

tarashchuk commented 1 week ago

I search with the query select * from search where Name contains 'cat'. How can I calculate the number of occurrences of the phrase in the text? For example, for a string like 'The mighty cat of catsville' it can be 2 and for 'Cats' it can be 1. I have tried both index and attribute for the field, and different rank features such as matchCount and fieldTermMatch(Name,0).occurrences. I also split the string into an array, but that did not help.

bratseth commented 1 week ago

It's not clear what you want to achieve. Could you amend this with a clearer example of what you want than "it can be 2 and for Cats it can be 1"?

tarashchuk commented 1 week ago

I want a ranking function that shows how many times the searched phrase occurs in the string. For 'The mighty cat of catsville' it should show 2, for example.

tarashchuk commented 1 week ago
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 3
        },
        "coverage": {
            "coverage": 100,
            "documents": 77,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:search:search::tQtmaccKYmHkaaCy",
                "relevance": 1.0,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCy",
                    "name_array": [
                        "Cats"
                    ],
                    "Id": "tQtmaccKYmHkaaCy",
                    "Name": "Cats"
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCg",
                "relevance": 0.8945312499999999,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCg",
                    "name_array": [
                        "Cats",
                        "lover"
                    ],
                    "Id": "tQtmaccKYmHkaaCg",
                    "Name": "Cats lover"
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCh",
                "relevance": 0.7525,
                "source": "clan",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCh",
                    "name_array": [
                        "The",
                        "mighty",
                        "cat",
                        "of",
                        "catsville"
                    ],
                    "Id": "tQtmaccKYmHkaaCh",
                    "Name": "The mighty cat of catsville"
                }
            }
        ]
    }
}

Example documents

bratseth commented 1 week ago

Thanks. Regular text matching is on the token (word) level, so while there are features that give you the number of matched occurrences, "catsville" will not be matched by "cat".

You can switch to matching substrings by changing the field definition to use gram matching.
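
The difference between token matching and gram matching can be sketched in a few lines of Python. This is only an illustration, not Vespa's implementation: it assumes lowercasing, whitespace tokenization and a gram size of 2, and the helper names (word_tokens, grams, token_occurrences, gram_occurrences) are hypothetical.

```python
def word_tokens(text):
    """Lowercased, whitespace-separated words (a simplification of real tokenization)."""
    return text.lower().split()


def grams(word, size=2):
    """All substrings of `size` characters in a word; shorter words are kept whole."""
    if len(word) <= size:
        return [word]
    return [word[i:i + size] for i in range(len(word) - size + 1)]


def token_occurrences(query, text):
    """Token-level matching: the query must equal a whole word."""
    return word_tokens(text).count(query.lower())


def gram_occurrences(query, text, size=2):
    """Gram-level matching: count runs of consecutive grams equal to the query's grams."""
    q = grams(query.lower(), size)
    count = 0
    for word in word_tokens(text):
        g = grams(word, size)
        for i in range(len(g) - len(q) + 1):
            if g[i:i + len(q)] == q:
                count += 1
    return count


for name in ["Cats", "Macatos", "The mighty cat of catsville"]:
    print(name, token_occurrences("cat", name), gram_occurrences("cat", name))
# Cats 0 1
# Macatos 0 1
# The mighty cat of catsville 1 2
```

With token matching only the standalone word "cat" counts, while with grams the query also matches inside "Cats", "Macatos" and "catsville".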

tarashchuk commented 1 week ago

Thanks, can you please give an example of how to calculate occurrences? And which function can I use for this in ranking?

bratseth commented 1 week ago

I recommend using the nativeRank feature here - it takes into account the number of matched occurrences as well as proximity, which is important when using gram matching.

To be clear it won't give you the exact numbers you shared below, but should give you the same rank order.

tarashchuk commented 1 week ago

Thanks, can you please share an example? And how do I get documents when a field value is something like 'Macatos'?

bratseth commented 1 week ago

Just add "match: gram" to the relevant fields in the schema, and use e.g.

first-phase {
    expression: nativeRank
}

as the ranking expression. No other changes are needed, but if you already have data indexed with token matching you need to reindex (this happens automatically on cloud; if you are hosting yourself you need to trigger it, or just rewrite the data).

tarashchuk commented 1 week ago

This does not work well; the ranking is not the same.

field Name type string {
    indexing: summary | index
    match: gram
}
rank-profile searchByName {
    first-phase {
        expression: nativeRank(Name)
    }
    summary-features {
        fieldTermMatch(Name,0).occurrences
        queryTermCount
        fieldMatch(Name).segments
        fieldMatch(Name).matches
        fieldMatch(Name).segmentDistance
        fieldMatch(Name).gaps
        textSimilarity(Name)
        elementSimilarity(Name)
        matchCount(Name)
        fieldLength(Name)
        fieldMatch(Name).absoluteOccurrence
    }
}
{
    "hits": 125,
    "offset": 0,
    "ranking": {
        "profile": "searchByName"
    },
    // "trace":{
    //     "level" :"2"
    // },
    "yql": "select * from search where Name contains ('cat')"
}
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 4
        },
        "coverage": {
            "coverage": 100,
            "documents": 77,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:search:search::tQtmaccKYmHkaaCy",
                "relevance": 0.3343789820215161,
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCy",
                    "Id": "tQtmaccKYmHkaaCy",
                    "Name": "Cats",
                    "summaryfeatures": {
                        "elementSimilarity(Name)": 0.9333333333333333,
                        "fieldLength(Name)": 3.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.9333333333333333,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCg",
                "relevance": 0.3320460953155481,
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCg",
                    "Id": "tQtmaccKYmHkaaCg",
                    "Name": "Cats lover",
                    "summaryfeatures": {
                        "elementSimilarity(Name)": 0.8571428571428572,
                        "fieldLength(Name)": 7.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8571428571428572,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCh",
                "relevance": 0.24596591494676392,
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCh",
                    "Id": "tQtmaccKYmHkaaCh",
                    "Name": "The mighty cat of catsville",
                    "summaryfeatures": {
                        "elementSimilarity(Name)": 0.8222222222222223,
                        "fieldLength(Name)": 18.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.02,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 2.0,
                        "matchCount(Name)": 2.0,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8222222222222223,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCo",
                "relevance": 0.221442976434951,
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCo",
                    "Id": "tQtmaccKYmHkaaCo",
                    "Name": "Macatos",
                    "summaryfeatures": {
                        "elementSimilarity(Name)": 0.8666666666666667,
                        "fieldLength(Name)": 6.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8666666666666667,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            }
        ]
    }
}
bratseth commented 1 week ago

Ok, fair enough. This is because the nativeRank formula takes multiple things into consideration, not just the number of occurrences, but also the length of the field, how early the occurrences are etc.

If you want the occurrences to matter more you can configure a linear boost per occurrence (occurrenceCountTable), and/or decrease the first occurrence importance (firstOccurrenceImportance). See https://docs.vespa.ai/en/reference/nativerank.html

Alternatively, if you don't want any traces of good relevance but just order by occurrences, you can use fieldMatch(Name).occurrence instead of nativeRank.

tarashchuk commented 1 week ago

fieldMatch(Name).occurrence gives me a very weird result.

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 4
        },
        "coverage": {
            "coverage": 100,
            "documents": 77,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:search:search::tQtmaccKYmHkaaCy",
                "relevance": 0.9333333333333333,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCy",
                    "Id": "tQtmaccKYmHkaaCy",
                    "Name": "Cats",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.9333333333333333,
                        "fieldLength(Name)": 3.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).occurrence": 0.6666666666666666,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "nativeRank(Name)": 0.3343789820215161,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.9333333333333333,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCo",
                "relevance": 0.8666666666666667,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCo",
                    "Id": "tQtmaccKYmHkaaCo",
                    "Name": "Macatos",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.8666666666666667,
                        "fieldLength(Name)": 6.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).occurrence": 0.3333333333333333,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "nativeRank(Name)": 0.221442976434951,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8666666666666667,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCg",
                "relevance": 0.8571428571428572,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::tQtmaccKYmHkaaCg",
                    "Id": "tQtmaccKYmHkaaCg",
                    "Name": "Cats lover",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.8571428571428572,
                        "fieldLength(Name)": 7.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).occurrence": 0.2857142857142857,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 2.0,
                        "nativeRank(Name)": 0.3320460953155481,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8571428571428572,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "id:search:search::tQtmaccKYmHkaaCh",
                "relevance": 0.8222222222222223,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:cjsearch:cjsearch::tQtmaccKYmHkaaCh",
                    "Id": "tQtmaccKYmHkaaCh",
                    "Name": "The mighty cat of catsville",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.8222222222222223,
                        "fieldLength(Name)": 18.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.02,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 2.0,
                        "fieldMatch(Name).occurrence": 0.2222222222222222,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 2.0,
                        "matchCount(Name)": 2.0,
                        "nativeRank(Name)": 0.24596591494676392,
                        "queryTermCount": 2.0,
                        "textSimilarity(Name)": 0.8222222222222223,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            }
        ]
    }
}
bratseth commented 1 week ago

Occurrence factors in the length of the field.

You can use fieldMatch(Name).absoluteOccurrence, which is 1/100 of the number of occurrences.
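
As a rough illustration of the 1/100 scaling, the sketch below turns the absoluteOccurrence values from the response above back into occurrence counts and orders the hits by them. The hit list is copied by hand from this thread; the helper name estimated_occurrences and the dictionary layout are hypothetical, and the factor of 100 is taken from the statement above rather than from Vespa's source.

```python
# absoluteOccurrence values as reported in the summaryfeatures above.
hits = [
    {"Name": "Cats", "absoluteOccurrence": 0.01},
    {"Name": "Macatos", "absoluteOccurrence": 0.01},
    {"Name": "Cats lover", "absoluteOccurrence": 0.01},
    {"Name": "The mighty cat of catsville", "absoluteOccurrence": 0.02},
]


def estimated_occurrences(absolute_occurrence, scale=100):
    """Undo the 1/100 scaling; round to absorb floating point noise."""
    return round(absolute_occurrence * scale)


# sorted() is stable, so hits with equal counts keep their original order.
ranked = sorted(
    hits,
    key=lambda hit: estimated_occurrences(hit["absoluteOccurrence"]),
    reverse=True,
)

for hit in ranked:
    print(estimated_occurrences(hit["absoluteOccurrence"]), hit["Name"])
# 2 The mighty cat of catsville
# 1 Cats
# 1 Macatos
# 1 Cats lover
```

The recovered counts (2 for 'The mighty cat of catsville', 1 for the others) agree with fieldTermMatch(Name,0).occurrences in the same response.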

tarashchuk commented 1 week ago
            {
                "id": "id:search:search::mAhEfYwbHmDvsxF",
                "relevance": 0.45,
                "source": "search",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::mAhEfYwbHmDvsxF",
                    "Id": "mAhEfYwbHmDvsxF",
                    "Name": "fewCats",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.9,
                        "fieldLength(Name)": 6.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 3.0,
                        "fieldMatch(Name).occurrence": 0.5,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 3.0,
                        "nativeRank(Name)": 0.21789950651775084,
                        "queryTermCount": 3.0,
                        "textSimilarity(Name)": 0.9,
                        "vespa.summaryFeatures.cached": 0.0
                    },
                    "Members": [
                        743656
                    ],
                    "PointsThreshold": 0,
                    "Language": "English",
                    "Country": "UA",
                    "TotalPoints": 23503,
                    "EstablishedTime": 1707987266,
                    "Type": 1
                }
            },
            {
                "id": "id:search:search::mAhEfYwbHmDvsxa",
                "relevance": 0.45,
                "source": "clan",
                "fields": {
                    "sddocname": "search",
                    "documentid": "id:search:search::mAhEfYwbHmDvsxa",
                    "Id": "mAhEfYwbHmDvsxa",
                    "Name": "Test Cats",
                    "summaryfeatures": {
                        "bm25(Name)": 0.0,
                        "elementSimilarity(Name)": 0.9,
                        "fieldLength(Name)": 6.0,
                        "fieldMatch(Name).absoluteOccurrence": 0.01,
                        "fieldMatch(Name).gaps": 0.0,
                        "fieldMatch(Name).matches": 3.0,
                        "fieldMatch(Name).occurrence": 0.5,
                        "fieldMatch(Name).segmentDistance": 0.0,
                        "fieldMatch(Name).segments": 1.0,
                        "fieldTermMatch(Name,0).occurrences": 1.0,
                        "matchCount(Name)": 3.0,
                        "nativeRank(Name)": 0.21789950651775084,
                        "queryTermCount": 3.0,
                        "textSimilarity(Name)": 0.9,
                        "vespa.summaryFeatures.cached": 0.0
                    },
                    "Members": [
                        743656
                    ],
                    "PointsThreshold": 0,
                    "Language": "English",
                    "Country": "UA",
                    "TotalPoints": 23503,
                    "EstablishedTime": 1707987266,
                    "Type": 1
                }
            },

I don't understand why both of these documents have the same "elementSimilarity(Name)": 0.9 and the same "fieldLength(Name)": 6.0. Also, is there a way to calculate the number of characters in a string field? Request:

{
    "hits": 125,
    "offset": 0,
    "ranking": {
        "profile": "searchByName"
    },
    // "trace":{
    //     "level" :"2"
    // },
    "yql": "select * from cjsearch where  Name contains 'cats'"
}
bratseth commented 1 week ago

fieldLength here is in number of tokens. Since this is n-gram matching there are more tokens than words. The same goes for elementSimilarity, I think; if you don't see why after reading the doc, please point out exactly what you would expect.
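
The fieldLength values reported in this thread can be reproduced with a small calculation, assuming grams of size 2 generated per whitespace-separated word. That gram size is an assumption inferred from the numbers above, and gram_count is a hypothetical helper, not a Vespa function.

```python
def gram_count(text, size=2):
    """Number of grams of `size` characters, counted per whitespace-separated word."""
    total = 0
    for word in text.lower().split():
        # A word of n characters has n - size + 1 grams; shorter words count as one.
        total += max(1, len(word) - size + 1)
    return total


# fieldLength(Name) values as reported in the responses in this thread.
reported = {
    "Cats": 3,
    "Cats lover": 7,
    "Macatos": 6,
    "The mighty cat of catsville": 18,
    "fewCats": 6,
    "Test Cats": 6,
}

for name, field_length in reported.items():
    print(name, gram_count(name), field_length)

# The same counting gives queryTermCount: 2 grams for 'cat', 3 for 'cats'.
print(gram_count("cat"), gram_count("cats"))
```

Under this assumption "fewCats" and "Test Cats" both come out as 6 gram tokens, which is consistent with the identical fieldLength(Name) in the two hits.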

There is no rank feature for the length of a field in characters, but you can create a GitHub issue for it.

jobergum commented 1 week ago

It may also help to start with the text search tutorial, which has a section on debugging how tokenization and matching work.

tarashchuk commented 1 week ago

Do I need to create a GitHub issue for the length of a field in characters, or will it be resolved in this issue? It would also be great to have the total character length of all terms, or something like that.

bratseth commented 1 week ago

Please create a separate one.

tarashchuk commented 1 week ago

@bratseth https://github.com/vespa-engine/vespa/issues/31733 Thanks!

tarashchuk commented 1 day ago

Any updates? When is this likely to be added?

jobergum commented 1 day ago

I'm closing this; it is more of a support request than anything else. If you would like progress, feel free to add an example of exactly what you want us to implement in https://github.com/vespa-engine/vespa/issues/31733.