Closed tarashchuk closed 1 day ago
It's not clear what you want to achieve. Could you amend this with a clearer example of what you want than "it can be 2 and for Cats it can be 1."?
I want a ranking function that shows me how many times the searched phrase occurs in a string. For example, 'The mighty cat of catsville' should give 2.
{
"root": {
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 3
},
"coverage": {
"coverage": 100,
"documents": 77,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:search:search::tQtmaccKYmHkaaCy",
"relevance": 1.0,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCy",
"name_array": [
"Cats"
],
"Id": "tQtmaccKYmHkaaCy",
"Name": "Cats"
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCg",
"relevance": 0.8945312499999999,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCg",
"name_array": [
"Cats",
"lover"
],
"Id": "tQtmaccKYmHkaaCg",
"Name": "Cats lover"
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCh",
"relevance": 0.7525,
"source": "clan",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCh",
"name_array": [
"The",
"mighty",
"cat",
"of",
"catsville"
],
"Id": "tQtmaccKYmHkaaCh",
"Name": "The mighty cat of catsville"
}
}
]
}
}
Example of documents
Thanks. Regular text matching is on the token (word) level, so while there are features that give you the number of matched occurrences, "catsville" will not be matched by "cat".
You can switch to matching substrings by changing the field definition to use gram matching.
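To illustrate the difference between token-level and substring matching, here is a minimal Python sketch. It is not how Vespa is implemented: the helper names `word_tokens`, `token_matches` and `substring_matches` are hypothetical, the tokenizer is a simple regex, and stemming is ignored. It only shows why "cat" is counted once at the word level but twice as a substring in the example string.

```python
import re

def word_tokens(text: str) -> list[str]:
    """Lowercased word tokens; a simplified stand-in for a real tokenizer (no stemming)."""
    return re.findall(r"\w+", text.lower())

def token_matches(text: str, term: str) -> int:
    """Count tokens that are exactly equal to the term (word-level matching)."""
    return word_tokens(text).count(term.lower())

def substring_matches(text: str, term: str) -> int:
    """Count non-overlapping substring occurrences of the term in the text."""
    return text.lower().count(term.lower())

name = "The mighty cat of catsville"
print(token_matches(name, "cat"))          # 1: only the word "cat" matches
print(substring_matches(name, "cat"))      # 2: "cat" and the start of "catsville"
print(substring_matches("Macatos", "cat")) # 1: match in the middle of a word
```

The substring count is the number the question asks for; gram matching is the schema-level way to get matches inside words such as "catsville" or "Macatos".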
Thanks, can you please give an example of how to calculate occurrences? And which function can I use for this in ranking?
I recommend using the nativeRank feature here. It takes into account the number of matched occurrences as well as proximity, which is important when using gram matching.
To be clear, it won't give you the exact numbers you shared below, but it should give you the same rank order.
Thanks, can you please share an example? And how do we retrieve documents when the Name field has a value like 'Macatos'?
Just add "match: gram" to the relevant fields in the schema, and use e.g.
first-phase {
expression: nativeRank
}
as the ranking expression. No other changes are needed, but if you already have data indexed with token matching you need to reindex (this happens automatically on cloud; if you are hosting yourself you need to trigger it, or just rewrite the data).
This is not working well; the ranking is not the same.
field Name type string {
indexing: summary | index
match: gram
}
rank-profile searchByName {
first-phase {
expression: nativeRank(Name)
}
summary-features {
fieldTermMatch(Name,0).occurrences
queryTermCount
fieldMatch(Name).segments
fieldMatch(Name).matches
fieldMatch(Name).segmentDistance
fieldMatch(Name).gaps
textSimilarity(Name)
elementSimilarity(Name)
matchCount(Name)
fieldLength(Name)
fieldMatch(Name).absoluteOccurrence
}
}
{
"hits": 125,
"offset": 0,
"ranking": {
"profile": "searchByName"
},
// "trace":{
// "level" :"2"
// },
"yql": "select * from search where Name contains ('cat')"
}
{
"root": {
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 4
},
"coverage": {
"coverage": 100,
"documents": 77,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:search:search::tQtmaccKYmHkaaCy",
"relevance": 0.3343789820215161,
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCy",
"Id": "tQtmaccKYmHkaaCy",
"Name": "Cats",
"summaryfeatures": {
"elementSimilarity(Name)": 0.9333333333333333,
"fieldLength(Name)": 3.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.9333333333333333,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCg",
"relevance": 0.3320460953155481,
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCg",
"Id": "tQtmaccKYmHkaaCg",
"Name": "Cats lover",
"summaryfeatures": {
"elementSimilarity(Name)": 0.8571428571428572,
"fieldLength(Name)": 7.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8571428571428572,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCh",
"relevance": 0.24596591494676392,
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCh",
"Id": "tQtmaccKYmHkaaCh",
"Name": "The mighty cat of catsville",
"summaryfeatures": {
"elementSimilarity(Name)": 0.8222222222222223,
"fieldLength(Name)": 18.0,
"fieldMatch(Name).absoluteOccurrence": 0.02,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 2.0,
"matchCount(Name)": 2.0,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8222222222222223,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCo",
"relevance": 0.221442976434951,
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCo",
"Id": "tQtmaccKYmHkaaCo",
"Name": "Macatos",
"summaryfeatures": {
"elementSimilarity(Name)": 0.8666666666666667,
"fieldLength(Name)": 6.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8666666666666667,
"vespa.summaryFeatures.cached": 0.0
}
}
}
]
}
}
Ok, fair enough. This is because the nativeRank formula takes multiple things into consideration, not just the number of occurrences, but also the length of the field, how early the occurrences are etc.
If you want the occurrences to matter more you can configure a linear boost per occurrence (occurrenceCountTable), and/or decrease the first occurrence importance (firstOccurrenceImportance). See https://docs.vespa.ai/en/reference/nativerank.html
Alternatively, if you don't want any traces of good relevance but just order by occurrences, you can use fieldMatch(Name).occurrence instead of nativeRank.
fieldMatch(Name).occurrence gives me very weird results.
{
"root": {
"id": "toplevel",
"relevance": 1.0,
"fields": {
"totalCount": 4
},
"coverage": {
"coverage": 100,
"documents": 77,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:search:search::tQtmaccKYmHkaaCy",
"relevance": 0.9333333333333333,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCy",
"Id": "tQtmaccKYmHkaaCy",
"Name": "Cats",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.9333333333333333,
"fieldLength(Name)": 3.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).occurrence": 0.6666666666666666,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"nativeRank(Name)": 0.3343789820215161,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.9333333333333333,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCo",
"relevance": 0.8666666666666667,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCo",
"Id": "tQtmaccKYmHkaaCo",
"Name": "Macatos",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.8666666666666667,
"fieldLength(Name)": 6.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).occurrence": 0.3333333333333333,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"nativeRank(Name)": 0.221442976434951,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8666666666666667,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCg",
"relevance": 0.8571428571428572,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::tQtmaccKYmHkaaCg",
"Id": "tQtmaccKYmHkaaCg",
"Name": "Cats lover",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.8571428571428572,
"fieldLength(Name)": 7.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).occurrence": 0.2857142857142857,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 2.0,
"nativeRank(Name)": 0.3320460953155481,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8571428571428572,
"vespa.summaryFeatures.cached": 0.0
}
}
},
{
"id": "id:search:search::tQtmaccKYmHkaaCh",
"relevance": 0.8222222222222223,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:cjsearch:cjsearch::tQtmaccKYmHkaaCh",
"Id": "tQtmaccKYmHkaaCh",
"Name": "The mighty cat of catsville",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.8222222222222223,
"fieldLength(Name)": 18.0,
"fieldMatch(Name).absoluteOccurrence": 0.02,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 2.0,
"fieldMatch(Name).occurrence": 0.2222222222222222,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 2.0,
"matchCount(Name)": 2.0,
"nativeRank(Name)": 0.24596591494676392,
"queryTermCount": 2.0,
"textSimilarity(Name)": 0.8222222222222223,
"vespa.summaryFeatures.cached": 0.0
}
}
}
]
}
}
Occurrence factors in the length of the field.
You can use fieldMatch(Name).absoluteOccurrence, which is 1/100 of the number of occurrences.
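As a worked check of the two statements above, the following Python sketch reproduces the numbers from the response in this thread. The function names are hypothetical, and the ratio used for `occurrence` (matched gram occurrences divided by the field length in grams) is only inferred from the values shown here; it is an assumption, not Vespa's documented formula.

```python
def occurrences_from_absolute(absolute_occurrence: float) -> int:
    """absoluteOccurrence is 1/100 of the occurrence count, so multiply back by 100."""
    return round(absolute_occurrence * 100)

def approx_occurrence(per_term_occurrences: list[int], field_length: int) -> float:
    """Assumed ratio that fits the thread's numbers: matched grams / field length."""
    return sum(per_term_occurrences) / field_length

# Query 'cat' becomes two grams ("ca", "at"); each list holds their counts in the field.
print(occurrences_from_absolute(0.02))   # 2 for "The mighty cat of catsville"
print(occurrences_from_absolute(0.01))   # 1 for "Cats"
print(approx_occurrence([1, 1], 3))      # "Cats": about 0.667
print(approx_occurrence([1, 1], 6))      # "Macatos": about 0.333
print(approx_occurrence([2, 2], 18))     # "The mighty cat of catsville": about 0.222
```

This is why the short field "Cats" scores highest on `occurrence` even though the long field contains the phrase twice, while `absoluteOccurrence` orders purely by the count.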
{
"id": "id:search:search::mAhEfYwbHmDvsxF",
"relevance": 0.45,
"source": "search",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::mAhEfYwbHmDvsxF",
"Id": "mAhEfYwbHmDvsxF",
"Name": "fewCats",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.9,
"fieldLength(Name)": 6.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 3.0,
"fieldMatch(Name).occurrence": 0.5,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 3.0,
"nativeRank(Name)": 0.21789950651775084,
"queryTermCount": 3.0,
"textSimilarity(Name)": 0.9,
"vespa.summaryFeatures.cached": 0.0
},
"Members": [
743656
],
"PointsThreshold": 0,
"Language": "English",
"Country": "UA",
"TotalPoints": 23503,
"EstablishedTime": 1707987266,
"Type": 1
}
},
{
"id": "id:search:search::mAhEfYwbHmDvsxa",
"relevance": 0.45,
"source": "clan",
"fields": {
"sddocname": "search",
"documentid": "id:search:search::mAhEfYwbHmDvsxa",
"Id": "mAhEfYwbHmDvsxa",
"Name": "Test Cats",
"summaryfeatures": {
"bm25(Name)": 0.0,
"elementSimilarity(Name)": 0.9,
"fieldLength(Name)": 6.0,
"fieldMatch(Name).absoluteOccurrence": 0.01,
"fieldMatch(Name).gaps": 0.0,
"fieldMatch(Name).matches": 3.0,
"fieldMatch(Name).occurrence": 0.5,
"fieldMatch(Name).segmentDistance": 0.0,
"fieldMatch(Name).segments": 1.0,
"fieldTermMatch(Name,0).occurrences": 1.0,
"matchCount(Name)": 3.0,
"nativeRank(Name)": 0.21789950651775084,
"queryTermCount": 3.0,
"textSimilarity(Name)": 0.9,
"vespa.summaryFeatures.cached": 0.0
},
"Members": [
743656
],
"PointsThreshold": 0,
"Language": "English",
"Country": "UA",
"TotalPoints": 23503,
"EstablishedTime": 1707987266,
"Type": 1
}
},
I don't understand why these two documents have the same "elementSimilarity(Name)": 0.9 and the same "fieldLength(Name)": 6.0. Also, is there a way to calculate the number of characters in a string field? Request:
{
"hits": 125,
"offset": 0,
"ranking": {
"profile": "searchByName"
},
// "trace":{
// "level" :"2"
// },
"yql": "select * from cjsearch where Name contains 'cats'"
}
fieldLength here is in number of tokens. Since this is n-gram matching, there are more tokens than words. The same goes for elementSimilarity, I think. If you don't see why after reading the doc, please point out exactly what you would expect.
There is no rank feature for the length of a field in characters, but you can create a GitHub issue for it.
It may also help to start with the text search tutorial, which has a section on debugging how tokenization and matching work.
Do I need to create a separate GitHub issue for the length of a field in characters, or will it be resolved in this issue? It would also be great to have the total length in characters of all terms, or something like that.
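To make the token counts concrete, here is a rough Python sketch that splits each word into grams of size 2. The gram size and the per-word splitting are assumptions, chosen because they reproduce the fieldLength and queryTermCount values reported in this thread; the function name `grams` is hypothetical and this is not Vespa's tokenizer.

```python
import re

def grams(text: str, n: int = 2) -> list[str]:
    """Split each lowercased word into overlapping n-grams (assumed gram size 2)."""
    out: list[str] = []
    for word in re.findall(r"\w+", text.lower()):
        if len(word) <= n:
            out.append(word)  # words no longer than n are kept whole
        else:
            out.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return out

for name in ["Cats", "Cats lover", "Macatos", "fewCats", "Test Cats",
             "The mighty cat of catsville"]:
    print(name, len(grams(name)))  # 3, 7, 6, 6, 6, 18 as in the fieldLength values

print(grams("cats"))       # ['ca', 'at', 'ts'] -> queryTermCount 3
print(grams("fewCats"))    # ['fe', 'ew', 'wc', 'ca', 'at', 'ts']
print(grams("Test Cats"))  # ['te', 'es', 'st', 'ca', 'at', 'ts']
```

Under these assumptions both "fewCats" and "Test Cats" have six grams, with the three query grams in the same last three positions, which would explain why their fieldLength and similarity features come out identical.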
Please create a separate one.
@bratseth https://github.com/vespa-engine/vespa/issues/31733 Thanks!
Any updates? When is this likely to be added?
I'm closing this; it is more of a support request than anything else. If you would like progress, feel free to add an example of exactly what you want us to implement in https://github.com/vespa-engine/vespa/issues/31733.
I use this request for searching, but it does not help:
select * from search where Name contains 'cat'
How can I count the occurrences of a phrase in a text? For example, for a string like 'The mighty cat of catsville' it should be 2, and for 'Cats' it should be 1. I have tried index and attribute for the field, and different functions such as matchCount and fieldTermMatch(Name,0).occurrences. I also tried splitting the string into an array.