Open Plenitude-ai opened 2 days ago
Although we cannot change the feature's default value in Vespa 8, that could be considered for Vespa 9.
A much cheaper workaround, which avoids both holding all the queries in memory as an attribute
and the cost of converting them to a tensor, is to store the number of related queries in a separate field:
schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index | attribute
            index: enable-bm25
            match: text
        }
        field number_related_queries type int {
            indexing: summary | attribute
        }
    }
    rank-profile test_related_queries inherits default {
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25(related_queries),
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if(attribute(number_related_queries) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
    }
}
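The rank expression above only works if number_related_queries is populated for every document at feed time. A minimal sketch of that feed-side step, assuming documents are prepared as Python dicts before being sent to Vespa; the helper name with_related_query_count and the sample documents are hypothetical, not part of the news tutorial:

```python
from typing import Any


def with_related_query_count(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the document fields with number_related_queries set.

    Hypothetical helper: a missing or empty related_queries list is counted
    as 0, so the rank profile can tell "not computed" apart from "no match".
    """
    related = fields.get("related_queries") or []
    updated = dict(fields)
    updated["number_related_queries"] = len(related)
    return updated


# Illustrative documents only; the field names follow the schema above.
docs = [
    {"title": "A", "related_queries": ["vespa bm25", "missing values"]},
    {"title": "B"},  # related_queries not computed yet
]
prepared = [with_related_query_count(d) for d in docs]
print([d["number_related_queries"] for d in prepared])  # [2, 0]
```

Writing the count explicitly as 0, rather than leaving the field unset, keeps the comparison in the rank expression independent of how an unset int attribute is evaluated.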
Yes, that's a way better idea. Thank you! 💯
Is your feature request related to a problem? Please describe.
bm25 on a field with no value returns 0.0, just as it does for a field whose values do not match. This is an issue in learning to rank, as a missing value (NaN or null) does not necessarily mean that the field does not match at all; usually it just did not have time to be fed yet.
In my specific case, the issue arose with the field related_queries, a list of queries that could have been generated by the document (https://arxiv.org/abs/1904.08375). This field is therefore an array of strings. It is costly to compute, meaning only selected high-performing documents get this signal (e.g. high PageRank first). However, we observed that bm25(related_queries) induces a huge overfitting bias because of the confusion between a missing value and an array of non-relevant queries.

Describe the solution you'd like
BM25 should not return 0.0 if there was no computation involved. It could return null, NaN, -1, or just any value which could later be understood as a marker for a missing value, not a mismatch with the query.

The same behavior happens for the following fields:
Describe alternatives you've considered
I was able to circumvent the issue, but at quite a cost, both in implementation and efficiency:
Example for the news tutorial application: