Bm25 (and other lexical features) on missing/unspecified fields returns `0.0`, when it should return `null`

Plenitude-ai commented 2 days ago

Is your feature request related to a problem? Please describe. bm25() on a field with no value returns 0.0, the same score as a field whose value does not match the query. This is a problem for learning to rank: a missing value (NaN or null) does not necessarily mean that the field fails to match; usually it just has not been fed yet.

In my specific case, the issue arose with the field related_queries, a list of queries that could have been generated from the document (https://arxiv.org/abs/1904.08375). The field is therefore an array of strings. It is costly to compute, so only selected high-performing documents get this signal (e.g. those with a high PageRank first). However, we observed that bm25(related_queries) induces a strong overfitting bias, because a missing value cannot be told apart from an array of non-relevant queries.

Describe the solution you'd like BM25 should not return 0.0 if no computation was involved. It should return null, NaN, -1, or any other value that can later be interpreted as a marker for a missing value rather than a mismatch with the query.

The same behavior occurs for the following rank features:

elementCompleteness(related_queries).completeness
elementCompleteness(related_queries).fieldCompleteness
elementCompleteness(related_queries).queryCompleteness
elementCompleteness(related_queries).elementWeight
elementSimilarity(related_queries)

Describe alternatives you've considered I was able to work around the issue, but at a considerable cost in both implementation effort and efficiency:

expose the field as an attribute (memory overhead)
use tensorFromLabels to convert the array to a tensor (memory + cpu overhead)
declare a specific ranking expression that checks the tensor size and returns the desired value (cpu overhead)

Example for the news tutorial application:

schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index | attribute
            index: enable-bm25
            match: text
        }
    }
    rank-profile test_related_queries inherits default {
        function tensor_related_queries() {
            expression: tensorFromLabels(attribute(related_queries), rel_q)
        }
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if (reduce(tensor_related_queries(), count, rel_q) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
        summary-features {
            tensor_related_queries
            bm25_related_queries_with_nulls
        }
    }
}
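
A minimal Python sketch of how the -1 marker returned by bm25_related_queries_with_nulls could be consumed when building learning-to-rank training data. It maps the marker to NaN so that a missing field stays distinct from a genuine 0.0 mismatch. The names decode_feature and decode_row are hypothetical, and the sketch assumes that the feature never legitimately evaluates to -1.

```python
import math

# Marker chosen in the rank profile above for "field not present".
# Assumption: the real feature value is never exactly -1.
SENTINEL = -1.0


def decode_feature(value, sentinel=SENTINEL):
    """Return NaN for the missing-value marker, otherwise the value unchanged."""
    return math.nan if value == sentinel else value


def decode_row(summary_features, sentinel_features):
    """Decode only the features known to use the marker; leave the rest as-is."""
    return {
        name: decode_feature(value) if name in sentinel_features else value
        for name, value in summary_features.items()
    }


# Hypothetical summary-features payload for one hit.
row = decode_row(
    {"bm25_related_queries_with_nulls": -1.0, "bm25(title)": 0.0},
    sentinel_features={"bm25_related_queries_with_nulls"},
)
print(row)
```

Whether the downstream learner accepts NaN as a missing-value marker depends on the training library, so that part is left as an assumption.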


jobergum commented 1 day ago

Although we cannot change the feature's default value in Vespa 8, that could be considered for Vespa 9.

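A much cheaper workaround, which avoids both keeping all the queries in memory as an attribute and the cost of converting them to a tensor, is to store the number of related queries in a separate int attribute and test that instead: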

schema news {
    document news {
        field related_queries type array<string> {
            # No attribute here, so the queries are not kept in memory
            indexing: summary | index
            index: enable-bm25
            match: text
        }
        field number_related_queries type int {
            indexing: summary | attribute
        }
    }
    rank-profile test_related_queries inherits default {
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if (attribute(number_related_queries) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
    }
}
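
A minimal Python sketch of how number_related_queries could be filled in at feed time, so that the `== 0` test in the rank expression above has a value to match. The helper build_put, the `news` namespace, and the `title` field are hypothetical. The sketch writes the counter explicitly as 0 for documents without related queries, on the assumption that an unset int attribute should not be relied on to compare equal to 0.

```python
import json


def build_put(doc_id, fields, related_queries=None):
    """Build a Vespa JSON put operation with an explicit related-queries count."""
    queries = list(related_queries or [])
    doc_fields = dict(fields)
    if queries:
        doc_fields["related_queries"] = queries
    # Always write the counter, including 0, so the rank expression can test it.
    doc_fields["number_related_queries"] = len(queries)
    # Assumption: namespace and document type are both "news".
    return {"put": f"id:news:news::{doc_id}", "fields": doc_fields}


with_queries = build_put("doc1", {"title": "a"}, ["first query", "second query"])
without_queries = build_put("doc2", {"title": "b"})

print(json.dumps(with_queries))
print(json.dumps(without_queries))
```

Keeping the count next to the array in one helper makes it harder for the two fields to drift apart when related_queries is computed later and the document is updated.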

Plenitude-ai commented 1 day ago

Yes, that's a much better idea. Thank you! 💯

vespa-engine / vespa

Bm25 (and other lexical features) on missing/unspecified fields returns `0.0`, when it should return `null` #32905