lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.64k stars 133 forks

[Feature Request] Term boosting #268

Closed jasonpolites closed 1 month ago

jasonpolites commented 1 month ago

Hi,

First off.. let me say this library is amazing. I spent some time trying to find a good client-side index, and MiniSearch is by far the best I found.

Now that I've buttered you up...

I am implementing a basic document similarity mechanism that just takes a field from an existing document and re-issues a search using that field's value.

For example:

Imagine a document with the following:

{
  "description" : "foo bar baz bazooka"
}

Finding similar items would mean issuing a search like this:

const options = {
  fields: ['description'],
  combineWith: 'OR'
};

minisearch.search("foo bar baz bazooka", options);

This works, but in my case, the first few terms are more meaningful than the last few. If you imagine the corpus has two other documents:

[{
  "description" : "foo bar bosh"
},
{
  "description" : "baz bazooka bar"
}]

The second item would (likely) get a higher score from MiniSearch. In reality (in my case) the first document should be higher.

So.. what I really want is:

minisearch.search("foo^2 bar baz bazooka", options);

The foo^2 applies a boost to that term, similar to term boosting in Lucene.

lucaong commented 1 month ago

Hi @jasonpolites , thanks for your kind words!

I am happy to say that it is already possible to apply term boosting with MiniSearch, although admittedly it is not too clear from the documentation. One way is to use the boostDocument search option:

const searchOptions = {
  fields: ['description'],
  combineWith: 'OR',
  // Boost documents that contain the term 'foo'
  boostDocument: (docId, term) => (term === 'foo') ? 2 : 1
}

miniSearch.search('foo bar baz bazooka', searchOptions)

Finally, I do agree that it would be nice to have an easier way to specify term boosting. While it won't be implemented as a string query syntax like in Lucene (for reasons outlined in this past discussion), it is still possible to add a search option. I will brainstorm and hopefully come up with a nice implementation for this feature.

jasonpolites commented 1 month ago

Thanks for the reply. My example was obviously not a good one, but the premise remains that a field with more matching terms will rank more highly (which generally makes complete sense). It may not be universal across languages, or even across use cases, but in my use case the position of the term in the string carries importance. So the first token is more important than the last one.

How would I implement this in the sample you provided? The callback for the boostDocument property takes a docId and a term, but is there a way to know the index of the term, and how many terms there are? (e.g. (docId, term, termIndex, termCount))

In my example, "foo" is only more important because it's the first term, not because it's specifically "foo". What I want to do is boost the first few terms in the search string. E.g.

first^3 second^2 third fourth
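
For what it's worth, the positional factors described above can be derived automatically. The following is a standalone sketch (the helper `positionalBoosts` is hypothetical, not part of MiniSearch), using an arbitrary linear decay from a maximum boost down to 1:

```javascript
// Hypothetical helper: derive positional boost factors so that earlier
// query terms count more, decaying linearly from maxBoost down to 1.
const positionalBoosts = (terms, maxBoost = 3) =>
  terms.map((_, i) => Math.max(1, maxBoost - i))

console.log(positionalBoosts(['first', 'second', 'third', 'fourth']))
// [ 3, 2, 1, 1 ]
```

The resulting array can then be passed wherever a list of per-position boost factors is expected.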

lucaong commented 1 month ago

I had originally misunderstood your question, so I edited my answer a bit.

Boosting the first terms is not trivial, but feasible with something like this:

// Assuming you use the default tokenizer
const tokenize = MiniSearch.getDefault('tokenize')

// Function to search applying a boost to the first query terms
const searchWithBoost = (query, boostFactors = [], searchOptions = {}) => {
  const queryTerms = tokenize(query)

  const boosts = queryTerms.reduce((boosts, term, i) => {
    boosts[term] = boostFactors[i] || 1
    return boosts
  }, {})

  const searchOptionsWithBoost = {
    ...searchOptions,
    boostDocument: (docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost)
}

// Usage, boosting the first term by a factor of 3 and the second by a factor of 2:
searchWithBoost('foo bar baz bazooka', [3, 2])

Admittedly this is not so simple, I will think about ways to improve this use case.

jasonpolites commented 1 month ago

Actually this is quite simple, and much better than what I have now as I am using some ugly regex to tokenize the search query, and this is a much more elegant approach.

I am not sure how common "term boosting" is as a requirement, and while a foo^2 style syntax is arguably the simplest (for users of your library), I think just adding a "Term Boosting" section to your docs with some sample code like this would be a quick fix for anyone else looking to do term boosting, ahead of a more "native" implementation.

I will try this out tonight. Thanks!

jasonpolites commented 1 month ago

OK.. I had to revise your example a bit, but got it working and it seems to do what I need.

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
});

...was a bit of a head-scratcher for me, as it didn't match the common signature of the reduce method

I rewrote it a little, and I think the outcome is the same:

const tokenize = MiniSearch.getDefault('tokenize');

const searchWithBoost = (query, boostFactors = [], options = {}) => {

  const queryTerms = tokenize(query);

  const boosts = {};
  let boostIndex = 0;

  // Assign boost factors positionally to the first distinct, non-empty
  // terms, lower-cased to match the terms passed to boostDocument
  queryTerms.forEach((term) => {
    if (boostIndex < boostFactors.length && term && term.length > 0 && !boosts[term.toLowerCase()]) {
      boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
    }
  });

  const searchOptionsWithBoost = {
    ...options,
    boostDocument: (_docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost);
}

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

In my actual implementation I also skip any string tokens which "look like" numbers as they end up not being meaningful, but I left that out of my sample to avoid any confusion.

Thanks again! (feel free to close this issue, unless you want to keep it open as a reminder to look at a native API for this)

Edit: I'm seeing some unexpected results (results that I think should be ranked more highly), but I need to debug to see if this is a problem with the approach, or just something amiss with my specific implementation

Edit 2 🤦‍♂️ Case sensitivity (lots of my query strings are all upper case.. like I said, it's a mess). Updated the original to match

boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
lucaong commented 1 month ago

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

@jasonpolites you are right, sorry... I wrote that code in GitHub without testing and got the argument order wrong. I also missed a return. Dangers of not checking code 😬 Here's the correction:

// WRONG:
const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
})

// CORRECT:
const boosts = queryTerms.reduce((boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
  return boosts
}, {})

Basically, I am reducing the array of queryTerms into a map of term to boost factor. If we start from queryTerms = ['foo', 'bar', 'baz'] and boostFactors = [3, 2] we should end up with { foo: 3, bar: 2, baz: 1 }. I will go ahead and edit my code example above to avoid confusing other users reading this thread.
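
Run standalone with those inputs (stand-ins for the tokenized query), the reduce produces exactly that map:

```javascript
// Fold the query terms into a term → boost-factor map, defaulting
// to a factor of 1 for terms beyond the given boostFactors.
const queryTerms = ['foo', 'bar', 'baz']
const boostFactors = [3, 2]

const boosts = queryTerms.reduce((boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
  return boosts
}, {})

console.log(boosts) // { foo: 3, bar: 2, baz: 1 }
```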

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

You actually spotted a bug here, thanks :) A recently merged pull request missed a + in the regular expression used by the tokenizer. This is being fixed, and a new patch release will be published later today.
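
For illustration, the effect of a missing + can be reproduced with a plain String.prototype.split (a simplified stand-in; the actual tokenizer regex also covers punctuation and Unicode space characters):

```javascript
// Without `+`, the regex splits at every single separator character,
// yielding an empty string between each pair of contiguous separators:
console.log('foo   bar'.split(/\s/))  // [ 'foo', '', '', 'bar' ]

// With `+`, a run of separators collapses into a single split point:
console.log('foo   bar'.split(/\s+/)) // [ 'foo', 'bar' ]
```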

lucaong commented 1 month ago

Version v7.0.2 is now released on NPM, fixing the regression causing the tokenizer to produce empty terms when multiple contiguous spaces or punctuation characters are present in the input.

lucaong commented 1 month ago

I opened a pull request to introduce term boosting as a convenient search option: https://github.com/lucaong/minisearch/pull/274

Once I am happy with the implementation and the documentation, I will probably merge it and make a new release.

@jasonpolites this feature will make your use case a lot simpler to solve in the near future.

jasonpolites commented 1 month ago

PR looks great. Very clean/simple. Thanks!

lucaong commented 1 month ago

@jasonpolites version v7.1.0 is published on NPM, including the term boosting feature. I will go ahead and close this issue, as it should now be properly addressed. Thanks a lot for reporting!