lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Boost recency #205

Closed chrisj74 closed 1 year ago

chrisj74 commented 1 year ago

Hi, hope you can help. I'd like to boost results based on recency. I have a lastUsed property that stores a unix time value as a number. Is there a way for results to take this into consideration when weighting? I know I can limit results with the filter, but this is not what I need. I want results with a higher (more recent) time to get a boost.

Many thanks

acnebs commented 1 year ago

I'm also interested in this behavior, though my dates are stored as ISO 8601 strings rather than unix timestamps.

lucaong commented 1 year ago

Hi @chrisj74 and @acnebs, it is possible to achieve what you want with the boostDocument search option: it can be set to a function that takes a document ID (and a matching search term, which is not relevant in your case) and returns a boost value:

const miniSearch = new MiniSearch({
  fields: [/* ... */],
  searchOptions: {
    boostDocument: (id, term) => {
      // Here you can return a boost factor based on the document ID and, optionally, the term
    }
  }
})

You would need to turn a timestamp into a reasonable boost value (which should be a small positive number). How to do this depends on your specific use case. A simple option would be to boost results more recent than a certain timestamp, or to have different levels of boosting for different thresholds of recency.

Here's a quick example, following the simplest approach to boost results with a timestamp more recent than 10 days ago:

// Here timestamps are in milliseconds since Unix epoch
const documents = [
  { id: 1, title: 'Not recent', timestamp: 1667401507702 },
  { id: 2, title: 'Somehow recent', timestamp: 1677402805437 },
  { id: 3, title: 'Most recent', timestamp: 1677403006317 }
]

const documentById = documents.reduce((byId, doc) => {
  byId[doc.id] = doc
  return byId
}, {})

const tenDaysAgo = (new Date()).valueOf() - (1000*60*60*24*10)

const miniSearch = new MiniSearch({
  fields: ['title'],
  searchOptions: {
    // Boost documents more recent than 10 days ago
    boostDocument: (id) => (documentById[id].timestamp > tenDaysAgo) ? 1.5 : 1.0
  }
})

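// Index the documents before searching
miniSearch.addAll(documents)

// This should sort "Most recent" and "Somehow recent" before "Not recent"
// (provided their timestamps fall within the last 10 days when this runs)
miniSearch.search('recent')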
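
The other simple option mentioned above, different levels of boosting for different thresholds of recency, could be sketched roughly as follows. The tier cut-offs, the boost factors, and the names RECENCY_TIERS and tieredBoost are hypothetical and only meant as an illustration; suitable values depend on the use case.

```javascript
const DAY_MS = 1000 * 60 * 60 * 24

// Hypothetical tiers, ordered from most to least recent.
// The numbers are illustrative, not recommendations.
const RECENCY_TIERS = [
  { maxAgeDays: 1, boost: 2.0 },
  { maxAgeDays: 10, boost: 1.5 },
  { maxAgeDays: 100, boost: 1.2 }
]

// Returns the boost of the first tier whose age limit covers the timestamp,
// or 1.0 (no boost) when the timestamp is older than every tier.
const tieredBoost = (timestamp, now) => {
  const ageDays = (now - timestamp) / DAY_MS
  const tier = RECENCY_TIERS.find(({ maxAgeDays }) => ageDays <= maxAgeDays)
  return tier ? tier.boost : 1.0
}

// Fixed reference time, so the results do not depend on the clock
const now = 1677403006317
console.log(tieredBoost(now - 2 * DAY_MS, now))   // 1.5 (10-day tier)
console.log(tieredBoost(now - 365 * DAY_MS, now)) // 1 (older than every tier)
```

With the documentById lookup from the example above, this could be plugged in as boostDocument: (id) => tieredBoost(documentById[id].timestamp, now), with now computed once outside the function.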

Coming up with a continuous function for the boosting is possible too. This formula, for example, gives very recent timestamps a boost of ~1, and boosts older timestamps less and less (a timestamp 1 day in the past would receive a boost of ~0.77, 10 days in the past ~0.49, 100 days in the past ~0.33, etc.):

1 / (1 + Math.log10(1 + (currentTimestamp - timestamp)/(1000*60*60*24)))

Dividing the age of the document (the difference between the two timestamps) by a factor greater than 1 will make the boost decay more gradually, but fine-tuning such a function can only be done with knowledge of your domain. I would recommend starting with a simple solution, and working from there.
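
As a rough sketch, the formula above could be wrapped in a small function. The names recencyBoost and scaleDays are hypothetical (not part of MiniSearch); scaleDays is one possible form of the extra divisor, and the clamp to zero for timestamps in the future is an added assumption that keeps the logarithm well defined.

```javascript
const DAY_MS = 1000 * 60 * 60 * 24

// timestamp and now are in milliseconds since the Unix epoch.
// scaleDays (hypothetical tuning knob): values above 1 make the boost decay more gradually.
const recencyBoost = (timestamp, now, scaleDays = 1) => {
  // Assumption: timestamps in the future are treated as "just now"
  const ageDays = Math.max(0, now - timestamp) / DAY_MS
  return 1 / (1 + Math.log10(1 + ageDays / scaleDays))
}

// Get the current time once, outside of the boosting function
const now = Date.now()

console.log(recencyBoost(now, now))                // 1
console.log(recencyBoost(now - DAY_MS, now))       // ~0.77
console.log(recencyBoost(now - 10 * DAY_MS, now))  // ~0.49
console.log(recencyBoost(now - 100 * DAY_MS, now)) // ~0.33
```

Note that every value except the first is below 1, so with this formula older documents are scored down relative to fresh ones rather than fresh ones being scored up. The ordering effect is the same.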

@acnebs your case is basically the same, but you would need to parse the date first, to turn it into an integer timestamp.
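
A minimal sketch of that parsing step, using the built-in Date.parse. The lastUsed field name is borrowed from the original question, and toTimestamp and the sample documents are hypothetical. Complete ISO 8601 strings with an explicit offset (such as a trailing Z) are the safest input: a date-time string without an offset is interpreted as local time.

```javascript
// Converts an ISO 8601 string to milliseconds since the Unix epoch.
// Throws on unparseable input instead of letting NaN leak into the boost.
const toTimestamp = (isoString) => {
  const ms = Date.parse(isoString)
  if (Number.isNaN(ms)) throw new RangeError(`Invalid date: ${isoString}`)
  return ms
}

// Hypothetical documents storing dates as ISO 8601 strings
const rawDocuments = [
  { id: 1, title: 'Not recent', lastUsed: '2022-11-02T15:05:07.702Z' },
  { id: 2, title: 'Most recent', lastUsed: '2023-02-26T09:16:46.317Z' }
]

// Parse once, up front, so the boosting function only handles numbers
const documents = rawDocuments.map((doc) => ({
  ...doc,
  timestamp: toTimestamp(doc.lastUsed)
}))

console.log(documents[0].timestamp) // 1667401507702
```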

I hope this helps.

chrisj74 commented 1 year ago

Thanks for the detailed response - I'll give this a try, seems straightforward :)

acnebs commented 1 year ago

Yes, thank you for this! Starting to think I should store my dates as unix timestamps instead of ISO 8601 strings 😅

Out of curiosity, what in general are the performance implications of the boostDocument feature? Seems like this might be quite slow to use for this use-case if you are dealing with a reasonably large dataset?

lucaong commented 1 year ago

@acnebs the function is called for each document that matches the full-text search (not for all documents in the index), so if you keep it reasonably fast, the performance hit should not be noticeable in most cases. That said, slow operations should definitely be avoided inside that function. The examples above should be fast enough, even on large datasets (ideally, computation that only needs to happen once, such as getting the current timestamp, should be moved outside of the function).

Another option, especially when running MiniSearch on the client side and reindexing on the fly, is to precompute the recency boost for all documents right before indexing them, then simply look up the precomputed boost in boostDocument. This incurs a greater cost at indexing time, but a lower cost at search time. Honestly though, I would not expect this to make a perceivable difference in this case, as the calculation is very fast, so I would just go for the approach in the examples above. In your case, maybe the one thing that could make sense is to pre-parse the ISO 8601 string into an integer timestamp before indexing.
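
For illustration, the precomputing variant might look roughly like this. It reuses the continuous formula from earlier in the thread; precomputeBoosts, the Map-based lookup, and the clamp for future timestamps are assumptions of this sketch rather than anything MiniSearch provides.

```javascript
const DAY_MS = 1000 * 60 * 60 * 24

// Continuous formula from earlier in the thread
// (future timestamps clamped to zero age, which is an added assumption)
const recencyBoost = (timestamp, now) =>
  1 / (1 + Math.log10(1 + Math.max(0, now - timestamp) / DAY_MS))

// Computes every boost once, right before indexing, and returns a cheap
// lookup function. Unknown IDs fall back to 1.0 (no boost).
const precomputeBoosts = (documents, now = Date.now()) => {
  const boostById = new Map(
    documents.map((doc) => [doc.id, recencyBoost(doc.timestamp, now)])
  )
  return (id) => boostById.get(id) ?? 1.0
}

const documents = [
  { id: 1, title: 'Not recent', timestamp: 1667401507702 },
  { id: 3, title: 'Most recent', timestamp: 1677403006317 }
]

// Fixed reference time for a reproducible example; Date.now() in practice
const boostDocument = precomputeBoosts(documents, 1677403006317)

console.log(boostDocument(3)) // 1
```

The returned function takes a document ID and returns a number, so it can be passed as the boostDocument search option. The boosts are frozen at the time of precomputation, so they would need to be recomputed whenever the documents are reindexed.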

In most cases, rendering the search results, especially for long lists, is slower than retrieving the results with MiniSearch, so I would rather optimize the rendering part.

lucaong commented 1 year ago

@chrisj74 @acnebs I will close the issue now, as I think that the question is answered, but feel free to comment further and I will reply and/or reopen the issue if necessary.

I hope my answer helped you implement your features :)