lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Re-index to new documents with unknown changes #253

Closed by tommedema 5 months ago

tommedema commented 5 months ago

I'm responsible for reviewing a colleague's PR and they've implemented minisearch in React as follows:

const searchEngine = useMemo(() => {
  const search = new MiniSearch({
    fields: keys,
    idField: 'uuid',
    storeFields: ['userId', 'email', 'name', 'picture', 'role'],

    searchOptions: {
      fields: keys,
      prefix: true,
      fuzzy: 0.2,
      boost: {
        email: 1.5,
        name: 1,
      },
      weights: {
        fuzzy: 0.8,
        prefix: 0.2,
      },
      bm25: {
        b: 1,
        d: 1,
        k: 2,
      },
    },
  })

  const mappedSelectableContacts = selectableContacts.map((contact) => ({
    ...contact,
    uuid: contact.userId ?? contact.email,
  }))

  search.addAll(mappedSelectableContacts)

  return search
}, [keys, selectableContacts])

While this works, it seems inefficient because the entire index is rebuilt whenever keys or selectableContacts changes. Since this runs inside a React useMemo hook, we don't know what actually changed within selectableContacts. How would you approach this?

lucaong commented 5 months ago

Hi @tommedema, if you don’t know which contacts have changed, reindexing all of them is the only way. Depending on how many contacts you have, and as long as they don’t change too often, this should be fine. For example, for a few thousand contacts that change infrequently, this is a good solution.

You could make it more efficient if you know which contact(s) changed: in that case you can call the replace method only for those contacts and avoid reindexing the rest. How to keep track of what changed is beyond the scope of MiniSearch though, and depending on your application it might be easy or quite hard. If the contact list is very large, it probably makes sense to track changes and avoid a full reindex.
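As a rough illustration of tracking changes outside MiniSearch, the sketch below diffs the previous and next contact lists and applies only the differences to an existing index. The helpers `diffContacts` and `syncIndex` are hypothetical names, not part of MiniSearch. The sketch assumes that updated contacts arrive as new objects (immutable-style updates), that the raw contacts are diffed before the `uuid` field is added, and that the installed MiniSearch version provides `replace` and `discard` alongside `addAll`.

```javascript
// Same id rule as in the original snippet: userId if present, otherwise email.
const contactId = (contact) => contact.userId ?? contact.email
const withUuid = (contact) => ({ ...contact, uuid: contactId(contact) })

// Hypothetical helper: finds added, changed and removed contacts.
// "Changed" is detected by object identity, which ASSUMES that an updated
// contact is a new object rather than a mutated one.
function diffContacts(prevContacts, nextContacts) {
  const prevById = new Map(prevContacts.map((c) => [contactId(c), c]))
  const nextIds = new Set()
  const added = []
  const changed = []

  for (const contact of nextContacts) {
    const id = contactId(contact)
    nextIds.add(id)
    if (!prevById.has(id)) {
      added.push(contact)
    } else if (prevById.get(id) !== contact) {
      changed.push(contact)
    }
  }

  const removedIds = [...prevById.keys()].filter((id) => !nextIds.has(id))
  return { added, changed, removedIds }
}

// Hypothetical helper: applies the diff to an existing index.
// `index` is assumed to be a MiniSearch instance created with idField: 'uuid'
// that already contains prevContacts.
function syncIndex(index, prevContacts, nextContacts) {
  const { added, changed, removedIds } = diffContacts(prevContacts, nextContacts)

  index.addAll(added.map(withUuid))
  changed.forEach((contact) => index.replace(withUuid(contact)))
  removedIds.forEach((id) => index.discard(id))

  return { added: added.length, changed: changed.length, removed: removedIds.length }
}
```

In a React component, one possible arrangement is to keep the index and the previously seen list in refs and call `syncIndex` when `selectableContacts` changes. Whether the bookkeeping pays off depends on the list size, as noted above.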

I hope this helps

rolftimmermans commented 5 months ago

@tommedema My advice would be to index all contacts once and use the filter search option to search only within selectableContacts, for example by matching on a contact ID (or any other field that is unique per contact).

MiniSearch is pretty fast though, so if the number of contacts is small and the contact data itself is small, then reindexing everything would not be very time-consuming.
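A minimal sketch of the filter approach might look like the following. `searchSelectable` is a hypothetical helper name. `index` is assumed to be a MiniSearch instance that was built once from the full contact list, using the same id rule as the original snippet (`userId`, falling back to `email`).

```javascript
// Hypothetical helper: searches an index containing ALL contacts and keeps
// only the results whose id belongs to the currently selectable contacts.
function searchSelectable(index, query, selectableContacts) {
  // Build a Set for O(1) membership checks inside the filter callback.
  const selectableIds = new Set(
    selectableContacts.map((contact) => contact.userId ?? contact.email)
  )

  // `filter` is a MiniSearch search option; it receives each search result
  // and the result's `id` is the value of the document's id field.
  return index.search(query, {
    filter: (result) => selectableIds.has(result.id),
  })
}
```

With this arrangement the index only needs rebuilding when the full contact list or the indexed fields change. A change to `selectableContacts` only requires recomputing the id set, which could be memoized separately in a React component.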

lucaong commented 5 months ago

If the selectableContacts are a subset of the whole list of contacts, and what changes is which contacts belong to the subset rather than the contacts themselves, then what @rolftimmermans outlined is indeed the best strategy.

tommedema commented 5 months ago

Thank you!