lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Highlighting Matches #37

Closed janwirth closed 4 years ago

janwirth commented 4 years ago

Hey there :wave:

What is the preferred strategy to get the index of a hit within the original body in order to highlight it?

stalniy commented 4 years ago

This is what I ended up with:

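// Escape regex metacharacters so a term such as "c++" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function markHints(result) {
  const hints = {};

  result.terms.forEach((term) => {
    const regexp = new RegExp(`(${escapeRegExp(term)})`, 'gi');

    result.match[term].forEach((field) => {
      // Assumes the field value is present on the result
      // (e.g. stored via the `storeFields` option).
      const value = result[field];

      // Note: marking always starts from the original value, so when several
      // terms match the same field, only the last term's marking is kept.
      if (typeof value === 'string') {
        hints[field] = value.replace(regexp, '<mark>$1</mark>');
      } else if (field === 'headings') {
        // `headings` is an array of { id, title } objects in my documents.
        const markedValue = value.reduce((items, h) => {
          if (h.title.toLowerCase().includes(term)) {
            items.push({
              id: h.id,
              title: h.title.replace(regexp, '<mark>$1</mark>'),
            });
          }
          return items;
        }, []);
        hints[field] = markedValue.length ? markedValue : null;
      }
    });
  });

  return hints;
}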

You may need to adapt the marking logic to your own object structure. The function is used like this:

minisearch.search(query, options).map((result) => {
  result.hints = markHints(result);
  return result;
});

lucaong commented 4 years ago

Hi @FranzSkuffka! Great question. MiniSearch does not return the offsets of the terms in the documents. This keeps the index much more compact, as it does not have to store these offsets for each (term, document) pair.

That said, it does return a match object for each result, indicating the matching terms and in which fields they were found. This allows you to find the term in the document after you get the result.

To clarify, here is an example:

import MiniSearch from 'minisearch'

const m = new MiniSearch({ fields: ['title', 'text'] })

// Add some documents
const documents = [
  { id: 1, title: 'Something interesting', text: 'Something really interesting' },
  { id: 2, title: 'Something fun', text: 'Yay!' }
]
m.addAll(documents)

// Let's also keep a map of documents by ID; it will be useful later:
const documentById = documents.reduce((byId, document) => {
  byId[document.id] = document
  return byId
}, {})

// If we search for "something", results each contain match information:
let results = m.search('something')
//=> [
//   {
//     id: 1,
//     terms: [ 'something' ],
//     score: 0.5776226504666211,
//     match: { something: [ 'title', 'text' ] }
//   },
//   {
//     id: 2,
//     terms: [ 'something' ],
//     score: 0,
//     match: { something: [ 'title' ] }
//   }
// ]

// This also works with fuzzy or prefix matching, as the `match` info
// contains the actual terms that matched:
m.search('realy', { fuzzy: true })
//=> [
//   {
//     id: 1,
//     terms: [ 'really' ],
//     score: 0.5472502609821137,
//     match: { really: [ 'text' ] }
//   }
// ]

// The above `match` means that the mistyped search "realy"
// matched document with ID 1, with the term "really" in field "text"

In other words, the match field in the result tells you which terms matched and in which fields. You can then find the index of the matched term in a field of the matched document with String.prototype.matchAll().

Unfortunately, this is a bit complicated, because we first have to normalize the field (with the default options, lowercasing it is enough). Also, matchAll is not available in all browsers, so it might need a polyfill.

I know this is not trivial, but offering this out of the box would have made either the inverted index much bigger, or MiniSearch substantially slower. I think that a utility computing these offsets from the search results could be a good idea for a library.
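As a rough illustration of the approach described above, here is a minimal sketch that computes offsets from a result's match info. The function name findMatchOffsets is hypothetical (it is not part of MiniSearch), and the sketch assumes the default normalization, where lowercasing the field is enough. It also assumes that lowercasing does not change the string length, which holds for ASCII text but not for every Unicode character.

```javascript
// Hypothetical helper, not part of MiniSearch.
// `result` has the shape returned by search(), as in the example above;
// `document` is the original document (e.g. documentById[result.id]).
function findMatchOffsets(result, document) {
  const offsets = {}

  for (const [term, fields] of Object.entries(result.match)) {
    // Escape regex metacharacters so the term is matched literally
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

    for (const field of fields) {
      const value = document[field]
      if (typeof value !== 'string') continue

      // Default normalization: lowercase the field before matching
      const normalized = value.toLowerCase()

      offsets[field] = offsets[field] || []
      for (const match of normalized.matchAll(new RegExp(escaped, 'g'))) {
        offsets[field].push({
          term,
          start: match.index,
          end: match.index + term.length
        })
      }
    }
  }

  return offsets
}

// Usage with the first result from the example above:
const exampleResult = {
  id: 1,
  terms: ['something'],
  match: { something: ['title', 'text'] }
}
const exampleDocument = {
  id: 1,
  title: 'Something interesting',
  text: 'Something really interesting'
}
const exampleOffsets = findMatchOffsets(exampleResult, exampleDocument)
//=> {
//   title: [ { term: 'something', start: 0, end: 9 } ],
//   text: [ { term: 'something', start: 0, end: 9 } ]
// }
```

Note that this matches substrings, so a term can also be found inside a longer word. Restricting matches to word boundaries would need the same tokenization rules as the index.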

EDIT:

I just saw that @stalniy beat me to it ;) His solution looks like a good approach.

janwirth commented 4 years ago

Thank you both @stalniy & @lucaong.

I do understand that this feature will never be built into minisearch core.

Could this little code snippet find a home in the wiki?

Another post-processing step is clustering: if we have a long document with several matches, we want to render each paragraph that contains a match, with the matched text highlighted. If there is more than one match in a paragraph, we want to highlight every match in it rather than print the paragraph once per match.
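One way to sketch this clustering step is shown below. The function name highlightParagraphs is hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that the text is safe to embed in HTML as it is (real code should escape it first). It uses a single regular expression for all matched terms, so each paragraph is emitted at most once with all of its matches marked.

```javascript
// Hypothetical helper, not part of MiniSearch.
// `text` is the original field value; `terms` are the matched terms
// (e.g. result.terms from a search result).
function highlightParagraphs(text, terms) {
  if (terms.length === 0) return []

  // Escape regex metacharacters and combine all terms into one pattern
  const escaped = terms.map((term) =>
    term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  )
  const regexp = new RegExp(`(${escaped.join('|')})`, 'gi')

  const highlighted = []
  // Assumption: paragraphs are separated by one or more blank lines
  for (const paragraph of text.split(/\n\s*\n/)) {
    const marked = paragraph.replace(regexp, '<mark>$1</mark>')
    // Keep only paragraphs in which at least one term was marked
    if (marked !== paragraph) highlighted.push(marked)
  }
  return highlighted
}
```

Because all terms share one pattern, a paragraph with several matches is rendered once with every match wrapped in a mark tag.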

lucaong commented 4 years ago

I have a plan to include a "How to" section in the documentation website. This would fit very well there.

lucaong commented 4 years ago

I will close this issue for now, as the original question was answered, and put the “How to” section on the roadmap. Feel free to continue the discussion though.

akvadrako commented 2 years ago

What if the matched term is not actually in the document, for example due to stemming?

lucaong commented 2 years ago

@akvadrako in this case, unfortunately, MiniSearch won’t help with highlighting terms. To do so, the index would have to store a lot of meta-information (such as the positions of terms in the documents). MiniSearch is optimized for constrained environments, such as browsers, so in this trade-off it chooses to minimize index size and indexing time.

akvadrako commented 2 years ago

It would be nice if minisearch had that ability, since manual highlighting is problematic not only with stemming but also with fuzzy search.

To help with resource use, you could build a new index for just the matching documents. So instead of doing doc.indexOf(...), one could do something like new MiniSearch({..., withLocations: true }).add(doc).search('query').

lucaong commented 2 years ago

Thanks @akvadrako, I will consider adding this possibility in a major release. One note though: with fuzzy search, the match field will contain the actual term found in the document (apart from term processing), so the approach outlined above would work.
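For the term-processing case (for example stemming), one possible workaround is to re-tokenize the matched document and apply the same term processing to each token, keeping the offsets of tokens whose processed form equals a matched term. The sketch below is only an illustration: findProcessedOffsets and toyProcessTerm are hypothetical names, the toy stemmer stands in for whatever processTerm function was used at indexing time, and the tokenizer pattern is an assumption that should be replaced by the tokenization the index actually uses.

```javascript
// Hypothetical toy term processor: lowercases and strips a trailing
// "ing" or "s". Real code would reuse the indexing-time processTerm.
const toyProcessTerm = (token) => token.toLowerCase().replace(/(ing|s)$/, '')

// Hypothetical helper, not part of MiniSearch.
function findProcessedOffsets(text, matchedTerms, processTerm) {
  const wanted = new Set(matchedTerms)
  const offsets = []

  // Assumption: tokens are runs of letters, digits, or underscores
  for (const match of text.matchAll(/[\p{L}\p{N}_]+/gu)) {
    const term = processTerm(match[0])
    if (wanted.has(term)) {
      offsets.push({
        term,
        start: match.index,
        end: match.index + match[0].length
      })
    }
  }
  return offsets
}

// The stored term "search" is not literally a word in the text,
// but the token "Searching" is processed to it:
const stemmedOffsets = findProcessedOffsets('Searching is fun', ['search'], toyProcessTerm)
//=> [ { term: 'search', start: 0, end: 9 } ]
```

This only touches the documents returned by the search, so the index itself stays compact.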

akvadrako commented 2 years ago

Thanks @lucaong, I will try it as it is. For my use case, occasionally missing highlighting is acceptable.