lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Specific weights to fields #162

Closed: Trehxn closed this issue 2 years ago

Trehxn commented 2 years ago

Hi, I have a field that contains a string of multiple words separated by a comma, e.g. titleEn: 'avocado, cubed'. I want to assign a weight of 1.0 to the segment before the comma and 0.1 to the segment after it. Is this possible?

Also, I have read the issues where you solved the case of accents, but what can be done about ligatures? Is there any support for the French language that could be used here?

Thanks

lucaong commented 2 years ago

Hi @Trehxn, in my opinion the best way to boost the text before the first comma differently from the text after it is to create two fields.

One way is to transform your documents so that they have titleEnBeforeComma and titleEnAfterComma as separate fields. If it's ok to mutate your documents to add those fields, you could do it like this:

documents.forEach((doc) => {
  // Split at the first comma; the comma itself is left out of both parts
  const [_, titleEnBeforeComma, titleEnAfterComma] = doc.titleEn.match(/^([^,]*),?([\s\S]*)$/)
  doc.titleEnBeforeComma = titleEnBeforeComma
  doc.titleEnAfterComma = titleEnAfterComma
})

const miniSearch = new MiniSearch({
  fields: ['titleEnBeforeComma', 'titleEnAfterComma' /* , ...other fields */],
  searchOptions: {
    boost: { titleEnAfterComma: 0.1 }
  }
})
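
To check the splitting logic on its own before indexing, here is a minimal sketch of a standalone helper. The names splitAtFirstComma and withSplitTitle are hypothetical, not part of MiniSearch, and the trimming of whitespace is an extra assumption on my side. This variant builds new objects instead of modifying the originals:

```javascript
// Hypothetical helper: split a title at its first comma.
// Both parts are trimmed; the comma itself is dropped.
const splitAtFirstComma = (title) => {
  const i = title.indexOf(',')
  if (i === -1) {
    // No comma: the whole title counts as the "before" part
    return { before: title.trim(), after: '' }
  }
  return { before: title.slice(0, i).trim(), after: title.slice(i + 1).trim() }
}

// Return copies of the documents with the two extra fields added,
// leaving the input documents untouched.
const withSplitTitle = (documents) =>
  documents.map((doc) => {
    const { before, after } = splitAtFirstComma(doc.titleEn)
    return { ...doc, titleEnBeforeComma: before, titleEnAfterComma: after }
  })

const sample = [
  { id: 1, titleEn: 'avocado, cubed' },
  { id: 2, titleEn: 'banana' }
]

console.log(withSplitTitle(sample))
```

The result of withSplitTitle(documents) could then be passed to miniSearch.addAll in place of the original documents.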

Alternatively, if you do not want to mutate your documents, you could create "virtual fields" in MiniSearch by using a custom extractField:

const miniSearch = new MiniSearch({
  fields: ['titleEnBeforeComma', 'titleEnAfterComma' /* , ...other fields */],
  extractField: (doc, fieldName) => {
    if (fieldName !== 'titleEnBeforeComma' && fieldName !== 'titleEnAfterComma') {
      return doc[fieldName]
    }

    // Position of the first comma, or -1 if there is none
    const i = doc.titleEn.indexOf(',')

    if (fieldName === 'titleEnBeforeComma') {
      return (i === -1) ? doc.titleEn : doc.titleEn.slice(0, i)
    }

    if (fieldName === 'titleEnAfterComma') {
      // Skip the comma itself
      return (i === -1) ? '' : doc.titleEn.slice(i + 1)
    }
  },
  searchOptions: {
    boost: { titleEnAfterComma: 0.1 }
  }
})

Regarding ligatures, if there is only a fixed number of them, one option is to perform the normalization manually. Here is an example:

// Map each special character to its ASCII replacement
const replacements = {
  'œ': 'oe',
  'ü': 'ue',
  'ä': 'ae',
  'ö': 'oe'
}

// Build the character class once, here /[œüäö]/g
const specialChars = new RegExp(`[${Object.keys(replacements).join('')}]`, 'g')

const normalizeSpecialChars = (term) =>
  term.replace(specialChars, (match) => replacements[match])

const miniSearch = new MiniSearch({
  fields: [/* ... */],
  // Lowercase first so that uppercase variants such as 'Œ' are also replaced
  processTerm: (term) => normalizeSpecialChars(term.toLowerCase())
})

Otherwise, you could find some library to perform locale-specific normalization. MiniSearch does not provide language-specific solutions, but allows you to plug your own by using a custom processTerm.
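
For French specifically, a possible starting point that needs no extra library is the built-in String.prototype.normalize. The sketch below assumes it is acceptable to strip all accents and to expand only the two ligatures œ and æ, which Unicode normalization leaves as they are. The name normalizeFrenchTerm is hypothetical, not part of MiniSearch:

```javascript
// Ligatures that Unicode normalization does not decompose; extend as needed
const ligatures = { 'œ': 'oe', 'æ': 'ae' }

// Hypothetical helper, not part of MiniSearch
const normalizeFrenchTerm = (term) =>
  term
    .toLowerCase()
    // Decompose accented letters, e.g. 'é' becomes 'e' plus a combining accent
    .normalize('NFD')
    // Drop the combining marks left behind by the decomposition
    .replace(/[\u0300-\u036f]/g, '')
    // Expand the remaining ligatures
    .replace(/[œæ]/g, (ch) => ligatures[ch])

console.log(normalizeFrenchTerm('Cœur'))   // coeur
console.log(normalizeFrenchTerm('Brûlée')) // brulee
```

It could then be plugged in with processTerm: normalizeFrenchTerm, so that indexed terms and query terms go through the same normalization.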

Trehxn commented 2 years ago

Exactly what I needed, thanks a lot for your assistance. Everything checks out.

lucaong commented 2 years ago

You are welcome @Trehxn :)

I will close the issue for now, but feel free to comment further if something is needed.