Hi @Trehxn, in my opinion the best way to boost matches before the first comma differently from matches after it is to create two fields.
One way is to transform your documents so that they have titleEnBeforeComma and titleEnAfterComma as separate fields. If it's ok to mutate your documents to add those fields, you could do it like this:
import MiniSearch from 'minisearch'

// Split each title at the first comma and store both parts on the document
documents.forEach((doc) => {
  const [, titleEnBeforeComma, titleEnAfterComma] = doc.titleEn.match(/([^,]*)(.*)/)
  doc.titleEnBeforeComma = titleEnBeforeComma
  doc.titleEnAfterComma = titleEnAfterComma
})

const miniSearch = new MiniSearch({
  fields: ['titleEnBeforeComma', 'titleEnAfterComma' /* , ...other fields */],
  searchOptions: {
    // Matches after the comma count for a tenth of the default weight of 1
    boost: { titleEnAfterComma: 0.1 }
  }
})
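As a quick sanity check of the split, here is a minimal sketch that applies the same regular expression as the snippet above to a few sample titles. The helper name `splitAtFirstComma` and the sample titles other than 'avocado, cubed' are made up for illustration. Only the first comma splits the title, and the second part keeps its leading comma.

```javascript
// Hypothetical helper: same regular expression as in the snippet above
const splitAtFirstComma = (title) => {
  const [, before, after] = title.match(/([^,]*)(.*)/)
  return { before, after }
}

// Titles with one comma, no comma, and several commas
const samples = ['avocado, cubed', 'avocado', 'onion, red, diced']
for (const title of samples) {
  console.log(JSON.stringify(title), '=>', JSON.stringify(splitAtFirstComma(title)))
}
```

A title without a comma ends up entirely in the first part, with an empty string as the second part.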
Alternatively, if you do not want to mutate your documents, you could create "virtual fields" in MiniSearch by using a custom extractField:
const miniSearch = new MiniSearch({
  fields: ['titleEnBeforeComma', 'titleEnAfterComma' /* , ...other fields */],
  extractField: (doc, fieldName) => {
    // Regular fields are read straight from the document
    if (fieldName !== 'titleEnBeforeComma' && fieldName !== 'titleEnAfterComma') {
      return doc[fieldName]
    }
    // Virtual fields: split titleEn at the first comma (tolerating a missing title)
    const title = doc.titleEn || ''
    const i = title.indexOf(',')
    if (fieldName === 'titleEnBeforeComma') {
      return (i === -1) ? title : title.slice(0, i)
    }
    return (i === -1) ? '' : title.slice(i)
  },
  searchOptions: {
    boost: { titleEnAfterComma: 0.1 }
  }
})
Regarding ligatures, if there is only a fixed number of them, one option is to perform the normalization manually. Here is an example:
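// Characters to replace, mapped to their plain-letter equivalents
const replacements = {
  'œ': 'oe',
  'ü': 'ue',
  'ä': 'ae',
  'ö': 'oe'
}
const replaceMatch = (match) => replacements[match] || ''
// Build a character class from the keys above and replace every occurrence
const normalizeSpecialChars = (term) =>
  term.replace(new RegExp(`[${Object.keys(replacements).join('')}]`, 'g'), replaceMatch)

const miniSearch = new MiniSearch({
  fields: [/* ... */],
  // Lowercase first, so only lowercase characters need to be listed in the map
  processTerm: (term) => normalizeSpecialChars(term.toLowerCase())
})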
Otherwise, you could find some library to perform locale-specific normalization. MiniSearch does not provide language-specific solutions, but allows you to plug in your own by using a custom processTerm.
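For French text specifically, a minimal sketch of a processTerm-style function might combine Unicode decomposition for accents with a small ligature map. It relies only on the standard `String.prototype.normalize`; the function name `normalizeFrenchTerm` and the ligature list are assumptions for illustration, not something MiniSearch ships. The explicit map is still needed because œ and æ have no Unicode decomposition.

```javascript
// Illustrative ligature map (an assumption; extend it for your data)
const ligatures = { 'œ': 'oe', 'æ': 'ae' }

// Hypothetical helper, not part of MiniSearch
const normalizeFrenchTerm = (term) =>
  term
    .toLowerCase()
    // Split accented letters into a base letter plus combining marks
    .normalize('NFD')
    // Drop the combining marks (accents, cedilla, diaeresis)
    .replace(/[\u0300-\u036f]/g, '')
    // Expand the ligatures listed in the map
    .replace(/[œæ]/g, (ch) => ligatures[ch])

console.log(normalizeFrenchTerm('Cœur'))
console.log(normalizeFrenchTerm('brûlée'))
```

It could then be passed as `processTerm: normalizeFrenchTerm`. Unlike the replacement map above, this turns ü into u rather than ue, so it is probably a poor fit if the data also contains German words.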
Exactly what I needed, thanks a lot for your assistance. Everything checks out.
You are welcome @Trehxn :)
I will close the issue for now, but feel free to comment further if something is needed.
Hi, I have a field that contains a string with multiple words separated by a comma, for example titleEn: 'avocado, cubed'. I want to assign a weight of 1.0 to the segment before the comma and 0.1 to the segment after it. Is this possible?
Also, I have read the issues where you solved the case of accents, but what can be done about ligatures? Is there any support for the French language that could be used here?
Thanks