olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

"A NAME" value gets indexed unexpectedly compared to "B NAME" #478

Closed Strat1987 closed 3 years ago

Strat1987 commented 3 years ago

We're seeing an unexpected invertedIndex for one specific value:

import lunr from 'lunr'

// Do not split tokens on whitespace, because we need to determine exact matches at runtime
// eslint-disable-next-line no-empty-character-class
lunr.tokenizer.separator = /[]/

const products = [
    {
        id: 29,
        REF: 'A',
        NAME: 'A NAME'
    },
    {
        id: 31,
        REF: 'B',
        NAME: 'B NAME'
    }
]

const idx = lunr(function() {
    this.ref('id')
    this.field('REF')
    this.field('NAME')

    products.forEach(function(p) {
        this.add(p)
    }, this)
})

console.log(
    'search result "a"',
    idx.query(q => {
        q.term(lunr.tokenizer('a'), {boost: 100, fields: ['REF']})
    })
)
console.log(
    'search result for "b"',
    idx.query(q => {
        q.term(lunr.tokenizer('b'), {boost: 100, fields: ['REF']})
    })
)

console.log('index', JSON.stringify(idx))

This gives the following output:

search result "a" []
search result for "b" [ { ref: '31', score: 0.49200000000000005, matchData: { metadata: [Object: null prototype] } } ]

{"version":"2.3.9","fields":["REF","NAME"],"fieldVectors":[["REF/29",[]],["NAME/29",[0,0.693]],["REF/31",[1,0.492]],["NAME/31",[2,0.693]]],"invertedIndex":[["a nam",{"_index":0,"REF":{},"NAME":{"29":{}}}],["b",{"_index":1,"REF":{"31":{}},"NAME":{}}],["b name",{"_index":2,"REF":{},"NAME":{"31":{}}}]],"pipeline":["stemmer"]}

In particular, the "a nam" key in the invertedIndex looks odd, as does the missing "a" key.

Strat1987 commented 3 years ago

Is there a way to customize the lunr.stopWordFilter? https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L1194

Strat1987 commented 3 years ago

The answer in https://github.com/olivernn/lunr.js/issues/408 addresses the problem I was having with this.

The stopWordFilter can be disabled:

const idx = lunr(function() {
    this.pipeline.remove(lunr.stopWordFilter)
    // ...fields and documents as in the first example
})

Strat1987 commented 3 years ago

@olivernn Somehow the stemmer is changing the "A NAME" value to "A NAM", whereas the following reference implementation leaves the value untouched: http://9ol.es/porter_js_demo.html
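
A possible explanation for why only "a name" is affected, sketched under assumptions: because the separator no longer splits on whitespace, the whole string reaches the stemmer as a single token, and a Porter-style stemmer treats the space as just another consonant. The sketch below re-creates only the "drop a trailing e" rule (step 5a) using the regular expressions from the classic JavaScript Porter port. It is an illustration, not lunr's code verbatim, and the name stripFinalE is made up for this example.

```javascript
// Building blocks in the style of the classic JavaScript Porter port.
// Note that "[^aeiou]" also matches a space, so " n" counts as a consonant run.
const c = "[^aeiou]";        // a single consonant
const v = "[aeiouy]";        // a single vowel
const C = c + "[^aeiouy]*";  // a run of consonants
const V = v + "[aeiou]*";    // a run of vowels

const measureGt1 = new RegExp("^(" + C + ")?" + V + C + V + C);           // m > 1
const measureEq1 = new RegExp("^(" + C + ")?" + V + C + "(" + V + ")?$"); // m = 1
const endsCvc = new RegExp("^" + C + v + "[^aeiouwxy]$");                 // consonant run, vowel, consonant

// Hypothetical helper: applies only the step 5a rule to a whole token.
function stripFinalE(token) {
  const match = /^(.+?)e$/.exec(token);
  if (!match) return token;
  const stem = match[1];
  // Drop the "e" when m > 1, or when m = 1 and the stem does not end consonant-vowel-consonant.
  if (measureGt1.test(stem) || (measureEq1.test(stem) && !endsCvc.test(stem))) {
    return stem;
  }
  return token;
}

console.log(stripFinalE("a name")); // → "a nam"
console.log(stripFinalE("b name")); // → "b name"
```

Under this reading, the leading vowel in "a name" pushes the measure of "a nam" above 1, so the final "e" is dropped, while "b nam" has measure 1 and ends consonant-vowel-consonant, so "b name" is left alone. That would match the two keys seen in the invertedIndex above.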

Strat1987 commented 3 years ago

Even when performing a full reset of the pipeline

const idx = lunr(function() {
    this.pipeline.reset()
    // ...fields and documents as in the first example
})

the index still prints a reference to the stemmer in its pipeline:

index {"version":"2.3.9","fields":["REF","NAME"],"fieldVectors":[["REF/29",[0,0.693]],["NAME/29",[1,0.693]],["REF/31",[2,0.693]],["NAME/31",[3,0.693]]],"invertedIndex":[["a",{"_index":0,"REF":{"29":{}},"NAME":{}}],["a name",{"_index":1,"REF":{},"NAME":{"29":{}}}],["b",{"_index":2,"REF":{"31":{}},"NAME":{}}],["b name",{"_index":3,"REF":{},"NAME":{"31":{}}}]],"pipeline":["stemmer"]}

A search query for "a name" still returns no results:

console.log(
    'search result for tokenized "a name"',
    idx.query(q => {
        q.term(lunr.tokenizer('a name'), {boost: 100, fields: ['NAME']})
    })
)

search result for "a name" []

while a similar call for "b name" does return a result:

search result for "b name" [ { ref: '31', score: 0.693, matchData: { metadata: [Object: null prototype] } } ]

Both are now keys in the invertedIndex.

Strat1987 commented 3 years ago

The final piece of this puzzle for me was to also reset the searchPipeline, which was still using the stemmer:

this.searchPipeline.reset()
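
To illustrate why resetting only the index pipeline was not enough, here is a toy model of the mismatch. It is not lunr's implementation: buildIndex, search, and toyStemmer are hypothetical names, and toyStemmer only rewrites the one token seen in this thread.

```javascript
// Toy model: index keys are produced by an index-time pipeline,
// while query terms pass through a separate search-time pipeline.
function runPipeline(pipeline, token) {
  return pipeline.reduce((current, fn) => fn(current), token);
}

function buildIndex(docs, indexPipeline) {
  const inverted = new Map();
  for (const [ref, text] of Object.entries(docs)) {
    // Lowercase first, then apply the index-time pipeline to the whole string.
    const key = runPipeline(indexPipeline, text.toLowerCase());
    if (!inverted.has(key)) inverted.set(key, []);
    inverted.get(key).push(ref);
  }
  return inverted;
}

function search(inverted, term, searchPipeline) {
  const key = runPipeline(searchPipeline, term.toLowerCase());
  return inverted.get(key) || [];
}

// Hypothetical stand-in for a stemmer that only affects "a name".
const toyStemmer = (token) => (token === "a name" ? "a nam" : token);

const docs = { 29: "A NAME", 31: "B NAME" };

// Index pipeline is empty ("reset"), so keys are stored unstemmed.
const inverted = buildIndex(docs, []);

// The search pipeline still stems: "a name" becomes "a nam" and misses its key.
const stemmedA = search(inverted, "a name", [toyStemmer]);
// "b name" is untouched by the stemmer, so it still matches.
const stemmedB = search(inverted, "b name", [toyStemmer]);
// With the search pipeline reset as well, "a name" matches again.
const plainA = search(inverted, "a name", []);

console.log(stemmedA, stemmedB, plainA); // → [] [ '31' ] [ '29' ]
```

The point of the sketch is only that the terms stored at index time and the terms produced at query time have to go through the same transformations, otherwise some lookups silently miss.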

olivernn commented 3 years ago

It looks like you managed to solve your own problem, so I'm closing this one now. Feel free to comment if there is still something that needs clearing up.