Searching with whitespace, 'and', and 'the'

kg-currenxie commented 5 years ago

Hi. I'm having some issues writing a correct search query. Here are a few examples with code:

const lunrIndex = lunr(function () {
  this.ref('cca2')
  this.field('name')
  this.field('cca2')
  this.field('cca3')

  countries.forEach(country => {
    this.add(country)
  }, this)
})

My queries look like this (searching for Saint Vincent and the Grenadines):

"name:*Sa* cca2:Sa^1000 cca3:Sa*^1000"
"name:*Sai* cca3:Sai*^1000"
"name:*Saint*"
"name:*Saint*Vin*" (0 results)
"name:*And*" (0 results)
"name:*The*" (0 results)

Some other notes:

1. I've tried escaping the space with \\ as well, but it didn't seem to work (as found here https://github.com/olivernn/lunr.js/issues/366)

2. Using wildcards in place of spaces isn't great either. Searching for Sweden as Wed n still gives me Sweden as a result, which isn't what I expect.


What am I doing wrong, how can I fix the query so that spaces are matched exactly, and why doesn't it find the words "and" and "the"? :)

Thanks for building Lunr! It's amazing.

yeraydiazdiaz commented 5 years ago

I believe the problem is that you're expecting the full name of the country to be treated as a single token. By default, Lunr tokenizes fields on whitespace, so "Saint Vincent and the Grenadines" is indexed as three separate tokens: the three main words, minus "and" and "the", which are removed from the index because they're considered stop words.

When you search for *Saint*Vin*, Lunr finds no matching tokens, since "Saint" and "Vincent" are different tokens and neither of them contains both terms.
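To make this concrete, here is a rough sketch of what happens to the country name and to the wildcard queries. It is a simplified model, not Lunr's actual code: the stop word list is a tiny illustrative subset, stemming is ignored, and the helper names (toTokens, wildcardToRegExp, matchingTokens) are made up for this example.

```javascript
// Simplified model of indexing and wildcard matching; NOT lunr's implementation.
const STOP_WORDS = new Set(['and', 'the']) // illustrative subset only

// Lower-case, split on whitespace, drop stop words.
function toTokens(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(token => token.length > 0 && !STOP_WORDS.has(token))
}

// Turn a pattern such as "*saint*vin*" into a RegExp that must match ONE whole token.
function wildcardToRegExp(pattern) {
  const body = pattern
    .toLowerCase()
    .split('*')
    .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*')
  return new RegExp('^' + body + '$')
}

// Return the tokens of `text` that the wildcard pattern matches.
function matchingTokens(pattern, text) {
  const re = wildcardToRegExp(pattern)
  return toTokens(text).filter(token => re.test(token))
}

const name = 'Saint Vincent and the Grenadines'
console.log(toTokens(name))                      // ['saint', 'vincent', 'grenadines']
console.log(matchingTokens('*Saint*', name))     // ['saint']
console.log(matchingTokens('*Saint*Vin*', name)) // [] (no single token contains both parts)
console.log(matchingTokens('*And*', name))       // [] ("and" was never indexed)
```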

You can, however, change this behaviour via the fairly hidden lunr.tokenizer.separator (search for tokenizer).
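As a rough illustration of what the separator controls, the sketch below splits the same name with two different patterns. The first is Lunr's default separator (runs of whitespace and hyphens). The second, wholeFieldSeparator, is a hypothetical choice for this example and not necessarily what the JSFiddle uses. splitIntoTokens is a plain-JavaScript stand-in, not Lunr's tokenizer.

```javascript
// Plain-JavaScript stand-in for tokenization: lower-case, then split on the separator.
function splitIntoTokens(text, separator) {
  return text.toLowerCase().split(separator).filter(token => token.length > 0)
}

// lunr's default: split on runs of whitespace and hyphens.
const defaultSeparator = /[\s\-]+/

// Hypothetical alternative: a character that never appears in a country name,
// so the whole name is kept as a single token.
const wholeFieldSeparator = /\n/

const countryName = 'Saint Vincent and the Grenadines'

console.log(splitIntoTokens(countryName, defaultSeparator))
// ['saint', 'vincent', 'and', 'the', 'grenadines'] (before stop word removal)

console.log(splitIntoTokens(countryName, wholeFieldSeparator))
// ['saint vincent and the grenadines']

// With lunr loaded, the corresponding setting would be applied before building the index:
// lunr.tokenizer.separator = wholeFieldSeparator
```

Note that the separator is a global setting, so it affects every field of every index built after it is changed.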

Here's a quick JSFiddle.

However, you should also note that Lunr stems tokens by default. For example, in "Saint Vincent and the Grenadines", the last word will be shortened by removing the final "es". You can customize this behaviour as well, via the pipeline functions.
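One possible way to switch stemming off is sketched below. It assumes Lunr 2.x, where the builder exposes pipeline and searchPipeline and each has a remove method. The helper name disableStemming is made up for this example, and lunr is passed in as a parameter so the snippet does not depend on how the library is loaded.

```javascript
// Hypothetical helper: remove the stemmer from both the indexing pipeline
// and the search pipeline, so indexed tokens and query terms stay unstemmed.
// Assumes a Lunr 2.x builder (`this` inside the lunr(function () { ... }) callback).
function disableStemming(builder, lunr) {
  builder.pipeline.remove(lunr.stemmer)
  builder.searchPipeline.remove(lunr.stemmer)
}

// Usage sketch (assumes `lunr` and `countries` are available, as in the question):
//
// const lunrIndex = lunr(function () {
//   disableStemming(this, lunr) // must run before any documents are added
//   this.ref('cca2')
//   this.field('name')
//   countries.forEach(country => this.add(country))
// })
```

The stemmer is removed from both pipelines so that documents and queries are processed the same way. Removing it from only one side would make some exact terms stop matching.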

Hope this helps.

kg-currenxie commented 5 years ago

Looks good! Thanks <3

olivernn / lunr.js

Searching with whitespace, 'and', and 'the' #397