'json' query does not return expected results from 'application/json'

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License

8.89k stars 545 forks

'json' query does not return expected results from 'application/json' #365

Closed lewisnyman closed 6 years ago

lewisnyman commented 6 years ago

Hi 👋 I'm a bit stumped by the behaviour I'm seeing in this implementation. For some reason the term json doesn't return pages with API documentation in it, but application/json does. Is it related to the tokeniser?

I've created a jsfiddle here with the same content and the default pipeline functions turned off: https://jsfiddle.net/f3sbrtq2/25/

hoelzro commented 6 years ago

@lewisnyman Hi! Your suspicions about the tokenizer are spot on; by default it only splits tokens on - characters and whitespace. lunr looks up whole tokens in its inverted index, so documents containing application/json are listed under the application/json token and not under the standalone json token, hence the "missing" results.
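To make the lookup behaviour concrete, here is a minimal sketch of a whole-token inverted index. It is a toy model, not lunr's implementation: the names `DEFAULT_SEPARATOR`, `tokenize`, `buildIndex`, `search`, and the sample documents are all made up for illustration, and the separator simply mirrors the default described above (whitespace and hyphens).

```javascript
// Toy model only: this is NOT lunr's code, just the idea of whole-token lookup.
// Assumed default separator, as described above: whitespace and hyphens.
const DEFAULT_SEPARATOR = /[\s\-]+/;

// Split text into lowercase tokens, dropping empty strings.
function tokenize(text, separator) {
  return text.toLowerCase().split(separator).filter(Boolean);
}

// Build a minimal inverted index: token -> set of document ids.
function buildIndex(docs, separator) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const token of tokenize(text, separator)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// Look up a single term as a whole token; no partial matching.
function search(index, term) {
  return [...(index.get(term.toLowerCase()) || [])];
}

// Hypothetical sample documents.
const docs = {
  api: "Send requests with Content-Type application/json",
  guide: "A json primer",
};
const index = buildIndex(docs, DEFAULT_SEPARATOR);

console.log(search(index, "json"));             // finds only "guide"
console.log(search(index, "application/json")); // finds only "api"
```

Because `application/json` is stored as one token, the `api` document is never filed under `json`, which matches the behaviour reported in the issue.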

Another thing I noticed about your example: you remove everything from the builder pipeline, but you leave the search pipeline alone. You might want to call this.searchPipeline.remove(lunr.stemmer) as well.
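A rough illustration of why the two pipelines need to agree, assuming a pipeline is just a list of functions applied to each token in order. Everything here is hypothetical: `toyStemmer` is a crude stand-in and not lunr's stemmer, and `runPipeline` is not a lunr function.

```javascript
// Crude stand-in for a stemmer (hypothetical; lunr's real stemmer is more involved).
const toyStemmer = (token) => token.replace(/(ing|s)$/, "");

// Apply each pipeline function to the token in order.
const runPipeline = (pipeline, token) => pipeline.reduce((t, fn) => fn(t), token);

const buildPipeline = [];            // everything removed at index time
const searchPipeline = [toyStemmer]; // left untouched at search time

const indexedToken = runPipeline(buildPipeline, "results");  // stored unchanged
const queryToken = runPipeline(searchPipeline, "results");   // stemmed before lookup

console.log(indexedToken, queryToken, indexedToken === queryToken);

// With the stemmer also removed from the search side, the tokens agree again.
const fixedQueryToken = runPipeline([], "results");
console.log(indexedToken === fixedQueryToken);
```

In this sketch the query term is transformed but the indexed term is not, so an exact-token lookup would miss; removing the same function from both pipelines avoids the mismatch.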

Out of curiosity, what are you trying to use lunr to accomplish? By removing all of the processing functions from the pipeline you're removing a lot of the value of what lunr provides! Or was that just for the purposes of your example?

olivernn commented 6 years ago

What @hoelzro says is spot on. You can customise what is considered a separator by overriding the lunr.tokenizer.separator property. If you want even more control (at the cost of more work) you can implement a custom tokeniser and use it when building an index:

lunr(function () {
  this.tokenizer = myCustomTokenizer
})

lewisnyman commented 6 years ago

Thanks for the advice. I've updated the fiddle with the fix: https://jsfiddle.net/f3sbrtq2/37/

this.tokenizer.separator = /[\s\-/]+/;

Note: I'm running an older version of Lunr (0.7.0), so my line is:

this.tokenizerFn.seperator = /[\s\-/]+/;
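For reference, a quick check of what that separator pattern does to a media type, using plain `String.prototype.split` rather than lunr itself. The sample string and the variable names are made up for illustration.

```javascript
// The separator from the fix: whitespace, hyphens, and forward slashes.
const separator = /[\s\-/]+/;

// Hypothetical sample text; split and drop any empty strings.
const tokens = "Accept: application/json".split(separator).filter(Boolean);

console.log(tokens); // "application" and "json" are now separate tokens
```

With the slash treated as a separator, `json` becomes a token in its own right, so a search for json can reach documents that only mention application/json.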

@hoelzro I was removing all of the default processing just to prove that nothing unexpected was affecting the query string. In my real-world use case I've only replaced the stop word filter. Thanks for the searchPipeline tip; I completely missed in the docs and examples that I need to remove it from both pipelines.

olivernn commented 6 years ago

Note: I'm running an older version of Lunr (0.7.0)

What is blocking you from upgrading to Lunr 2.x?

lewisnyman commented 6 years ago

We're using middleman-search, which helpfully prebuilds the index but is not actively maintained: manastech/middleman-search#29