olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Unexpected results when searching for DOJO in example #240

Closed lebsral closed 7 years ago

lebsral commented 7 years ago

In the example app http://lunrjs.com/example/

If I want to find 'dojo'

I type - d results - Many results including those with DOJO

I type - do results - 0 results - ie the unexpected behavior

I type - doj results - 2 results, both with DOJO in the title. as expected.

I type - dojo results - 2 results, both still with DOJO

I have found several other instances of this behavior with my own data. It doesn't just happen on the 2nd letter. This is just the simplest case to see it in.

Thanks.

olivernn commented 7 years ago

Not near a computer right now, but this is an issue that has come up before. It's down to the way lunr uses stemming and probably also the stop word filter.

"Do" is a stop word and therefore doesn't appear in the index; in addition, it will be stripped from any search terms.

You can turn off the stop word filter; search the issues for examples of how to do that.
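To illustrate the stop word explanation above, here is a minimal sketch of a stop word filter sitting in front of a prefix search. It is a toy model, not lunr's code: the STOP_WORDS list, the sample documents, and the tokenize, buildIndex and search helpers are all made up for the example. In lunr's 0.x releases the equivalent switch is usually written as idx.pipeline.remove(lunr.stopWordFilter), but treat that as an assumption to check against the version you run.

```javascript
// Toy model of a stop-word filter in front of a prefix search.
// This is NOT lunr's code; STOP_WORDS is a tiny hypothetical subset.
const STOP_WORDS = new Set(["a", "and", "do", "the"]);

// Lowercase, split on non-word characters, optionally drop stop words.
function tokenize(text, useStopWords) {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 0 && !(useStopWords && STOP_WORDS.has(t)));
}

// Build a map from each term to the set of document ids containing it.
function buildIndex(docs, useStopWords) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const term of tokenize(text, useStopWords)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  }
  return index;
}

// Return the sorted ids of documents that contain a term starting with
// any query token (a crude stand-in for search-as-you-type).
function search(index, query, useStopWords) {
  const hits = new Set();
  for (const token of tokenize(query, useStopWords)) {
    for (const [term, ids] of index) {
      if (term.startsWith(token)) ids.forEach((id) => hits.add(id));
    }
  }
  return [...hits].sort();
}

// Hypothetical sample data.
const docs = {
  a: "Getting started with DOJO",
  b: "DOJO widgets",
  c: "Data binding",
};

const filtered = buildIndex(docs, true);
console.log(search(filtered, "d", true));   // all three documents
console.log(search(filtered, "do", true));  // empty: "do" is stripped from the query
console.log(search(filtered, "doj", true)); // the two DOJO documents

const unfiltered = buildIndex(docs, false);
console.log(search(unfiltered, "do", false)); // the two DOJO documents
```

With the filter on, the query "do" is reduced to no tokens at all, so nothing can match; with the filter off, "do" behaves like any other prefix of "dojo".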

Replying by email to Lars Bell's comment of 17 Nov 2016, 20:37.

lebsral commented 7 years ago

Ok, I hear you about do being a stop word, but look at this example. Searching for "programatically" (yes I know that is spelled wrong, but there is exactly one result that spells it that way, so it is easy to see).

programa - 1 result as expected

programat - 1 result

programati - 0 results - not expected

programatic - 1 result

programatica - 0 results - not expected

programatical - 1 result

programaticall - 0 results - not expected

programatically - 1 result

Manually searching through the example_index.json, I find that the stemming changed "programatically" to "programat".

Then I go into example_data.json, change it to "programaticall" (i.e. drop the y), and rerun make example. Now the index entry has changed to "programatical".

And redoing the tests above leads to results at every point.

I think that all means that it is in fact the stemming that is causing the unexpected results for users.
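The pattern above can be reproduced with a small sketch. It assumes the query is stemmed and then matched as a prefix of the indexed stem, which is one plausible reading of the results in this thread, not a statement of lunr's exact internals. toyStem is a hypothetical suffix stripper that handles only three suffixes; it is not the Porter stemmer.

```javascript
// Toy suffix stripper standing in for a real stemmer. Assumption: this is
// NOT the Porter stemmer, it only knows the three suffixes listed here.
function toyStem(word) {
  for (const suffix of ["ically", "ical", "ic"]) {
    if (word.endsWith(suffix)) return word.slice(0, -suffix.length);
  }
  return word;
}

// A query matches when its stem is a prefix of the indexed word's stem.
function matches(indexedWord, query, stem) {
  return stem(indexedWord).startsWith(stem(query));
}

// Simulate typing the word one character at a time, starting at "programa".
const word = "programatically";
const report = {};
for (let n = "programa".length; n <= word.length; n++) {
  const query = word.slice(0, n);
  report[query] = matches(word, query, toyStem);
}
console.log(report);
```

The indexed stem is "programat". Partial queries such as "programati" or "programatica" do not end in a suffix the stemmer recognises, so they stay longer than the indexed stem and cannot be a prefix of it. Passing an identity function as stem makes every prefix match.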

What can be done to improve the situation?

olivernn commented 7 years ago

You can disable stemming, which will give better results when searching on every key press, as it should get rid of the behavior you are seeing.

It will cause the size of the index to increase and may change the kind of results you get when searching.
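A rough sketch of that trade-off, using a hypothetical toy stemmer (not lunr's Porter stemmer) and made-up sample text: without stemming, each word form becomes its own index term, and a query only finds the exact forms that were indexed. In lunr's 0.x releases, removing the stemmer is usually written as idx.pipeline.remove(lunr.stemmer); that call is an assumption to verify against your version.

```javascript
// Toy stemmer: strips a few English suffixes. Assumption: this is a
// stand-in for illustration only, not lunr's Porter stemmer.
function toyStem(word) {
  for (const suffix of ["ing", "ed", "es", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// No-op replacement used to model "stemming disabled".
const identity = (word) => word;

// Collect the distinct index terms a text produces under a given stemmer.
function distinctTerms(text, stem) {
  const terms = new Set();
  for (const token of text.toLowerCase().split(/\W+/)) {
    if (token) terms.add(stem(token));
  }
  return terms;
}

// Hypothetical sample text.
const text = "Search searches searched searching index indexes indexed indexing";

const stemmed = distinctTerms(text, toyStem);
const unstemmed = distinctTerms(text, identity);
console.log(stemmed.size, unstemmed.size); // 2 8

// Results change too: a document containing only "searched" is found by
// the query "searching" when both are stemmed, but not without stemming.
const docStemmed = distinctTerms("searched", toyStem);
const docUnstemmed = distinctTerms("searched", identity);
console.log(docStemmed.has(toyStem("searching")));    // true
console.log(docUnstemmed.has(identity("searching"))); // false
```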

Searching with full words, i.e. on submit, works better with the stemmer.

Again, you can look through the issues for examples of disabling the stemmer.

Replying by email to Lars Bell's comment of 17 Nov 2016, 21:19.