olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Searching for things like "--hard" or "--help" breaks the search/returns no results #479

Open MikeArsenault opened 3 years ago

MikeArsenault commented 3 years ago

There seems to be a problem regarding escaping multiple characters, in that the search does not seem to understand back to back escaped characters. For example, we know there are 6 results for --help in the handbook.

As expected, searching --help leads you to the infinite load issue, and the following console error (search):

https://d.pr/i/vDlyYe

We are wondering if this is by design and we just haven't determined the right escape format? Our version of lunr is 2.3.7.

osama-rizk commented 3 years ago

According to the docs, a + or - in a search determines the presence or absence of a term.

So if you call idx.search('+') or idx.search('++any_word'), it throws the error "expecting term or field, found nothing". Each + or - must be followed by a term.

olivernn commented 3 years ago

Do you have an example of the search string you are using? You mention that back to back escapes do not work, can you provide an example of how you are escaping back to back characters?

A backslash is used to escape characters that would otherwise have meaning in a query, so, for example, I would expect \-\-help to work.

If you can set up a minimal reproduction demonstrating the issue in something like jsfiddle (or similar), that'd be a great help.

gilisho commented 3 years ago

I'm experiencing issues with escaping as well. I have an example from the demo. Search for flight\-\-a and it won't find anything, although the string flight--a exists in article number 2.

c00kiemon5ter commented 2 years ago

Hello, I'm looking into the same issue, trying to escape a +. Escaping with \, as mentioned in the docs, does not seem to work. I think @gilisho's example demonstrates the issue well.

Instead of using Index.search I'm now trying to use Index.query. Using the index from @gilisho's example site directly, I am trying the following:

idx.search("flight")
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

idx.search("flight--a")
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

idx.search("flight\-\-a")
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

I think that's because the - and \ are removed by the tokenizer:

lunr.tokenizer("flight--a")
// Array (2) = $7
// 0 {str: "flight", metadata: {position: [0, 6], index: 0}, toString: function, update: function, clone: function}
// 1 {str: "a", metadata: {position: [8, 1], index: 1}, toString: function, update: function, clone: function}

lunr.tokenizer("flight\-\-a")
// Array (2) = $7
// 0 {str: "flight", metadata: {position: [0, 6], index: 0}, toString: function, update: function, clone: function}
// 1 {str: "a", metadata: {position: [8, 1], index: 1}, toString: function, update: function, clone: function}
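
The split shown above can be reproduced without lunr. This is a minimal sketch, assuming the default separator is the regular expression /[\s\-]+/ (what lunr.tokenizer.separator appears to be in 2.x; check your version). splitLikeTokenizer is a hypothetical helper for illustration, not part of lunr:

```javascript
// Assumption: mirrors lunr's default separator (whitespace and hyphens).
const assumedSeparator = /[\s\-]+/;

// Hypothetical helper, not part of lunr: split a string the way a
// separator-based tokenizer would, dropping empty pieces.
function splitLikeTokenizer(text, separator) {
  return text.split(separator).filter(piece => piece.length > 0);
}

const withHyphens = splitLikeTokenizer("flight--a", assumedSeparator);
const whitespaceOnly = splitLikeTokenizer("flight--a", /\s+/);

console.log(withHyphens);     // [ 'flight', 'a' ]
console.log(whitespaceOnly);  // [ 'flight--a' ]
```

With a whitespace-only separator the hyphens survive the split, which is roughly what customizing the tokenizer would aim for.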

Using the Index.query API:

idx.query(q => q.term(lunr.tokenizer("flight--a")))
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

idx.query(q => q.term(lunr.tokenizer("flight\-\-a")))
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

That was expected because the tokenizer removed the part we were interested in.

But, with the snippet below, I expected I would get back some results:

idx.query(q => q.term("flight--a"))
// []

To verify that the special meaning of - is not applied by the Index.query API, I ran:

idx.search("-")
// QueryParseError: expecting term or field, found nothing

idx.search("--")
// QueryParseError: expecting term or field, found 'PRESENCE'

idx.query(q => q.term(lunr.tokenizer("-")))
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

idx.query(q => q.term(lunr.tokenizer("--")))
// [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, …] (12)

Any hints on this, @olivernn?

user202729 commented 2 months ago

Essentially this is caused by the same issue as https://github.com/olivernn/lunr.js/issues/481 and https://github.com/olivernn/lunr.js/issues/245: you need to remove the trimmer from the pipeline and/or customize the tokenizer.

By using query(q => q.term(…)), you achieved the second point. To achieve the first point you need to modify the indexer.

var index = lunr(function () {
    this.pipeline.reset();  // NOTE 1. reset the pipeline
    this.ref("ref");
    this.field("title");
    // NOTE 2. put the field in an array so the tokenizer doesn't try to split it; each array element becomes one token
    this.add({ref: "a", title: ["--"]});
});

Then:

index.query(q=>q.term("--"))
index.search(String.raw `\-\-`)
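
A note on why String.raw matters here: in an ordinary JavaScript string literal, \- is an escape sequence that evaluates to a plain -, so the backslashes never reach lunr. The sketch below uses only standard JavaScript (no lunr calls) to show the difference:

```javascript
// In a normal string literal the backslash before "-" is consumed by JavaScript.
const plain = "flight\-\-a";

// Doubling the backslash, or using String.raw, keeps it in the string.
const doubled = "flight\\-\\-a";
const raw = String.raw`flight\-\-a`;

console.log(plain === "flight--a"); // true
console.log(plain.length);          // 9
console.log(raw === doubled);       // true
console.log(raw.length);            // 11
```

So a call written as idx.search("flight\-\-a") in source code sends the same query string as idx.search("flight--a"). Text typed into a search box is different, since the input value keeps its backslashes.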