olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.91k stars 546 forks source link

Support term presence in queries #332

Closed olivernn closed 6 years ago

olivernn commented 6 years ago

Adds support for modifying the required presence of a given term in matching documents.

Currently, when searching with multiple terms, each term is optional in the search results. That is, a search for "foo bar" will return results with either "foo" or "bar" or both "foo" and "bar", effectively terms are combined with a logical OR.

This change adds support for marking terms as either required or prohibited. A required term must appear in a document for that document to be returned in the search results. A prohibited term must not appear in a document for that document to be returned.

A term's presence is indicated by either a prefixed "+" (plus) for required terms, and a "-" (minus) for prohibited term when using lunr.Index#search. Constructing queries programatically using lunr.Index#query also support presence using the presence option with a value the lunr.Query.presence enum.

For example, to perform a search for documents that must contain "foo" and optionally contain "bar" use the following search string: +foo bar.

A simple, editable, example should help make this clear.

There is one open question, when a search contains only prohibited terms currently no documents are returned. Technically this should return all documents in the index, but I'm not sure that is too useful. Is the current behaviour correct, i.e. just no search results, or should an InvalidQuery exception be thrown?

An alpha release is available on npm, lunr@2.2.0-alpha.1

drzraf commented 6 years ago

Thank you for this nice feature!

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

Adding another suggestion:. I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

Ex:

A: 8 matches
B: 3 matches included A + 2 distinct matches
C: 0 matches

search "A or B" : 10
search "A and B" : 2
search "C": 0
search "A and C": 8   # instead of 0: "C" is omitted

Why? When using autocompletion, it makes sense to have "strict" AND-search. But without autocompletion, it's often annoying if a non-existing term(s) drop all the results instead of just refining them, like "+restaurant +pizza +vegan +sunday"

I'm pretty sure such a search-mode exist in full-search engines, I'd naively call that "ignore AND-ed terms that don't provide any result" In such a mode, providing the result-count for each term would let the programmer react to the situation (ex: "no results included "sunday", showing results for the whole week instead")

olivernn commented 6 years ago

Thanks for the feedback, some great suggestions...

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

It depends on what you mean by 'configurable'... If Lunr threw an InvalidQuery exception then at least the consumer would be able to handle that in whichever way they want, if you squint hard that is sort of configurable :trollface:

I'm pretty sure the current behaviour of returning nothing is not right, I just sidestepped the issue when implementing this. I'd like to see what properly implementing this case looks like. I think it should perform the search, i.e. if all documents do not have the term then return all documents. More interesting is when some documents do have the term, that could easily be a valid use case.

I'm guessing though that you had something more concrete in mind when you mentioned configuring the behaviour. I'm going to pass on this for now, but I'm not ruling it out in the future.

I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

I looked (though not very hard) for this behaviour in Lucene and couldn't find anything, do you have any examples (for any search engine/library)?

It should be possible to implement this in application code though right? If a query returns no results, drop required terms and try again. Possibly not the most efficient, or simple solution I guess.

Rather than implement this directly in Lunr I think Lunr could provide the relevant information (matches per term) so that applications can handle this cases however they like. Probably isn't going to make it into 2.2.x though. It would almost definitely require a major version bump as lunr.Index#query just returns an array of matches currently.

drzraf commented 6 years ago

I don't have examples and I surely don't want such a discussion about exotic feature to block AND-search :)

It should be possible to implement this in application code though right?

yes (but with a performance impact because it could not reuse AND search capabilities). In https://github.com/olivernn/lunr.js/issues/264#issuecomment-346478869 I implement the behavior doing one search per term and then doing intersection of result sets, and then looking at terms that returned none from the intersection.

olivernn commented 6 years ago

I've hacked together an implementation where by searches containing only prohibited terms work as expected. Specifically, if the document does not contain that term it gets returned, even if that means all documents.

The last sticking point is how these results are scored, and therefore sorted. The way I've got it implemented now would mean all results have a score of NaN. This could be the correct result but it feels wrong. I think a score of 1 would be more intuitive. NaN would mean that the score is unrepresentable, where 1 would mean a perfect match.

Fixing the score to be 1 is trivial, but requires even more special casing for these kind of queries, which is something I'd like to avoid.

olivernn commented 6 years ago

Finally got round to working on this again.

Totally negated queries are now supported, each document that does not contain that term will be in the results, the score should be zero for each of them.

An alpha release is available on npm, lunr@2.2.0-alpha.2

I think that covers everything on this feature, if there are no objections I aim to get a release with this and a couple of other small features in the next few days.