olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Exact phrase matching? #62

Open dannydan412 opened 10 years ago

dannydan412 commented 10 years ago

Hi!

Does lunr support exact phrase matching (i.e. using quotation marks in a search)? It doesn't seem like it from my initial research. I'd like to try to add this feature to the project. Could someone please give me some pointers on how to implement this?

olivernn commented 10 years ago

At the moment lunr tries to be "clever" by automatically adding a wildcard at the end of your search terms, e.g. a search for "foo" becomes "foo*".

I'd like to move away from this for exactly this kind of issue: it is currently not possible to do an exact match search.

I have some plans to change this, so hold off on implementing anything for now. I need to think through how to implement these changes. I'll be sure to keep you in the loop though, and would be very grateful for any help in making these changes.

dannydan412 commented 10 years ago

Hi Oliver, thanks for getting back to me so quickly! Just to clarify: the current problem with lunr.js is that if I search for a phrase such as '"Hello World"' it would also return documents that contain "Hello Great World". I'm working on a project with a deadline and I was wondering if you have any ideas for a "quick and dirty" solution that I could implement today. Of course I would not commit the code to GitHub.

One thought I had was to rebuild the index when phrases are used in the query. This would affect the behavior of the tokenizer to consider the quotes. So if something is in quotes it would be considered a single token. What are your thoughts on this approach?

olivernn commented 10 years ago

It depends. If you only want exact matches, you can change this code https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301 so that it does not do the expanding. Changing it to get would give you just the token, not any others that are an extension of this term.
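To make the difference between expanding and an exact lookup concrete, here is a minimal sketch. It is a stand-in and not lunr's actual token store: the `terms` list and the `expand` and `get` functions below are hypothetical and only model the behaviour described in this comment.

```javascript
// Hypothetical stand-in for the terms held by an index; not lunr's real token store.
const terms = ["foo", "food", "fool", "bar"];

// Prefix expansion: returns the term plus every indexed term that extends it.
function expand(term) {
  return terms.filter(function (t) { return t.indexOf(term) === 0; });
}

// Exact lookup: returns only the term itself, if it is indexed.
function get(term) {
  return terms.filter(function (t) { return t === term; });
}

console.log(expand("foo")); // ["foo", "food", "fool"]
console.log(get("foo"));    // ["foo"]
```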

Another potential solution is to create n-grams. Basically, if you had the text "The quick brown fox" you would treat multiple words together as a 'token'. For a bi-gram, n = 2, you would end up with the tokens "The quick", "quick brown", "brown fox" etc. You could extend this to larger values of n, depending on the kind of results you get back. Take a look at adding a processor to the pipeline to do this.

Another idea (not fully thought through) would be to use several instances of lunr together. Maybe one with the n-gram indexes and another with the regular index or even another with the token expanding etc.

Sorry I can't be of much help here. The changes I've been thinking about alter the way the indexing works in a fairly substantial way and I need to fully understand the implications of it, hence it is taking a little while! Personally I wouldn't worry too much about posting your "quick and dirty" code to GitHub. Create a fork of this project and do whatever you do there, let me know how you get along!
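As a rough illustration of the n-gram idea, the sketch below builds word n-grams from a list of tokens in plain JavaScript. The `ngrams` helper is a hypothetical name and is independent of lunr.

```javascript
// Build word n-grams: every run of n consecutive tokens joined by a space.
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

const words = ["The", "quick", "brown", "fox"];
console.log(ngrams(words, 2)); // ["The quick", "quick brown", "brown fox"]
console.log(ngrams(words, 3)); // ["The quick brown", "quick brown fox"]
```

Larger values of n make matches stricter but also grow the number of distinct tokens that have to be indexed.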

P.S. If you're interested in this kind of thing I can recommend taking a look at - http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf, it might give you a few ideas.

dannydan412 commented 10 years ago

The n-grams approach sounds like an interesting solution. Isn't using the pipeline too late in this case, though? It operates on tokens, so anything I add to the pipeline would operate on single words. Or am I missing something?

olivernn commented 10 years ago

A pipeline function will get called with three arguments, a token, the index of that token, and all the tokens, so you should be able to do what you want with a pipeline function.

http://lunrjs.com/docs/#Pipeline

dannydan412 commented 10 years ago

Isn't it too late to add tokens to the list when the pipeline function gets called? The parent method won't iterate through these newly added objects and so they never get copied to the "final" list of tokens.

olivernn commented 10 years ago

Whatever you return from the pipeline function is used as the input to the next.

var pipeline = new lunr.Pipeline
var bigram = function (token, idx, tokens) {
  return token + " " + tokens[idx + 1]
}

// the function has to be added to the pipeline before it is run
pipeline.add(bigram)

pipeline.run(["The", "quick", "brown", "fox"]) // ["The quick", "quick brown", "brown fox", "fox undefined"]

You would probably have to do something about the undefined.
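One way to deal with the undefined is to return nothing for the last token. The sketch below assumes that a pipeline skips tokens for which a function returns undefined. The `run` helper is a simplified, hypothetical stand-in for a pipeline run, included only so the example is self-contained.

```javascript
// Bigram function that yields no token for the last word
// instead of producing "fox undefined".
function bigram(token, idx, tokens) {
  if (idx + 1 >= tokens.length) return undefined;
  return token + " " + tokens[idx + 1];
}

// Simplified stand-in for a pipeline run (assumption: undefined results are skipped).
function run(fns, tokens) {
  const out = [];
  tokens.forEach(function (token, idx) {
    let result = token;
    for (const fn of fns) {
      result = fn(result, idx, tokens);
      if (result === undefined) return; // drop this token
    }
    out.push(result);
  });
  return out;
}

console.log(run([bigram], ["The", "quick", "brown", "fox"]));
// ["The quick", "quick brown", "brown fox"]
```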

So if you have this bigram function at the end of your pipeline it will spit out the bigrams, which will then be indexed and searchable. Unless I'm missing something!

idx.pipeline.add(bigram)

dannydan412 commented 10 years ago

The problem is that in this case I want the index to contain: "Quick", "Brown", "Fox", "Quick Brown", "Brown Fox", and the pipeline can only return a single token.

olivernn commented 10 years ago

Ah, yes, I see now, sorry for the confusion.

You would need two separate instances of lunr in that case, and your code would have to do the search twice and combine the results.
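A minimal sketch of the "combine the results" step, assuming each search returns an array of objects with a `ref` and a `score`. The `combineResults` helper and the sample data are hypothetical, and summing scores is just one possible way to merge; scores from two different indexes are not necessarily comparable.

```javascript
// Merge two result lists by document ref, summing the scores of
// documents that appear in both, then sort by score, highest first.
function combineResults(a, b) {
  const byRef = new Map();
  a.concat(b).forEach(function (r) {
    byRef.set(r.ref, (byRef.get(r.ref) || 0) + r.score);
  });
  return Array.from(byRef, function (entry) {
    return { ref: entry[0], score: entry[1] };
  }).sort(function (x, y) { return y.score - x.score; });
}

// Made-up results standing in for a word index and a bigram index.
const wordResults = [{ ref: "1", score: 0.5 }, { ref: "2", score: 0.25 }];
const bigramResults = [{ ref: "2", score: 1 }];

console.log(combineResults(wordResults, bigramResults));
// [{ ref: "2", score: 1.25 }, { ref: "1", score: 0.5 }]
```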

Sorry I haven't been able to help you much with this problem!

dannydan412 commented 10 years ago

You've been extremely helpful! I was able to achieve a similar result by modifying the tokenizer.
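For readers following along, here is a rough sketch of what a tokenizer that emits both single words and bigrams could look like. This is not the code from the thread or the gist; `tokenizeWithBigrams` is a hypothetical name, and wiring it into a particular lunr version is left out.

```javascript
// Split text on whitespace and return the single words followed by
// every pair of adjacent words, so both can be indexed.
function tokenizeWithBigrams(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const bigrams = [];
  for (let i = 0; i + 1 < words.length; i++) {
    bigrams.push(words[i] + " " + words[i + 1]);
  }
  return words.concat(bigrams);
}

console.log(tokenizeWithBigrams("Quick Brown Fox"));
// ["Quick", "Brown", "Fox", "Quick Brown", "Brown Fox"]
```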

olivernn commented 10 years ago

Cool, out of interest, what modifications did you make?

dannydan412 commented 10 years ago

Let me clean up the code a little bit and I'll post it here. Here's a snippet from the quick and dirty version: https://gist.github.com/dannydan412/8564158

dannydan412 commented 10 years ago

Hey Oliver,

Have you considered adding fuzzy matching support to lunr?

olivernn commented 7 years ago

The latest version (2.0.x) of Lunr supports exact phrase matching and fuzzy matching, more info in the guides.

wdiego commented 7 years ago

Hey @olivernn, I couldn't find the "exact phrase matching" support that you mentioned in the Lunr guides. Can you show me where I can find this in the guide?

olivernn commented 7 years ago

@wdiego ah, now that I re-read this issue, I see that I was confused. I must've thought this issue was about exact term matching, which is now supported. Phrase matching, i.e. "foo bar", is not currently supported, sorry to mislead.

928PJY commented 7 years ago

Hi @olivernn So any plan to support Phrase matching?

olivernn commented 7 years ago

@928PJY I want to support it, I just don't know how to implement it in an efficient way yet. I'll re-open this issue.

928PJY commented 7 years ago

OK! Thank you @olivernn, if I have any idea, I will let you know!

jacksongs commented 6 years ago

Hello @olivernn I've been trying to use your code from January 2014 above to offer two-word exact phrase matching, but I can't seem to reconcile it with the docs. Wondering if there has been a change since v2. Can you offer any advice?

My intention is to create an index with both single word and two word tokens.

bengry commented 5 years ago

@olivernn I'm not sure if this issue covers my use-case, but I thought of asking here before creating a new issue - does lunr support exact phrase matching at the moment, or can I add it using a plugin (.use()) externally? So far I wasn't able to get it working. To clarify, what I want is that for the following list of texts:

[
  "foo bar",
  "bar foo",
  "foo bar baz",
  "bar que foo",
]

searching for "foo bar" will only return index 0 and 2. Index 1 doesn't match since the order is wrong (I searched for "foo bar" and it has "bar foo") and index 3 doesn't match since it has the word que in between.
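The expected behaviour for this list can be written down as a small plain JavaScript check, independent of lunr. `containsPhrase` is a hypothetical helper that compares whole words in order, with no stemming or index involved.

```javascript
const texts = [
  "foo bar",
  "bar foo",
  "foo bar baz",
  "bar que foo",
];

// True if the words of `phrase` appear in `text` consecutively and in order.
function containsPhrase(text, phrase) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const wanted = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  if (wanted.length === 0) return false;
  for (let i = 0; i + wanted.length <= words.length; i++) {
    if (wanted.every(function (w, j) { return words[i + j] === w; })) return true;
  }
  return false;
}

// Collect the indexes of the texts that contain the phrase.
const matches = texts
  .map(function (t, i) { return containsPhrase(t, "foo bar") ? i : -1; })
  .filter(function (i) { return i !== -1; });

console.log(matches); // [0, 2]
```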

I'm using the latest version of lunr as of now (2.3.8).

biosocket commented 4 years ago

Couldn't exact-phrase matching be achieved using the position meta data?
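A sketch of how that could work, assuming the positions for each matched term in one field of one document are available as `[start, length]` pairs, and that adjacent words are separated by exactly one character. The `hasPhrase` helper and the sample position data are hypothetical.

```javascript
// positions: object mapping each term to a list of [start, length] pairs.
// Returns true if the phrase terms occur back to back, each one starting
// one character after the previous term ends.
function hasPhrase(positions, phraseTerms) {
  const first = positions[phraseTerms[0]] || [];
  return first.some(function (pos) {
    let end = pos[0] + pos[1];
    for (let i = 1; i < phraseTerms.length; i++) {
      const next = (positions[phraseTerms[i]] || []).find(function (p) {
        return p[0] === end + 1;
      });
      if (!next) return false;
      end = next[0] + next[1];
    }
    return true;
  });
}

// Positions as they would look for "foo bar baz" and "bar que foo".
const fooBarBaz = { foo: [[0, 3]], bar: [[4, 3]], baz: [[8, 3]] };
const barQueFoo = { bar: [[0, 3]], que: [[4, 3]], foo: [[8, 3]] };

console.log(hasPhrase(fooBarBaz, ["foo", "bar"])); // true
console.log(hasPhrase(barQueFoo, ["foo", "bar"])); // false
```

This would run as a filter over the results of a normal search, so the index is still used to narrow down candidate documents.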

georg-d commented 2 years ago

For me as a user of a site using lunr / antora, the missing "exact phrase search" causes massive redundant manual search effort. For example, configuration file names contain sub-terms / parts that are fairly common component names, so searching foo bar.ssl.bar produces massive amounts of results for foo and bar; all these results need to be checked manually to see whether they really contain the search phrase. Finally, I find out no document contains foo bar.ssl.bat and all results only appear due to the automatic "convenience" of not doing an "inconvenient" exact phrase search but a "sub term search including stemming", so your good intention causes the opposite result 🙉☹ To make the effect tangible: if I had exact phrase search, a task would take 10 secs instead of the current 10 mins. Sadly, site: search in Google etc. also fails because that site has several versions of the same document and Google only searches one of them.

Seeing this issue is 8 years old: I'd already be happy with a very simple/naive approach, e.g. a toggle like surrounding the search phrase in "" to turn on an exact phrase search mode which does not use any index but crawls live over the plain text like find/grep. While this is technically slower than with an index, it's still finished within 1-1000 milliseconds and the user's task is completed much more quickly.
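A minimal sketch of such a toggle, assuming the site keeps the plain text of its documents available in memory. The `exactPhraseSearch` function and the document shape (`id`, `text`) are hypothetical and not part of lunr.

```javascript
// If the query is wrapped in double quotes, scan the plain text of every
// document for the phrase (case-insensitive substring match).
// Returns null for unquoted queries so the caller can fall back to the index.
function exactPhraseSearch(query, docs) {
  const quoted = /^"(.+)"$/.exec(query.trim());
  if (!quoted) return null;
  const phrase = quoted[1].toLowerCase();
  return docs
    .filter(function (d) { return d.text.toLowerCase().indexOf(phrase) !== -1; })
    .map(function (d) { return d.id; });
}

const docs = [
  { id: "a", text: "Hello World" },
  { id: "b", text: "Hello Great World" },
];

console.log(exactPhraseSearch('"Hello World"', docs)); // ["a"]
console.log(exactPhraseSearch("Hello World", docs));   // null
```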

Related, but not the same: Issue https://github.com/olivernn/lunr.js/issues/33 that less exact search terms produce higher scores than exact terms