olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.89k stars 545 forks source link

Feature request: add support for multiple wildcards #342

Open giuliac89 opened 6 years ago

giuliac89 commented 6 years ago

Hi Oliver,

lunr.js supports only the wildcard *. Do you think it might be useful to add support for multiple wildcards, or make them configurable by the user?

olivernn commented 6 years ago

Do you mean the actual character used to indicate a wildcard? Why do you need to change the character that represents a wildcard?

Lunr does support terms with multiple wildcards, e.g. *f*o*o* is a valid (though probably slow) query, you can try it out on the demo.

giuliac89 commented 6 years ago

Not change but add more wildcards. For example the wildcard character ?, which represents a single character, so when I search for alon? lunr returns both along and alone

olivernn commented 6 years ago

I see what you mean. Yeah, currently the "" wildcard matches zero or more characters, so a search for `alon` would match "along" and "alone" but also "alongside" which may or may not be what you want.

I'm not against adding different kinds of wildcard. There would need to be changes to the query lexer/parser as well as the lunr.TokenSet for matching them...

giuliac89 commented 6 years ago

Yeah, the "*" wildcard is not enough for my purpouse. I need the '?' wildcard that match exactly one character, so a search for alone? must match only "along" and "alone", but not "alongside".

olivernn commented 6 years ago

Leaving some notes that might help anyone (including myself) who might want to tackle this...

The bulk of the work would be in getting lunr.TokenSet to understand the "?" wildcard. There is already an implementation that creates a token set for strings with edit distance (lunr.TokenSet.fromFuzzyString). There is a reasonable amount of similarity here, since a "?" is just an edit distance of one but at a particular location. The implementation of lunr.TokenSet.fromFuzzy string is long but hopefully quite clear, there are also tests which should make things a little clearer too.

giuliac89 commented 6 years ago

I'll try this tests, but probably I think that it might work. Thanks!

aguynamedben commented 6 years ago

A case I ran into: providing a type-ahead search for contact names.

Using https://olivernn.github.io/moonwalkers/ I want to be able to search for +charles* +duke* (logical AND with a wildcard on each word, so the user could just type ch du and find Charles Duke.

This works: +charles +duke* But it seems like multiple *'s aren't supported (i.e. +charles* +duke* fails to find).

olivernn commented 6 years ago

@aguynamedben I think the specific issue you mention in the demo is due to stemming. There is no "charles" in the index, it is stemmed to "charl".

When a search string contains a term with a wildcard, stemming is disabled. This is because its impossible to know how to correctly stem a partial word. Since there is no term with prefix "charles" in the index it will return no results.

My recommendation for performing type-ahead search is the following:

idx.query(function (q) {
  q.term(lunr.tokenizer(queryString), { boost: 100, usePipeline: true })
  q.term(lunr.tokenizer(queryString), { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })
  q.term(lunr.tokenizer(queryString), { boost: 1, editDistance: 1 })
})

That is, search for exact matches with a high boost, then prefix matches with a medium boost (bypassing the pipeline and stemmer) and finally doing a fuzzy search with a low boost.

You could probably improve performance by conditionally applying the fuzzy search only if the word is over a certain number of characters.

aguynamedben commented 6 years ago

@olivernn Thanks for that feedback on the best way to do searches for type-ahead. I'll give it a go.

For fun, here are some interesting things I came across recently:

I think for the kind of search I want to provide (fast results for word fragments, i.e. "flo jpg" finds "flower.jpg"), I should implement trigram indexing like in the Google Code search paper. It looks like there is a stale PR for this on lunr.js.

Does lunr.js plan to merge plugins as opt-in features within lunr.js itself? Being new to this category, it's really useful how postgres and sqlite have opt-in extensions that aren't included out of the box but are easy to enable.

Thanks for the time you spend making lunr.js available to everybody, it's an awesome project.

olivernn commented 6 years ago

Lunr does have support for plugins. In some ways they are quite limited, and in others very powerful. A plugin is just a function that is executed in the context of the index builder, so Lunr doesn't really provide much support. On the other hand, this is JavaScript, so you're pretty much free to monkey freedom patch whatever you like. The internals aren't guaranteed to not change though.

It looks like you are trying to achieve something like fuzzy file path finding? If not then the next few sentences are going to be quite a tangent! Lunr isn't exactly designed for this, though I can't see any reason it won't work. It will all depend on how you index the paths I guess.

Lets say you have a path my_pictures/CountrySide/flower.jpg you probably want to index the following terms ['my', 'pictures', 'country', 'side', 'flower', 'jpg']. Then searching with a combination of fuzzy matching and trailing wildcards might produce reasonable results.

As an aside, this is an area I'm quite interested in, I implemented something in Rust to try out a few ideas in this area.

Thanks for the links, I've added them to the long backlog of interesting things to read!