rluiten / elm-text-search

Full text index engine in Elm language inspired by lunr.js.
http://package.elm-lang.org/packages/rluiten/elm-text-search/latest
BSD 3-Clause "New" or "Revised" License
42 stars 5 forks source link

wildcards #6

Open jschoch opened 8 years ago

jschoch commented 8 years ago

Wondering if you can do wild cards like llo or *ll

Also, can you use AND | OR

p.s. Thanks for a great library!

rluiten commented 8 years ago

There is no support at the moment of AND and OR logic as you are thinking of it I suspect.

It finds matches with all documents that contain any term you are searching for and orders the documents by how close the document matches your search terms with a score. It is doing an "or" for you and the more words that match ("and" like behavior) the higher the search match score is. So more terms matching in a document will return that document earlier in the list of matched documents.

Currently there is not a "not containing" term feature, it could be added though I have no plans to in the short term as I suspect it might complicate things quite a bit.

Note search term is loose, as it tokenises words and then stems words down to core words before indexing and searching. The site for lunr.js http://lunrjs.com/ has a decent explanation of tokenisation and stemming at bottom, it also mentions stop words which are discarded from the document search/match process.

The way it searches for your search terms might not be quite how you expect it to be. As an example of search behavior if you search for a short word "ban", and the collection of all words indexed in all your documents contains the words "ban", "banana", "bandstand", "banter". The search process will actually find documents that contain those extra 3 words that start with "ban" as well as your starting word "ban".

Your welcome, nice to hear people are using it.

jschoch commented 8 years ago

thanks!

I suppose in the "banana" case I wanted to search for "_nana" or better "_nan*" and match banana. guessing from your description this just requires reversing the indexing which it sounds like is now only right to left.

rluiten commented 8 years ago

At the moment its prefix matching, which is common as far as I know in word indexed search models. I do believe it might be possible to add an inverted index but then you would have a prefix and suffix index matches and scoring them relative to each other gets complex and I suspect and the score may not appear to logical to users. With the current prefix model users get a feel fairly quickly and for non technical users I've found prefix matching much less surprising than sub string even in non word index models.

SamGerber-zz commented 7 years ago

Hello! Thanks for building and maintaining this package, and for your detailed responses above. I have a follow up question, as I'm having a bit of difficulty with an implementation we're working on.

Here's how I understand things to work (might be totally wrong):

document: "banned" token list: "ban" query: original: "bann" -> stemmed: "bann" -> expanded: "bann"

To stick with the "ban" example, if my document contained the word "banned" and the search term was "bann", there won't be a match. I think "banned" will get stemmed to "ban", but "bann" will not. Expansion won't help, because we won't have the token "banned", as it's been stemmed.

Is the above accurate, or have I misunderstood something?

If that's the case, it makes search-as-you-type functionality a bit jarring, as the document would be returned for each of these queries: "b", "ba", and "ban". Then it would not match for "bann" and "banne", before matching again on "banned".

Is there a way for me to configure my search differently so that "bann" and "banne" would also match "banned", or is this limitation of stemming that cannot be worked around without wildcards?

rluiten commented 7 years ago

I am not an expert on full text search indexes having implemented this one and done a bunch of reading around the area at the time, but yes I believe your scenario is how it works.

In case you have not run into it, this is the site of the creator of the porter stemmer. https://tartarus.org/martin/PorterStemmer/ QUOTE: "It is often taken to be a crude error that a stemming algorithm does not leave a real word after removing the stem. But the purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form. "

You can't take an english stemmer like porter stemmer used here and get its behavior to work well for you in another language.

I do believe your expansion example is how it will work, due specifically to how the word is stemmed and then expanded as it is entered. I do believe you would have to change the core match engine used to get more general prefix match (wild cards after prefx). This woud then increase the search result set and reduce its multi dimensional word matching effectiveness.

Just from personal experience I often find full text indexes annoying to use in certain cases and I think your case is probably a reasonable example of a corner case that can be jarring. I also find that using one often involves typing a word till you see some matches and if they are in the direction you want to go then move onto the next word do not complete the word more than you need it to.

SamGerber-zz commented 7 years ago

Thank you for your response and the resources you shared.

Just to circle back around, by passing an empty list for the transformFactories (bypassing the stemmer), we got the behavior we wanted (all prefixes of "banned" match "banned"). It does increase the size of the index, but in our case, it's a small index and the increase is small, so it's worth it.

rluiten commented 7 years ago

Cool had not considered that, you do lose the value of the stemmer that knows how to make equivalent terms with different endings find the same things, but sounds like in your case the trade is worth it.