Figure out how to make wildcards do the right thing

hoelzro commented 6 years ago

@diego898 brought up wildcards in issue #5, but the way they work is a little unintuitive for users because of how lunr works with regards to stemming. For example, looking for title:format*ing yields no results, because formatting is stemmed down to format.

@olivernn If you have a minute, could you chime in on how I could use lunr to make wildcards behave more intuitively for users? I suppose I could disable stemming, but that would interfere with other features I'm trying to offer.

olivernn commented 6 years ago

Stemming, or at least having stemming enabled by default, in Lunr has lead to many questions like this. I think it makes sense to take a step back and understand why there is stemming in the first place. Stemming really serves two roles; reducing the size of an index and increasing recall.

In most search engines reducing the size of an index is less of a concern. For Lunr the size of the index is a concern since the entire index must be sent over the wire before any search can take place. That said I don't know how much space is being saved by stemming. Tokens (or words) are stored twice in a Lunr index, in the inverted index that maps tokens to documents and also in the token set which is used to perform lookups based on queries, including matching wildcards.

lunr.TokenSet uses a data structure that compresses both prefixes and suffixes, that means that if you have N documents ending in say "ing" then those three characters in sequence should only be stored once. The inverted index on the other hand is basically a map from token to documents.

Perhaps there is a case for storing the stemmed token in the inverted index, but not stemming the token in the token set. This should remove some of the unexpected behaviour when searching for "format*ing" for example. Once the search term has been expanded using the token set it the results would need to be stemmed to perform the lookup against the inverted index, but maybe this isn't terrible.

Any change would also need to keep in mind the recall use case for stemming, e.g. a search for "formatting" should probably return results for documents containing "formatted" for example. As such the token set would still need to contain the stemmed form of a token.

I would need to look at both the index size and performance hit of this hybrid approach, but perhaps it is the right approach?

hoelzro commented 6 years ago

@olivernn Thanks for weighing in!

So I'm not really concerned about the size of the index; the index is generated on the fly or loaded from local storage. Obviously I don't want the index size to get too out of hand, but it's not a huge deal if it gets a little bigger. I'm mostly focused on recall - your example of searching for "formatting" matching "formatted" perfectly illustrates why I used lunr to write this plugin in the first place!

Your idea of storing the unstemmed tokens alongside stemmed ones in the token sounds promising; do you think that's something I would be able to implement as "addon" functionality, like I did with lunr-mutable-indexes? It seems the token set doesn't explicitly make it into the serialized version of the index - so if the unstemmed version of a token were to be preserved in the token set, it would need to be serialized separately, right?

hoelzro commented 6 years ago

Something that dawned on me while taking a walk - users such as @diego898 might not want format*ing to match a document with "format" in the title.

olivernn commented 6 years ago

It seems the token set doesn't explicitly make it into the serialized version of the index

Yeah, I forgot about this, it makes what I suggested about having extra tokens in the token set a non-starter at this point. It would need a change in the serialisation format which would require a major version bump.

do you think that's something I would be able to implement as "addon" functionality

Yes! You could have a slightly different stemmer that would also preserve the original term, i.e. return both the stemmed and unstemmed token from the stemmer. This comes with the caveat that the index size will increase. You'd have to do some testing to see how much it increases and whether it is feasible or not though.

One possibility that I've not put too much thought into (so take it with a pinch of salt) would be to 'extend' the serialisation format. I.e. include the set of tokens you want to end up in the token set?

hoelzro commented 6 years ago

Thanks for your input @olivernn - I really appreciate it! I'll do some testing and iteration on this idea and see how it goes!

hoelzro commented 5 years ago

Ok, it took me two months, but I finally got around to playing with this (having kids is hard!). The trick to store unstemmed tokens in the token set didn't quite work out as planned; the query clause for sett*le is configured not to use the search pipeline, and I think things would get weird if I forced it to. So I opted for the simpler "just put the unstemmed version in the inverted index, too" trick - it takes roughly 33% longer to index my wiki, and the resulting index is about 50% bigger, but I think this might be worth it to some users. I think in the interest of having users like @diego898 try this out, I'll put this behind a configuration setting.

hoelzro commented 5 years ago

Ok, this is finally "done" - I'm going to need users like @diego898 to experiment with this and really kick the tires, but I have a good feeling about it.

hoelzro / tw-full-text-search

Figure out how to make wildcards do the right thing #9