Popularity scoring of suggest terms?

abstractvector commented 4 years ago

First, thank you for creating this library - I, too, was disappointed in the lack of a lightweight alternative to Elasticsearch. Our use case is a search feature for our website - we're talking hundreds of blog posts, not PB of data!

I've searched but have been unable to find if it's possible to sort the SUGGEST results by popularity (i.e. the number of objects containing that term). It seems all the search terms are alphabetical.

Is such a capability existing, or if not, is it planned? I did take a look at the source code and didn't see an obvious way to incorporate it, but I have no knowledge of Rust so may be missing something obvious.

My proposed workaround in the absence of such a feature is to run the SUGGEST and then a QUERY on each of the suggested terms, counting the results for each. Given the small data sets and the speed of Sonic, I think this is viable for our use case, but open to other ideas too!

valeriansaliou commented 4 years ago

Hello there!

Thanks, I'm super-glad Sonic is useful to a wide range of uses. I guess a lot of people needed the "Redis" of search, as we did at Crisp :)

On your question, unfortunately it's not possible and not planned. For optimization reasons (space on disk, mostly), Sonic stores all words from your search index in an FST, which is a graph of some kind; that maps words character-by-character. For example, "tree" would map too "t => r => e => e" as 4 connected nodes — inserting the word "tea" would only append "e => a" to the graph, as a child graph of the existing "t" root node. This FST graph sole purpose is to correct search typos, the SUGGEST command is a nice "side-effect" of the graph being able to suggest words by start-of-word. Alphabetical ordering of SUGGEST is normal, as in the graph child nodes are alphabetically sorted. Sonic is doing minimal work above that on purpose.

On your work-around, I think I would have done the same. As long as you limit the SUGGEST results, you won't have that many terms to COUNT on. Note that the COUNT command is O(1) AFAIK, so you're not at risk of degraded performances on a larger index over time.

abstractvector commented 4 years ago

Thanks for getting back to me - appreciate it! I have no working knowledge of FST, but your explanation makes sense.

I'm going to go ahead and implement this in our application layer (thanks for validating the approach for me), but I wonder if there might be a possible future opportunity to incorporate this logic into the Sonic protocol, even if not the underlying FST data storage. For instance, offering a parameter on the SUGGEST command that will kick-off count up the instances. This would be transparent to the user, save reimplementation in the application layer, and save on a few network round-trips.

Not a priority idea by any means given it's simply a formalization of the workaround, but thought I'd mention it here for posterity's sake.

valeriansaliou commented 4 years ago

That makes sense. Problem on such a "hot-fix" in the code, is that this could cause huge performance degradation on SUGGESTs that have like 1000s results, as the server would have to order them by COUNT first (so, for N results; N internal counts), and then perform protocol paging to limit eg. to the 10 first items.

That would mean we'd go from O(1) to O(N) in terms of complexity for the SUGGEST command, which is no good.

The proper way to do that would be to store some count meta directly in the FST and change the graph storage format, so that we have a natural ordering by count; which would require all running Sonic V1 instances to upgrade their DB.

valeriansaliou commented 4 years ago

(note that we do have such huge SUGGESTs at Crisp; we have FSTs weighting like 20MB on disk for some of our customers, as they have a very large terms database due to hosting millions of chat messages with us).

valeriansaliou / sonic

Popularity scoring of suggest terms? #192