hunt-framework / hunt

A flexible, lightweight search platform
59 stars 10 forks source link

Structure of result for completions #82

Open UweSchmidt opened 10 years ago

UweSchmidt commented 10 years ago

In discussions with Sebastian about the completions, we both had the impression the existing structure of the completion is a bit messy to be processed by a client. A nested list of words [(Word,[Word])] is in most cases a bit of an overkill, usually only the last word of a list of words (a phrase) query is of interest, so a [Word] or a [(Word,Score)] for completion of the last word would be sufficient (This only would concern the result, when the query is evaluated all the other words would all be taken into account).

When working on the scoring algorithms for the completions, I've one rather simple algorithm (processing arbitrary queries), that computes a scored list of words suited for completions. In the local version, I've added a new CmdResult variant ResSuggestions that delivers a scored (and sorted by score) list of words. The ResCompletion is still there, but not used.

Is this simplified version o.k.? Is this a version sufficient with the 1. release or do we need the more advanced version too? Any suggestions?

chrisreu commented 10 years ago

To be honest i'm not completely sure how completion is working right now. I remember that we discussed this topic last year and that we changed some things in comparison to Holumbus, but i don't feel like it does what it should do.

I guess we have to define first, what kind of completions we want to support. I think in the end it comes down to this two options:

1) Search: Rudi v This results in a query "Rudi AND v" and gives me completion for words "rudi" and "v" that would result in a query with results. We do only need one list of completion words in this case. But it is not necessarily the list for the last word. It is the list of the word that the user is currently typing. So for this case, i could image two implementations:

2) Search "Rudi v" This results in a phrase query for Rudi v. So in this case the completion should take the whole query in consideration and return only suggestions for words beginning with v following the word "rudi" - This is not working this way right now!. But for this case it would make sense to return only one list.

3) What about complex queries? Let's consider two queries:

So this is a combination of the two options, but it really comes down to the same question. If the server does not know which term the user is currently typing, we would need the completions for all terms.

So...

I could imagine a function that integrates the completion results into the query term (just from the top of my head - no idea if this is really possible). So that the client puts in the queryterm and the completion-results and gets back a list of new queryterms, each with a different completion applied. He could then just show this list within the auto completion box.

chrisreu commented 10 years ago

So concerning the question: I think, if we could make option 2 work, we could ignore option 1 and complex queries in a first release. Then a single list of suggestions would be fine.

But long term, i think the current format is what we really need.

chrisreu commented 10 years ago

I just discussed this completion topic with Ulf and we came up with an idea of how this could be improved. This is not directly related to the initial proposal of this issue, but shows a way to improve the completion by building on top the proposed solution.

So in an ideal case we would support the cases 1 and 2 explained earlier. This way we are able to provide completions for "word" queries and "phrase" queries.

The problem really is, that applications using a search-engine like hunt or elasticsearch generally needs more complex queries. By that i mean, queries that are generated out of an UI containing select-boxes and such things.

So lets consider a common search application. There is a text field like in our demo server, but there are also select-boxes to filter for "color", "car-type", "horsepower" and other things. So the text field is really just a sub-term of big complex generated query.

The completion should only be applied to this sub-term. But the provided suggestions should take the whole complex query into consideration, so that the *auto-completed" query always yields results.

How could we achieve this

The idea is to introduce a new constructor to the query language, like this

data Query
  = QWord    TextSearchType Text  -- ^ Word search.
  | QPhrase  TextSearchType Text  -- ^ Phrase search.
  | QContext [Context] Query      -- ^ Restrict a query to a list of contexts.
  | QBinary  BinOp Query Query    -- ^ Combine two queries with a binary operation.
  | QBoost   Weight Query         -- ^ Weight for query.
  | QRange   Text Text            -- ^ Range query.
  | QCompletionTerm Query
  deriving (Eq, Show)

So this new CompletionTerm is only used in the generation of completions, but is completely ignored in the general query processing. It can be viewed as an annotation for a sub-term. With this addition the client can pass a complex query to the server and define, which sub-term the completion should be applied to.

This annotated sub-term would be restricted to either "Word" or "Phrase" queries. (This could be enforced by the parser or perhaps by the type directly).

How would the completions be calculated? With the new query constructor, the algorithm can extract the sub-term, the completion should be applied to. So for this sub-term, which is really just a query itself, the results could be computed just like it is done now (or with the proposed changes).

After the completions are calculated, each found suggestion needs to be inserted into the original complex query (by replacing the annotated sub-term with the suggestion-term). This should be fairly easy, because we are only dealing with "Word" and "Phrase" terms. Then this query needs to be run through the general query processor - to see if there are results for this particular query.

I think this could be implemented quite easily, but would improve the completion significantly.

This is a lot of stuff to go through here, but i'd love to hear your opinions on this.

UweSchmidt commented 10 years ago

In the last 2 weeks I've worked a bit on a refactoring of the query evaluation with the goal to separate query search, which computes a set of scored docids, from query completion with a set of (scored) words as result.

The key step to reach this is to write two interpreters for these tasks. Then it showes up, there's no need to extend the query language, all info necessary is already there.

The new algorithm computes a list of suggestions for the rightmost primitive search node (a word or prefix search). The usual way, a user constructs a query is extending the query to the right side, so this algorithm is not the most general but solves the most often use case.

The algorithm works like this: for an OR query, the left operand is ignored. The same applies for AND NOT. In the AND case a list of words and occurrences is computed from the right operand, from the left a set of docids is computed and then the right result is filtered by the left set of docids, so words not leading to a successful search are removed.

Context and boost queries are simple, just the sub-queries are evaluated.

For a phrase with a list of words the raw results of all sub-queries are combined to a new raw result by intersecting the occurrence sets (intersection with docidis, position intersection with an offset of 1). This result of a phrase can easily be aggregated to a scored list of words, e.g. by counting the positions and summing up all counts over all docids.

First test show up to be effective, but I've not yet done a test with the Hayoo index. I expect both algorithms to be rather efficient (compared to the existing), due to the rather early aggregation of the results of sub-queries. The partial results are aggregagted in both cases as early as possible.

At one point I've indeed extended the query language. For a simple word search (not a prefix search) there is a similar case to QWord (QFullWord). A phrase with just a single word is transformed (on the fly) into a QFullWord query.

For a real list of words there is an extra case QSeq binop [Query]. A phrase consisting of more than a single word is transformed into a QSeq with a list of QFullWord subqueries. This transformation is also done on the fly. So the current query interface has not changed.

This separation leads to more modularity in the query interpreters, and by the way, to more flexibility. Now it's easy to add search ops to search not just for a sequence of words but to look into e.g. the next 2,3,4,... words or to look into the a range around a word.

chrisreu commented 10 years ago

Okay, just to be sure i got this right.

With this approach the completion for phrase queries works like option2 by default.

What i defined as option1 could be achievied by transforming the query (by a client library perhaps). For example: Search: rud völl. If the user is currently typing völl the completions work out of the box (for the word völl). If the user is currently typing rud, the client could just switch the words and ask completions for the term völl rud. So the resulting query would be völl AND rud instead of rud AND völl. Now he gets completions for word rud.

So basically, what i called annotation of the completion term is done not by annotation here, but by defaulting the completion term to the rightmost term of the query.

So for complex queries, it is the same. Send the "phrase" or "word" query for completion as the rightmost term and you get suggestions. What i could not figure out, is, if this approach already takes the whole complex query into consideration?

Nevertheless, this sounds really cool and will be way bedder than what we got now. How long do you need to finalize this? Maybe you could push the code in a branch until its done?

UweSchmidt commented 10 years ago

I've just pushed the refactored scoring algorithm and the completion algorithm to github. The completion alg. computes suggestions for the very last word or prefix, but it takes into account the other sub-queries to the left.

The more general algorithm, that computes completions for all primitive queries must have a more general result structure, e.g. something like the Query type but with the leaves substituted by lists of words. That would be a cool thing, but it's not simple to compute. With the simple one for the rightmost query the restrictions for the rightmost word are computed by the left sub-queries, but with the general case, the left query can restrict the right one, and the other way round. So info from the result sets flow in both direction from left to right and right to left. I guess, there's no simple solution for this. That's perhaps maybe an interesting Master thesis (?).

I will test the algorithm with the Hayoo index to check whether the scoring works for larger sets of words than with the simple test cases used during development. A tuning of the scoring parameters may be necessary. Currently there runs an indexing process for complete hackage on my local machine.

chrisreu commented 10 years ago

Cool!

How does it work for phrase queries?

This request ...

Exec: Completion {icPrefixCR = QPhrase QNoCase "thomas strunz", icMaxCR = 20}
execCompletion: QPhrase QNoCase "thomas strunz"

... gives me

{"msg":[["Strunz",8]],"code":0}

but:

This request ...

Exec: Completion {icPrefixCR = QPhrase QNoCase "thomas strun", icMaxCR = 20}
execCompletion: QPhrase QNoCase "thomas strun"

... gives me an empty suggestion list.

{"msg":[],"code":0}

I'm trying to integrate the new completions into the user interface. How should the phrase queries look like to produce suggestions?

UweSchmidt commented 10 years ago

In the given query we have to change the exact word search to a prefix search. The phrase search QPhrase ... is currently (on the fly) transformed into

QSeq Phrase [QFullWord QNoCase "thomas", QFullWord QNoCase "strunz"]

QFullWord does the exact search for a single word, like QPhrase with a single word. QWord is the prefix search, so we have to change the above query to

QSeq Phrase [QFullWord QNoCase "thomas", QWord QNoCase "strun"]

to get all suggestions for "strun". Or we make it a bit more flexible by always using QWord to even get answers for `"thom" "strun":

QSeq Phrase [QWord QNoCase "thom", QWord QNoCase "strun"]

QSeq has a binary operator and a list of sub-queries. When processing such phrase queries, it is (like in the old alg.) convenient to have the whole list of sub-queries available, and not only the first 2, as it would be the case for a binary expression, like QBinary op sub1 sub2. So the n-ary QSeq is a bit more comfortable as QBinary and we don't have any issues with associativity.

The easiest way to build these queries is using the Hunt.ClientInterface. This module hides the constructors by giving the user a set of 'smart' constructors. Using that interface gives us the chance to change the internal representation of queries, without breaking the external representation (at least as long as the queries are generated by a Haskell client).

qNext $ map (setNoCaseSearch . qWord) ["thomas", "strun"]

And we can easily extend that interface by more 'smart' constructors, to make the query generation simpler, e.g. by a qNoCaseWord.

UweSchmidt commented 10 years ago

I've pushed a modified ClientInterface to github to make the query from the last comment (and others)

qNext $ map (setNoCaseSearch . qWord) ["thomas", "strun"]

a bit simpler

qNext $ map qWordNoCase ["thomas", "strun"]
chrisreu commented 10 years ago

Cool. I think we should stick to this variant:

QSeq Phrase [QFullWord QNoCase "thomas", QWord QNoCase "strun"]

This way we get all completions for strun that really lead to hits. If we would use QNoCase for every word of the phrase there are cases imaginable that do not lead to hits.

For example: "thoma strun" would result in completion strunz, but the "completed" term "thoma strunz" would not find any document.

Furthermore i think, this transformation has to happen in the query processor. This way it works not only for queries generated with Haskell, but also for queries generated from JSON or Text.

There is a function normQuery right now, which does the transformation for the search. I think we need a specialiced normQuery for the single case "Phrase queries in completion".

I tried to implement that in my latest commit. I'm not a 100% sure if it's the correct place to apply the new transformation, but it looks good to me.

chrisreu commented 10 years ago

I just adjusted the server frontend. There are still some flaws in the javascript (works best in Firefox), but otherwise it looks really really nice.

As long as the parentheses and quotes are closed we even get displayable suggestions for phrase and complex queries like: wer:(Rudi Völ). This is really cool!

I think, there is still some room for improvement for the phrase queries. Think of a phrase search like "Wie so oft". This would not return a lot of suggestions. Maybe we could provide completions as phrases as well like oft liegt auch hier die Mitte in der Wahrheit. Just an idea.

UweSchmidt commented 10 years ago

A generalization of the completion algorithm for queries with a phrase as rightmost sub-query (or just a phrase query) should not be a big deal. It's rather easy to get all word sequences, not only the choices for the last word. The only point is to extend the CmdResult for returning something like a [([Word], Score)] and parameterize the Completions command with a flag determining which form of result has to be delivered.