vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.73k stars 597 forks source link

Add exclusions to aggregation #26627

Open eostis opened 1 year ago

eostis commented 1 year ago

E-Commerce facets often require to display "excluded" facet items from current selection.

For instance, let's say we selected the "Shoe" category. Each shoe has exactly one colour: blue or red.

If we select the blue filter, the red filter disappears, no way to also display red shoes: Color  X Blue

But if we want to show the red excluded filter also, with a "OR" on the colour filter:

  1. Show only blue shoes, but display the red excluded filter Color  X Blue   Red

  2. Show blue or red shoes Color  X Blue  X Red

(See https://solr.apache.org/guide/solr/latest/query-guide/faceting.html#tagging-and-excluding-filters)

bratseth commented 1 year ago

The grouping result is generated from the matches. If you want to generate a grouping result covering the complete universe of possible choices you can do that by sending another query just for grouping (hits=0), with the query "true".

Maybe it could be worth while to have a built-in Searcher that can do this.

eostis commented 1 year ago

Exclusion is applied to the current color filter, but other filters must be applied. Which means we would build a second query with the same filters but the color filter to get the excluded color facet values.

And we must do that for each excluded facet (color and size for instance). There can be tens of excluded facets, so performance is an issue, and all excluded queries must be called in parallel (same as #26579)

I will do just that for my WooCommerce plugin. But it could be nice to build it in Vespa's yql syntax, as this is a very common e-commerce requirement.

Exclusion seen on a world-famous e-commerce site:

image image image
bratseth commented 1 year ago

Why can't you do one query with no exclusions to get all these alternatives?

eostis commented 1 year ago

Because exclusion is only among a facet items, not between facets.

Here is the original query with 2 non-excluded facets color and size, returning one facet color and one facet size: select * from my_type where (post_type matches "product") and (color matches "blue") and (size matches "xxl") all( all(grouping(size) each(output(count())) all(grouping(color) each(output(count())) )

Here is the query for excluded size, returning all sizes for the blue color: select * from my_type where (post_type matches "product") and (color matches "blue") all(grouping(size) each(output(count()))

Here is the query for excluded color, returning all colors for the xxl size: select * from my_type where (post_type matches "product") and (size matches "xxl") all(grouping(color) each(output(count()))

bratseth commented 1 year ago

In your example picture above is a gui which (I believe) shows all the options of each facets. That can be produced by one query, recalling all documents (or all documents matching the part of the query not set by facet selection).

It also shows some facet values being selected, which leads to some choices of other facets being deactivated. The information needed for this is also one query, adding all selected facets as filters and deactivating all facet values not present in the resulting grouping.

What am I missing?

eostis commented 1 year ago

Exactly: it would require one query (no exclusion) + one query per excluded facet with a selection

In my previous example:

And so on if more excluded facets are selected.

bratseth commented 1 year ago

Yes, it will be one query to respond to each new action taken by the user. That is normal and as expected right?

And if we assume no state in the frontend, we'll need also need to redo the query without selection, so we get 2 queries per new user action. But we never need more than 2 queries at most per user action - agreed?

eostis commented 1 year ago

No, because you must count excluded facets selected previously.

Total: for the whole sequence (1+ 2 + 3 + .. + 1 + N) = 1+ N*(N+1)/2 queries (instead of 1+N with non-excluded facets)

The examples above with color and size required 3 queries, because both color and size where selected.

bratseth commented 1 year ago

Yes, N user actions give N+1 queries in total (not counting the first user action of opening the page/app). Agree?

eostis commented 1 year ago

No, the last action gives N+1 queries.

Total actions give O(N*N) queries

Total: 1 +2 + 3 + 4 = 10 queries

bratseth commented 1 year ago

Okay, back to the original question:

Why does selecting 3 facets take 4 queries (assuming no state in the client)?

What additional information do you need at this point which is not returned by these two queries?

eostis commented 1 year ago
  • You get all choices for all facets by issuing one query selecting everything.

More precisely, you get all choices for all non-excluded facets

  • You get the results for the current selection (of one or more choices of 3 facets), plus information about which choices to deactivate (those that are not present in the facets results of this query), from one query.

We need to get the 3 aggregations, but each of them requires different where filters (same filters for each, without the current selection). Which gives 3 queries.

Notice the filters below: the non-excluded filter remains, while each filter on the aggregation is removed: select * from my_type where (post_type matches "product") and (color matches "blue") all(grouping(size) each(output(count())) select * from my_type where (post_type matches "product") and (size matches "xxl") all(grouping(color) each(output(count()))

bratseth commented 1 year ago

We need to get the 3 aggregations

Why, i.e what information do you gain from this that is missing otherwise?

eostis commented 1 year ago

Because the 3 aggregations cannot be calculated with the same where filters. Each one is calculated with the facet selection excluded (removed) from the where filters.

Original query: (facet1 matches "value1") and (facet2 matches "value2") and (facet3 matches "value3")

Gives:

bratseth commented 1 year ago

The first query (matching all) will give you all values of facet1, facet2 and facet3.

eostis commented 1 year ago

The (matching all) query gives all values for facet1, facet2, facet3.

But not all values for:

bratseth commented 1 year ago

Yes, but those are subsets. You need the subset matching the current selection (of facet1,2,3), but what do you need the other subsets for?

eostis commented 1 year ago

Let's go back to more precise examples.

id: 1

id: 2

id: 3

id: 4

1) Filter color("blue") and size("10"), no exclusions: select * from my_type where (post_type matches "product") and (status matches "published") and (color matches "blue") and (size matches "10") all( all(grouping(color) each(output(count()))) all(grouping(size) each(output(count()))) all(grouping(material) each(output(count()))) )

Color
- X blue (1)
Size
- X 10 (1)
Material
-   Textile (1)

2) Filter color("blue") and size("10") with exclusion on color: select * from my_type where (post_type matches "product") and (status matches "published") and (size matches "10") all( all(grouping(color) each(output(count()))) )

Color
- X blue (1)
-   red (1)

3) Filter color("blue") and size("10") with exclusion on size: select * from my_type where (post_type matches "product") and (status matches "published") and (color matches "blue") all( all(grouping(size) each(output(count()))) )

Size
- X 10 (1)
-   11 (1)

Let's combine aggregations from (1) (2) (3), to see expected excluded color(red) and size(11):

Color
- X blue (1)
-   red (1)
Size
- X 10 (1)
-   11 (1)
Material
- Textile (1)

Results and non-excluded facet aggregations are coming from query (1). Query (2) and (3) are there only to calculate aggregations on excluded facets.

bratseth commented 1 year ago

Results and non-excluded facet aggregations are coming from query (1)

Yes.

Query (2) and (3) are there only to calculate aggregations on excluded facets.

Can't you replace 2) and 3) by

select * from my_type where (post_type matches "product") and (status matches "published") all( all(grouping(color) each(output(count()))) all(grouping(size) each(output(count()))) all(grouping(material) each(output(count()))) )

This returns

Color
-  blue (2)
-  red (2)
Size
-  10 (2)
-  11 (2)
Material
- Textile (2)
- Leather (2)

You also have

Color
- X blue (1)
Size
- X 10 (1)
Material
-   Textile (1)

From 1).

Now you can combine them to create a view like in the image above:

Color
-  X blue
-    red
Size
-  X 10
-    11
Material
- Textile
- (Leather)

That said, I think there is something in that image (but not in your example) that you won't get by 2 queries: You won't get the information necessary to differentiate between other alternatives of facets that are selected that are selectable given the choices in other facets, that is the difference between selectable and non-selectable shoe sizes in that image.

eostis commented 1 year ago

Now you can combine them to create a view like in the image above

The result should give - red (1), but the (match all )query gives - red (2) while the original query does not return red at all. How to produce - red (1) from both?

But my original feature request is to have Vespa manage the whole thing server-side, from a YQL syntax addition. My examples are just here to explain what happens, and how one could do it client-side in the meantime.

From my previous implementations, here are some hints:

bratseth commented 1 year ago

Yes, 2 Red instead of 1 is an example of the same missing info (however, in that case I don't think it's necessarily better to show 1 than 2 if you show numbers at all since it may not be obvious to users what that number means.)

Anyway, this is clear to me now, thanks for the patience!

eostis commented 1 year ago

Closing inactive issue.

bratseth commented 1 year ago

On our backlog.