vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.62k stars 589 forks source link

Add support for filtering in grouping over multi-valued fields #15658

Open jobergum opened 3 years ago

jobergum commented 3 years ago

Currently if you group on a multi-valued field grouping uses all values.

There are use cases where one wants to filter the values and only group over the values which matches the filter. A start could be to have functionality similar to matched-elements-only which controls what part of the multi-valued field is returned in the summary, it could be extended to only expose matched elements to grouping. The best would end state would be able to add a filtering constraint language in the grouping language.

MauDagos commented 3 years ago

I'll try to make up an example to add more context and so you can continue scoping the goal of this issue.

Imagine I have this schema:

schema my_doc {
    document my_doc {
        field id type string {
            indexing: attribute | summary
            stemming: none
            attribute {
                fast-search
            }
        }

        struct my_struct {
            field my_type type string {}
            field my_value type int {}
        }

        field my_array type array<my_struct> {
            indexing: summary
            stemming: none

            struct-field my_type {
                indexing: attribute | summary
            }

            struct-field my_value {
                indexing: attribute | summary
            }
        }
    }
}

I would like to find documents of type my_doc by filtering on their multivalue field my_array, and then I would like to group each my_struct struct and aggregate on them separately, instead of aggregating on the documents that matched.

For example, let's take following YQL query:

select * from my_doc
| all(group(my_array.my_type) max(10000) each(
    output(avg(my_array.my_value))
  ));

For this query, Vespa will match all my_doc documents and then group them by my_array.my_type. This means that each value found of my_array.my_type will have it's own group of my_doc documents, which can appear in other groups for other values. Then, when avg(my_array.my_value) is calculated, it's calculated on all of the my_struct structs in the my_array field of all the documents in the group.

This is all known, but I'm just giving details. Personally, I think this is very misleading, but this functionality can stay as is and we can add more functionality by expanding YQL. As such, I have the following ideas for which you can give feedback on.

* Grouping on multivalue fields.

An Elasticsearch style could be taken. You "nest" inside the multivalue field and once there you're grouping and aggregating at this level, instead of on the document level. Here's an example of how the YQL could look like for retrieving the data I want:

select * from my_doc
| all(nested(my_array) all(
    group(my_array.my_type) max(10000) each(
      output(avg(my_array.my_value))
    )
  ));

I can also imagine this new operator being used at any level. For example, group per document and then group and aggregate their multivalue fields:

select * from my_doc
| all(group(id) max(10000)
    all(nested(my_array) all(
      group(my_array.my_type) max(10000) each(
        output(avg(my_array.my_value))
      )
    ));
  );

* Filtering on multivalue fields.

As a next step, it would be great to allow to filter what multivalue fields you want to group. For example, I only want to group on my_struct with a specific my_struct.my_type.

select * from my_doc
where weightedSet(my_array.my_type, {"FOO": 1})
| all(nested(my_array) where(
    (my_array.my_type == "FOO") all(
      group(my_array.my_type) max(10000) each(
        output(avg(my_array.my_value))
      )
    ))
  );

I hope this is enough context for you to continue scoping this ticket. I've not worked enough with Vespa and YQL to be sure that my proposals for expanding YQL are the best options, so feel free to give your feedback and opinions on how the YQL should look.

Kind regards!

lundin commented 3 years ago

+1 on this request. I have a document (describing an item with a SKU) which has a array of structs (attributes) that describes where in a menu-hierarchy an items fits (can be 1 to n places for 1 single document/SKU).

For example one article/document (a plug) needs to appear on these 3 levels in the menu: "menuetree": [ { "level1": "Service", "level2": "Service/maintenance", "level3": "Oil pan drain plug" }, { "level1": "Engine", "level2": "Basic engine", "level3": "Plug, crankcase" }, { "level1": "Engine", "level2": "Lubrication", "level3": "Oil pan drain plug" } ]

on YQL quey i filter on level1 Engine but since grouping (in this case on level2) is on document level (the item/SKU) "Service/maintenance" (even though it belongs to Service) will also be included in the result. Now i can solve that with post process the resulting answer and clean out non matching entries etc. Still would have been nice if array/map followed the filter criteria.

bratseth commented 3 years ago

Thanks, I like these suggestions!

We hope to be able to start working on this in our next sprint starting February 9.

baldersheim commented 3 years ago

I agree that filtering support in multivalued fields would be nice. However I would like to point out that you can achieve what you want by using parent-child relation, see https://docs.vespa.ai/en/parent-child.html.

Then you would query over the child and you should be able to achieve the behaviour you want.

If most of your data are in the array field I would probably choose a parent-child approach. But if it is a small part of your document, not naturally separated I would go for the array approach.

This being said I do think that adding a filter criteria to grouping will make sense. Just make sure you select the proper tool for the job. Going out in the garage for the sledgehammer when you need a small nail in the wall for hanging a small picture might be impressive, but probably not the better solution in the long run.

MauDagos commented 3 years ago

Thanks for the reminder of parent-child. However, for our use-case, it was recommended by the Vespa team itself that we don't follow that structure because of two reasons (which can't be seen from my simple example in my first comment):

  1. We're dealing with millions and millions of documents. With parent-child relationship parent documents need to be on all nodes, so scaling becomes problematic.
  2. Imported fields from parent documents can only be attributes, which do not support free text matching. For our use case we need free text search on fields on both the parent and the child documents. There's a ticket open for this feature: https://github.com/vespa-engine/vespa/issues/12333
baldersheim commented 1 year ago

I think this one is due now.

I suggest we split this in 2 as it is 2 orthogonal features. 1 - Nesting. This is the natural behavior you would expect when grouping and aggregating over structured data. If we allow the semantic change for structured fields this can be done without any yql changes. All the structured information is in the schema already and can be extracted directly runtime in the backend.

2 - Filtering. This is an orthogonal feature that might also make sense for multi value primitive fields.

bratseth commented 1 year ago

Let's use this issue for filtering (item 2).

I created https://github.com/vespa-engine/vespa/issues/27238 on nesting (item 1).

ernestas-poskus commented 12 months ago

+1 on this request.

kkraune commented 11 months ago

we will continue working on this in Q4