Adding reverse nested count as a Metrics aggregation

IdanWo commented 6 years ago

Background:

The current plugin supports rendering of reverse nested aggregations out-of-the-box, meaning that when needed in the query - the reverse nested clause is added. When creating a bar chart in kibana, it supports different kinds of aggregations as Metrics (for y-axis): kibana . Most of the available aggregations in ElasticSearch are supported: metrics aggregations such as value count aggregation, bucket aggregations such as top hits aggreation and pipeline aggregations such as max bucket aggregation. It's time to add the Reverse Nested aggregation to the available aggregations in this list.

Motivation:

Each home document is unique. However, same car models might be included in different homes and sometimes the same home might include the SAME car model - if more than one car is purchased. Querying for the top 10 purchased car models is different from querying for the top 10 car models purchased by unique houses. So, if I want to find the top 10 cars which are purchased the most with unique houses I would probably run the following query:

{
    "aggs" : {
        "MyNested" : {
            "nested" : {
                "path" : "Cars"
            }
        },
        "aggs" : {
            "MyTerms" : {
                "terms" : {
                    "field" : "Cars.Model",
                    "Sort" : {
                        "MyReverse":"desc"
                    }
                }
            },
            "aggs" : {
                "MyReverse" : {
                    "reverse_nested" : {}
                }
            }
        }
    }
}

Unfortunately, count aggregation doesn't have an option of choosing a field. So, the only option left is to choose the unique count aggregation and specify the field containing the id of the home. This creates an unnecessary cardinality aggregation, which is statistic and not accurate by default (might be more accurate by specifying the precision_threshold field - but this affects both memory and latency).

{
    "aggs" : {
        "MyNested" : {
            "nested" : {
                "path" : "Cars"
            }
        },
        "aggs" : {
            "MyTerms" : {
                "terms" : {
                    "field" : "Cars.Model",
                    "Sort" : {
                        "MyReverse.MyCardinality":"desc"
                    }
                }
            },
            "aggs" : {
                "MyReverse" : {
                    "reverse_nested" : {}
                },
                "aggs":{
                    "MyCardinality":{
                        "cardinality":{
                            "field":"HomeId"
                        }
                    }
                }
            }
        }
    }
}

Remarks:

I believe that the aim of this plugin is to embed nested support seamlessly, without even mentioning the words Nested or Reverse Nested. Therefore, we should ask ourselves what's the name of this aggregation: is there a reason for adding an explicit aggregation which is called "reverse nested" or not. For example, special "count" aggregation for nested aggregation isn't needed since it's already handled by the current "Terms" aggregation. Our options are probably those:

Extension to current "count" aggregation: this is the best option, but isn't very practical (kibana plugins suppose to add functionality, not to replace existing ones).
"Reverse Nested Count" aggregation: in this aggregation an option to choose a field is added. The default field is "root". If any other field is chosen from the list - the reverse nested would be to its corresponding level. According to home data set, in case root level is chosen it's actually a unique count and not just count - but we can't assume that this is true for any other data sets (it depends in the nature of the data-set, for example whether the document id is generated or not).

ppadovani commented 6 years ago

Ok, first off I gotta say I love how detailed you made this ticket! You even used the test dataset I created for testing what this plugin does. Bravo!

So I read this twice to make sure I understood what you are looking for and I believe this is feasible.

In terms of the options, I would like to point out that this plugin already violates the 'only add functionality' premise in about every way possible. I overload existing code, replace entire objects, UI elements etc. In fact if you look at the terms aggregation, when selecting a nested field, you get an additional option called 'count by parent' which forces a reverse nested back to the parent document.

So, with that being said, the easiest solution to this might be to change the checkbox that I inject into a field selector to allow the user to select the field they want to count by. Then I would change my underlying code to properly generate the native ES query to support that selection. Default behavior would be as it is today, as if the original checkbox was not selected.

Thoughts?

P.S. You didn't specify a Kibana version. My preference would be to put this only in the 6.3.X versions and beyond. I also have a couple of enhancement tickets around scripted fields that I was going to implement for the next major revision of this plugin, so I may include those along with this.

IdanWo commented 6 years ago

Thanks, appriciate your greetings! Nested aggregations in ElasticSearch are complicated, it's important to be well understood when discussing about them and open appropiate issues with high standards.

Haven't developed any kibana plugin so far. It seems that plugin installations not only add derived classes and code files, but can also update existing code files. Pretty impressive, didn't expect it.

I'm still in ElasticSearch 5.6.x. Plan to move in the following months directly to atleast ElasticSearch 6.3.x (for SQL support). Since my current queries make cardinality aggregations on small amount of data (about 1,000 documents per each single cardinality aggregation) - it's okay for now. Fixing it in 6.3.x and above should be just fine.

Can you please emphesize what will be the solution? After the enhacement, I will still pick Count aggregation as Metric aggregation (without any aditional field option) - and in the terms aggregation itself I would mark the checkbox "use count of parent document" and pick the relevant field in the root document?

ppadovani commented 6 years ago

I will definitely provide more detail once I have implemented the solution. It may be something along the lines of what you have outlined... I'll want to try a few things to make sure I pick the most usable solution. I'll update the comments here within the next week or so with what I've decided to implement.

IdanWo commented 6 years ago

I'm sorry, but somehow until now I missed the feature of 'count by parent'. Actually, it did exactly what I needed in my use case since the parent of the bucket was the root document itself. It's my fault, since you did documented it well:

If you wish to have the parent aggregation be the aggregation used for the official count of the bucket contents, you must check the ‘use count of parent document’ box to enable this functionality.

The exact words 'reverse nested' should be mentioned there, as well, in my opinion. Maybe an image with a demonstration of this feature and the query it produces might be more understandable.

To sum up:

In case that a field is found under 2 or more nested levels - the 'count by parent' forces reverse nested back to the direct parent (1 level up only)? If that's the case, the name of this issue should be something like "reverse count by specifying the desired nested level". And the current checkbox should be 'count by direct parent'.
The original sort by "_count: desc" is still left with the addition of the new sorting by reverse nested aggregation. Seems good to me, as long as it's the second sort and not the first. Just to be on the safe side, I removed this default sort by using the JSON input feature of removing keys:

{
    "order" : {
        "_count" : null
    }
}

ppadovani commented 6 years ago

So, I believe the count by parent is based on the root parent document, but I could be wrong. I believe that this feature still has merit in that instead of a checkbox indicating count by parent, you are given the option of selecting the field to count by.

IdanWo commented 6 years ago

I agree. Even if the current implementation is to identify the direct parent, it's becoming hard when more than 2 nested levels are involved. ElasticSearch supports drilling down into multiple nested levels with just 1 nested clause. Selecting the desired nested level would probably make it easier for both implementation and user experience.

Multi level nesting is automatically supported, and detected, resulting in an inner nested query to automatically match the relevant nesting level (and not root) if it exists within another nested query.

Yet another minor enhancement would be the show only fields which are found in the parent docs.

The path cannot contain a reference to a nested object field that falls outside the nested aggregation’s nested structure a reverse_nested is in.

ppadovani commented 6 years ago

I intend to begin work on master for all of the issues tagged as features this weekend. (The weekends are about the only time I have currently to work on this code.)

ppadovani / KibanaNestedSupportPlugin