Closed caldwecr closed 8 years ago
Frankly, I didn't have a chance to work with aggregations closely, which is why this part of chewy is not elaborated enough, and unfortunately I currently have no ideas about it. But I'm open to discussion, so let's try to design these interfaces and then implement them. I definitely agree this should be done.
I would listen to your proposals first
Thinking about this more over the last couple of days, I believe the two highest-impact areas for updating Chewy with regard to aggregations are the ability to bind aggregate results (from the side channel) to query results, and a way to write aggregations that are human readable (they would probably reside within a define_type block).
Before I get too far into spec'ing an interface for those two pieces, I want to knock out the less beneficial but easier 'deep import' problem. (I'm opening a separate issue to present the proposed interface.)
Lots more coming. But some work in progress for thought. Hoping to produce examples for each type of aggregate (as of 1.5), although that may prove too ambitious.
```ruby
class StoreIndex < Chewy::Index
  define_type Product do
    root parent: "store", parent_id: -> { store_number } do
      field :product_name
      field :sku, type: 'integer'
      field :manufacturer
      field :parts_warranty_in_days, type: 'integer'
      field :labor_warranty_in_days, type: 'integer'
      field :categories do
        field :category_id
        field :category_name
      end
      field :end_of_life, type: 'date'
      field :price_in_cents, type: 'integer'
      field :cost_in_cents, type: 'integer'
      field :weight_in_kgs, type: 'float'
      field :height
      field :width
      field :length
      field :packaging_type # some kind of enumeration
      field :quantity_on_hand, type: 'integer'
      field :quantity_in_transit, type: 'integer'
    end
  end

  define_type Store do
    field :store_number, type: 'integer'
    field :categories do
      field :category_id
      field :category_name
    end

    # The min aggregation
    #
    # Creates a value attribute: the result of the aggregation for
    # each query result will be bound to the query result as an attribute.
    #
    # Generates an aggregations side channel response of
    #   "aggs": {
    #     "lowest_price_in_cents": {
    #       "value": "2315"
    #     }
    #   }
    #
    # The reader should accept the hash from above and return
    # the value that should be returned when lowest_price_in_cents
    # is called on each instance returned from the query.
    define_agg :lowest_price_in_cents, reader: :parse_lp_agg do
      min :price_in_cents
    end

    def parse_lp_agg(hsh)
      hsh["lowest_price_in_cents"]["value"] || 0
    end
  end
end
```
```ruby
# I can access the aggregate by doing something like this ...
# (returns the hash of lowest price indexed by store number)
sony_products_by_store = StoreIndex::Store.filter { manufacturer == 'Sony' }.aggs([:lowest_price_in_cents])
sony_products_by_store.reduce({}) do |res, store|
  res[store.store_number] = store.lowest_price_in_cents
  res
end
# => { 5 => 1299, 77 => 1015, 44 => 987 }
```
The first thought on this: these pre-defined aggregations should not be run every time; you should have to request them explicitly in the query, like StoreIndex::Store.filter(...).query(...).aggregations(:lowest_price_in_cents), so that when a plain symbol is passed, the query takes the pre-defined aggregation from the type definition.
Btw, it would be possible to simply use named scopes for pre-defining aggregations. And this resolves the problem of index-wide aggregation pre-definition.
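A sketch of what that named-scope approach might look like (the scope name and its body are hypothetical, not existing Chewy API; `aggregations` is assumed to be the query method that merges raw aggregation params into the request):

```ruby
class StoreIndex < Chewy::Index
  define_type Store do
    field :store_number, type: 'integer'
    field :price_in_cents, type: 'integer'

    # Hypothetical named scope: the aggregation runs only when a caller
    # chains the scope, instead of on every query against the type.
    def self.with_lowest_price
      aggregations(lowest_price_in_cents: { min: { field: 'price_in_cents' } })
    end
  end
end

# Usage: the aggregation is included only because the scope is chained.
StoreIndex::Store.filter { manufacturer == 'Sony' }.with_lowest_price
```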
The only problem is binding aggregated values to search results. And here we have to think about whether we are able to do it automatically, without any definitions, depending only on the results themselves.
Some more thoughts.
Regarding the explicit inclusion of particular aggregates in a result: this makes perfect sense. I think it should accept a symbol as well as an array of symbols. I know that the method is composable and could just be called in a chain, but I think aliasing a selection of aggregates and then referring to them by that alias would be useful for this purpose.
```ruby
cheap_report_aggs = [:lowest_price_in_cents, :some_other_aggregate, :yet_another_aggregate]
StoreIndex::Store.filter { manufacturer == 'Sony' }.aggs(cheap_report_aggs)
```
Another option might be allowing a user of Chewy to declare an aggregation as lazy. In this fashion it would always be an accessible attribute on any result record, but it would only be calculated if needed, or as an optimization. As I don't really need this functionality presently, I consider it a relatively low priority.
I'm not sure that the named scopes approach is totally tenable. My initial reaction to the approach was quite positive, but reflection has left me suspecting that determining whether an aggregate should be present based on a named scope may be a version of the halting problem. Additionally, the named scope creates the impression that the decision of whether a particular request should include an aggregation is orthogonal to the calculation of the filter itself.
Maybe I'm totally wrong wrt named scopes - so please share your feedback. At the heart of my concern is that named scopes should either be used as part of the definition process or as part of the execution (query) process, but not both.
In regards to binding aggregate values - how do you see that working? Should the values actually be copied into the record attributes? Should there be some kind of dynamic binding of the resulting aggregations hash and the records?
We actually have extended ES and Chewy's core functionality around aggregations to enable two additional functions:
There are two very different approaches I've considered for this part of the puzzle. The first approach is relatively naive: adding the aggregate values as attributes directly on the resulting records (this is how we do it currently). The second is analogous to a view-model approach in which there is some data store (the aggregates hash) and a means of using that data store as the backing that is actually behind the individual result records.
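As a plain-Ruby illustration of the first (naive) approach (the `bind_aggs` helper and the reader lambdas are hypothetical, not Chewy API), the aggregate values can be copied onto each result object as singleton readers:

```ruby
require 'ostruct'

# Naive binding sketch: given query results and the aggregations side
# channel, define a reader on each result record that exposes the value
# extracted by the corresponding reader lambda.
def bind_aggs(results, aggs, readers)
  results.each do |record|
    readers.each do |name, reader|
      value = reader.call(aggs)
      record.define_singleton_method(name) { value }
    end
  end
end

# Stand-ins for query results and the aggregations response.
results = [OpenStruct.new(store_number: 5), OpenStruct.new(store_number: 77)]
aggs = { "lowest_price_in_cents" => { "value" => 2315 } }

bind_aggs(results, aggs,
          lowest_price_in_cents: ->(h) { h["lowest_price_in_cents"]["value"] || 0 })

results.first.lowest_price_in_cents # => 2315
```

The obvious downside is that the copied values are frozen at bind time; the view-model approach avoids that by keeping the aggregations hash as the single backing store.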
I will add that we found it more challenging than expected to reliably extract the values from the aggregations hash for an aggregation of moderate complexity (depth > 3). It is outright hard when the aggregations require ordering.
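To make the extraction concern concrete, here is a small plain-Ruby sketch (the `dig_agg` helper is mine, not part of Chewy) that walks a nested aggregations response, descending into `buckets` arrays by key. This is the kind of traversal that gets fiddly once the nesting is a few levels deep:

```ruby
# Hypothetical helper: walk a nested aggregations response along a path.
# Plain symbol steps name a sub-hash; [agg_name, bucket_key] steps descend
# into the bucket of that aggregation whose "key" matches.
def dig_agg(aggs, *path)
  path.reduce(aggs) do |node, step|
    return nil if node.nil?
    if step.is_a?(Array)
      name, key = step
      buckets = node.fetch(name.to_s, {}).fetch("buckets", [])
      buckets.find { |b| b["key"] == key }
    else
      node[step.to_s]
    end
  end
end

# A response shaped like a terms aggregation over stores with a nested min.
response = {
  "stores" => {
    "buckets" => [
      { "key" => 5,  "lowest_price_in_cents" => { "value" => 1299 } },
      { "key" => 77, "lowest_price_in_cents" => { "value" => 1015 } }
    ]
  }
}

dig_agg(response, [:stores, 77], :lowest_price_in_cents) # => { "value" => 1015 }
```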
I propose that as a first deliverable step I create a PR that does the following ...
This still doesn't address the binding part of the problem, which I propose to address subsequently to these steps.
@pyromaniac - this comment and the preceding comment are ready for your consideration.
Alright, let's try to submit a PR.
> In regards to binding aggregate values - how do you see that working? Should the values actually be copied into the record attributes? Should there be some kind of dynamic binding of the resulting aggregations hash and the records?
Not sure yet, have to think about it
@pyromaniac, inside the Query code, is there a way to find out which type the current query is relative to? @_indexes shows all of the indices, and each of these has a #types method, but I don't see a way to actually know which type the current query is associated with.
The reason I need this is so that I can namespace named aggregates, allowing a particular aggregation name to be used more than once.
there should be a @_types variable as well :)
@pyromaniac yes, there is, but how do I know which type the query has as its default type?
Consider
```ruby
class MyIndex < Chewy::Index
  define_type Foo do
  end

  define_type Bar do
  end
end

# When I do this, the default type is Foo
MyIndex::Foo.filter { match_all }

# When I do this, the default type is Bar
MyIndex::Bar.filter { match_all }

# So inside of the Query code, how do I know if the default type for
# some query / filter is relative to MyIndex::Foo or MyIndex::Bar?
```
Actually, there is no way to find the default type. You are querying on one or several types and that is all. If you do MyIndex.filter { match_all }, then @_types will contain both of the types of the current index. So you probably have to iterate over the array.
@pyromaniac can you please review PR https://github.com/toptal/chewy/pull/242
This is the direction I'm heading after #242 gets merged.
All of the ES v1.7 aggregations can also be used more conveniently with the "typed" aggregations syntax.
The below example also introduces the syntax for the composition of aggregations. Note that requesting an aggregation that depends on other aggregations will also cause those underlying aggregations to be invoked automatically.
```ruby
class UsersIndex < Chewy::Index
  define_type User do
    field :name
    field :rating, type: "long"

    # Becomes { names: { terms: { field: "name" } } }
    agg :names, terms: "users#user.name"

    # Note that it's best practice not to create a named aggregate with the same name as a field at
    # the same place in the hierarchy. Technically it is permissible, because you reference other
    # aggregations with the @ symbol prefix.
    # The below aggregation definition has the effect of changing the names aggregation hash above to
    # {
    #   names: {
    #     terms: { field: "name" },
    #     aggs: {
    #       number_of_each_name: {
    #         value_count: { field: "name" }
    #       }
    #     }
    #   }
    # }
    agg :number_of_each_name, value_count: "users#user@names"
  end

  define_type Post do
    field :title
    field :body
    field :comments do
      field :message
      field :rating, type: "long"
    end

    # Becomes { min_rating: { min: { field: "comments.rating" } } }
    agg :min_rating, min: "users#post.comments.rating"
    # Becomes { max_rating: { max: { field: "comments.rating" } } }
    agg :max_rating, max: "users#post.comments.rating"
    # Becomes { avg_rating: { avg: { field: "comments.rating" } } }
    agg :avg_rating, avg: "users#post.comments.rating"
    # Becomes { number_of_comments: { sum: { field: "comments" } } }
    agg :number_of_comments, sum: "users#post.comments"
  end
end
```
I still need to work through each of the 1.7 aggregations types (a fair number of which I've never used) before committing myself to this syntax and composition approach. Thoughts?
I'm still thinking, actually. The main thing here: I'm not sure about the users#post.comments.rating syntax. But at least for now, it doesn't seem that there is another way.
Looking ahead to ES 2.0 and reducers, and considering our own work with automatically converting parts of the chewy filter DSL directly into aggregations, I'm taking this chance to reach out and ask: have you thought yet about the approach you want to take? If any?
I'd love to see more convenient methods for parameterization of aggregations (currently missing in the elasticsearch gem client), aggregation side channel integration into the query results, and direct support in the chewy filter DSL for reductions of aggregations.
I'm sure I can come up with ten more important improvements that I'd love to see, but my intent with this message is to get a feel for your own intent.
Kind regards.