toptal / chewy

High-level Elasticsearch Ruby framework based on the official elasticsearch-ruby client
MIT License
1.89k stars 368 forks source link

Plan for promoting aggregations to first class #192

Closed caldwecr closed 8 years ago

caldwecr commented 9 years ago

Looking ahead to ES 2.0 and reducers, and considering our own work with automatically converting parts of the chewy filter DSL directly into aggregations, I'm taking this chance to reach out and ask: have you thought yet about the approach you want to take? If any?

I'd love to see more convenient methods for parameterization of aggregations (currently missing in the elastic search gem client), aggregation side channel integration into the query results, and direct support in the chewy filter DSL for reductions of aggregations.

I'm sure I can come up with ten more important improvements that I'd love to see, but my intent with this message is to get a feel for your own intent.

Kind regards. Sent from tiny keyboard

pyromaniac commented 9 years ago

Frankly, I didn't have a chance to work with aggregations closely, that's why this part of chewy is not elaborated enough, and unfortunately currently I've got no ideas about it. But I'm open for a discussion, so let's try to design this interfaces and implement it then. I definitely agree this should be done.

I would listen to your proposals first

caldwecr commented 9 years ago

Thinking about this more over the last couple of days, I think the two highest impact areas for updating Chewy with regards to aggregations are the ability to bind aggregate results (from the side channel) to query results and a way to write aggregations that are human readable (they probably would reside within a define_type block).

Before I get too far into spec'ing an interface for those two pieces, I want to knock out the less beneficial but easier 'deep import' problem. (opening a separate issue to present proposed interface.)

caldwecr commented 9 years ago

Lots more coming. But some work in progress for thought. Hoping to produce examples for each type of aggregate (as of 1.5), although that may prove too ambitious.

class StoreIndex < Chewy::Index
  define_type Product do
    root parent: "store", parent_id: -> { store_number } do
      field :product_name
      field :sku, type: 'integer'
      field :manufacturer

      field :parts_warranty_in_days, type: 'integer'
      field :labor_warranty_in_days, type: 'integer'

      field :categories do
        field :category_id
        field :category_name
      end

      field :end_of_life, type: 'date'
      field :price_in_cents, type: 'integer'
      field :cost_in_cents, type: 'integer'

      field :weight_in_kgs, type: 'float'
      field :height
      field :width
      field :length
      field :packaging_type #some kind of enumeration

      field :quantity_on_hand, type: 'integer'
      field :quantity_in_transit, type: 'integer'
    end
  end

  define_type Store do
    field :store_number, type: 'integer'
    field :categories do
      field :category_id
      field :category_name
    end

    # The min aggregation
    #
    # Creates a value field that the result of the aggregation for
    # each query result will be bound to the query result as an attribute
    #
    # Generates an aggregations side channel response of
    # "aggs": {
    #   "lowest_price_in_cents": {
    #     "value": "2315"
    #   }
    # }
    #
    # The reader should accept the hash from above and returns
    # the value that should be returned when lowest_price_in_cents
    # is called for each instance returned from the query
    #
    define_agg :lowest_price_in_cents, reader: :parse_lp_agg do
      min :price_in_cents
    end

    def parse_lp_agg(hsh)
      hsh["lowest_price_in_cents"]["value"] || 0
    end
  end
end

# I can access the aggregate by doing something like this ...
# (returns the hash of lowest price indexed by store number)
sony_products_by_store = StoreIndex::Store.filter { manufacturer == 'Sony' }.aggs([:lowest_price_in_cents])

sony_products_by_store.reduce({}) |res, store|
  res[store.store_number] = res.lowest_price_in_cents
end
=> { 5 => 1299, 77 => 1015, 44 => 987 }
pyromaniac commented 9 years ago

The first thought on this: this pre-defined aggregations should not be run every time, you have to specify running in query, like StoreIndex::Store.filter(...).query(...).aggregations(:lowest_price_in_cents), like if simply symbol passed, then query takes pre-defined aggregations from type definition.

Btw, it is simply possible to use named scopes for aggregations pre-definition. And this resolves a problem of index-wide aggregations pre-definition.

The only problem is binding aggregated values to search results. And here we have to think if we are able to it automatically without any definitions depending on results itself only

caldwecr commented 9 years ago

Some more thoughts.

Regarding the explicit inclusion of particular aggregates in a result: this makes perfect sense. I think it should accept a symbol as well as an array of symbols - I know that the method is composable and could just be called in a chain, but I think aliasing a selection of aggregates and then referring to them by their alias would be useful for this purpose.

cheap_report_aggs = [:lowest_price_in_cents, :some_other_aggregate, :yet_another_aggregate]
StoreIndex::Store.filter { manufacturer == 'Sony' }.aggs(cheap_report_aggs)

Another option might be allowing a user of Chewy to declare an aggregation as lazy. In this fashion it would always be an accessible attribute on any result record, but it would only be calculated if needed or as an optimization. As I don't really need this functionality presently I consider this of relatively low priority.

I'm not sure that the named scopes approach is totally tenable. My initial reaction to the approach was quite positive but reflection has left me suspecting that determining if an aggregate should be present based on a named scope, may be a version of the halting problem. Additionally the named scope creates the impression that the decision of whether or not a particular request should include an aggregation is orthogonal to the calculation of the filter itself.

Maybe I'm totally wrong wrt named scopes - so please share your feedback. At the heart of my concern is that named scopes should either be used as part of the definition process or as part of the execution (query) process, but not both.

In regards to binding aggregate values - how do you see that working? Should the values actually be copied into the record attributes? Should there be some kind of dynamic binding of the resulting aggregations hash and the records?

We actually have extended ES and Chewy's core functionality around aggregations to enable two additional functions:

  1. HAVINGs
  2. Sorting by aggregate values

There are two very different approaches I've considered for this part of the puzzle. The first approach is relatively naive: adding the aggregate values as attributes directly on the resulting records (this is how we do it currently). The second is analogous to a view-model approach in which there is some data store (the aggregates hash) and a means of using that data store as the backing that is actually behind the individual result records.

I will add that we found it more challenging than expected, to reliably extract the values from the aggregate's hash for an aggregation of moderate complexity (depth > 3). It is outright hard when the aggregations require ordering.

caldwecr commented 9 years ago

I propose that as a first deliverable step I create a PR that does the following ...

  1. Adds the define_agg syntax
  2. Updates the existing #aggregations method to accept a symbol or an array of symbols, which when used will cause the query's @aggregations hash to merge in the referenced aggregates

This still doesn't address the binding part of the problem, which I propose to address subsequently to these steps.

@pyromaniac - this comment and the preceding comment are ready for your consideration.

pyromaniac commented 9 years ago

Alright, let's try to submit a PR.

In regards to binding aggregate values - how do you see that working? Should the values actually be copied into the record attributes? Should there be some kind of dynamic binding of the resulting aggregations hash and the records?

Not sure yet, have to think about it

caldwecr commented 9 years ago

@pyromaniac inside of the Query code, is there a way to find out which type the current query is relative to? @_indexes shows all of the indices and each of these has a #types method; but I don't see a way to actually know which type the current query is associated with.

The reason I need this is so that I can namespace named aggregates, allowing a particular aggregation name to be used more than once.

pyromaniac commented 9 years ago

there should be @_types variable as well :)

caldwecr commented 9 years ago

@pyromaniac yes, there is, but how do I know which type the query has as its default type.

Consider

class MyIndex < Chewy::Index
  define_type Foo do
  end

  define_type Bar do
  end
end

# When I do this, the default type is Foo
MyIndex::Foo.filter { match_all }

# When I do this, the default type is Bar
MyIndex::Bar.filter { match_all }

# So inside of the Query code, how do I know if the default type for 
# some query / filter
# is relative to MyIndex::Foo or relative to MyIndex::Bar ?
pyromaniac commented 9 years ago

Actually, there is no way to find default type. You are querying on one or several types and that is all.

If you will do MyIndex.filter { match_all }, then @_types will contain both of the types of currnt index. So you probably have to iterate over the array

caldwecr commented 9 years ago

@pyromaniac can you please take a review of PR https://github.com/toptal/chewy/pull/242

caldwecr commented 9 years ago

This is the direction I'm heading after #242 gets merged.

All of the ES v1.7 aggregations can also be used more conveniently with the "typed" aggregations syntax.

The below example also introduces the syntax for the composition of aggregations. Note that requesting that an aggregation that depends on other aggregations, will also cause those underlying aggregations to be automatically invoked.

class UsersIndex < Chewy::Index
  define_type User do
    field :name
    field :rating, type: "long"

    # Becomes { names: { terms: { field: "name" } } }
    agg :names, terms: "users#user.name"

    # Note that it's best practice to not create a named aggregate with the same name as a field at the same
    # place in the hierarchy. Technically it is permissible because you reference other aggregations with the @
    # symbol prefix
    # The below aggregation definition has the effect of changing the names aggregations hash above to
    # { 
    #   names: {
    #     terms: { field: "name" },
    #     aggs: { 
    #       number_of_each_name: {
    #         value_count: { field: "name" } 
    #       }
    #     }
    #   }
    # }
    agg :number_of_each_name, value_count: "users#user@names"
  end
  define_type Post do
    field :title
    field :body
    field :comments do
      field :message
      field :rating, type: "long"
    end

    # Becomes { min_rating: { min: { field: "comments.rating" } } }
    agg :min_rating, min: "users#post.comments.rating"

    # Becomes { max_rating: { max: { field: "comments.rating" } } }
    agg :max_rating, max: "users#post.comments.rating"

    # Becomes { avg_rating: { avg: { field: "comments.rating" } } }
    agg :avg_rating, avg: "users#post.comments.rating"

    # Becomes { number_of_comments: { sum: { field: "comments" } } }
    agg :number_of_comments, sum: "users#post.comments"
  end
end

I still need to work through each of the 1.7 aggregations types (a fair number of which I've never used) before committing myself to this syntax and composition approach. Thoughts?

pyromaniac commented 9 years ago

I'm still thinking actually. The main thing here: I'm not sure about users#post.comments.rating syntax. But at least for now it doesn't seem that there is another way.