Aggregations - Githubissues

sid88in commented 6 years ago

Aggregations is a controversial topic.

In general key value data stored does not provide aggregations at query time.
For a smaller subset of data it doesn't matter (metrics can be aggregated at client end or server end) preferably at server end (in a separate lambda?) but it matters when we are querying across millions or billions of records (big data?). Writing queries against relational database can be costly in such cases.
GraphQL does't support aggregations (does it?). Its supposed to be a query language to integrate data from multiple data sources but nevertheless you can fetch data and do aggregations at resolver layer (would be costly if you aggregate across big data - point2)

I think we can look at http://druid.io/ as one of the integrations. Druid is open source and aggregates data on fly across millions/billions of events. GraphQL resolver will query Druid Broker (rest api) and aggregate data on fly (assuming the end user has already inserted data in historical nodes). For boilerplate we could let user select if he/she wants to use Druid as a backend for the resolver. If yes, then the user will have to setup druid on EC2 and lambda can fetch/aggregate data from druid endpoints. Something to think about and iterate on.

Any thoughts ?

sid88in commented 6 years ago

@nikgraf I have been able to write aggregation function in resolver using hashmaps for basic aggregations (for now). Next is to do a POC on MYSQL and Druid. I'll keep adding stuff related to this in coming days!

nikgraf commented 6 years ago

Want to make sure we are aligned here. Can you give me an aggregation example?

sid88in commented 6 years ago

Yes let me explain with an example:

Schema: JobID Location DeviceType Clicks Impressions Conversions, date

Records:

[
{JobID: 1,  Location: 2,  DeviceType: 1,  Clicks: 100,  Impressions: 1000,  Conversions, 10, date: 10/22},
{JobID: 100,  Location: 20,  DeviceType: 3,  Clicks: 1010,  Impressions: 10100,  Conversions, 100, 10/22},
.....
{JobID: 1M,  Location: 200,  DeviceType: 2,  Clicks: 2222,  Impressions: 22200,  Conversions, 200, 
02/22},
....
{JobID: 100M,  Location: 2000,  DeviceType: 1,  Clicks: 2022,  Impressions: 3000,  Conversions:100, 03/20},
...
{JobID: 1B,  Location: 199,  DeviceType: 3,  Clicks: 100,  Impressions: 199,  Conversions, 100, 08/15},
]

You can have millions or billions of records in your relational database tables and for queries such as:

Find sum of impressions for a given jobid by devicetype in range 01/2017 - 10/2017 you will have to aggregate impressions across millions/billions of records and this is an extremely expensive query to fetch the result in real time by an API. Also, if you think about pre-aggregating data and storing it in key-value database then you can't scale it because its not possible to store aggregations at every permutation and combination of date ranges and dimensions.
Firing queries like above would require us to sum over these records, group by certain dimension or perform joins in real time any of these will slow down the system drastically.
Find count distinct dimensions - count distincts are costly operations which are solved by druid using hyperloglog.
You can think about building an aggregator in graphql itself but to aggregate this huge amount of data will again increase api latency

serverless / serverless-graphql

Aggregations #121