inveniosoftware / invenio-stats

Statistical data processing and querying for Invenio.
https://invenio-stats.readthedocs.io
MIT License
8 stars 24 forks

Aggregations: allow `terms` bucket aggregations #130

Closed max-moser closed 1 year ago

max-moser commented 1 year ago

This PR adds support for bucket aggregations alongside the already supported metric aggregations. Its main purpose is to enable simple document counts by keyword ("how often does each value for a key occur across all events"), e.g. for the boolean `via_api` flag on record-view events.

Reference payload:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {},
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}
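
To illustrate the "flat" keyword counts in the payload above, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the search backend; `flat_terms_counts` and the sample events are hypothetical and only mirror the `via_api` numbers shown above.

```python
from collections import Counter


def flat_terms_counts(events, field):
    """Count how often each value of `field` occurs across the events.

    Hypothetical helper for illustration only, not part of invenio-stats.
    """
    counts = Counter()
    for event in events:
        if field not in event:
            continue
        value = event[field]
        # JSON object keys are strings, so booleans become "true"/"false".
        key = str(value).lower() if isinstance(value, bool) else str(value)
        counts[key] += 1
    return dict(counts)


# Assumed sample events: three record views, two of them via the API.
events = [
    {"recid": "mnccv-j8r20", "via_api": True},
    {"recid": "mnccv-j8r20", "via_api": True},
    {"recid": "mnccv-j8r20", "via_api": False},
]

aggregation = {
    "count": len(events),
    "via_api": flat_terms_counts(events, "via_api"),
}
```

The result carries only one level of counts per key, which is the "flat" shape described in this PR.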

Limitations:

1) The bucket aggregations here just add "flat" (keyword count) information to the resulting aggregation, but can't be used to do nested bucketing. That would probably increase complexity a lot and isn't required right now.
2) For now, only `terms`-type bucket aggregations are supported.

Discussion:

v1:

{
  "timestamp": "2023-02-21T00:00:00",
  "unique_id": "recid_mnccv-j8r20",
  "count": 3,
  "unique_count": 1,
  "via_api": {
    "true": 2,
    "false": 1
  },
  "countries": {
    "portugal": 3,
    "china": 0
  },
  "recid": "mnccv-j8r20",
  "parent_recid": "b9jbe-qq607"
}

v2:

{
  // ...
  "countries": [
    {"key": "portugal", "count": 3},
    {"key": "china", "count": 0}
  ]
}
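
The two candidate shapes carry the same information, so converting between them is mechanical. A minimal sketch, assuming the `key`/`count` field names from v2 above; the function names are hypothetical and not part of invenio-stats:

```python
def to_list_form(buckets):
    """Convert a v1 mapping ({"portugal": 3}) into the v2 list of objects."""
    return [{"key": key, "count": count} for key, count in buckets.items()]


def to_mapping_form(buckets):
    """Convert a v2 list of {"key": ..., "count": ...} objects back into v1."""
    return {bucket["key"]: bucket["count"] for bucket in buckets}


v1_countries = {"portugal": 3, "china": 0}
v2_countries = to_list_form(v1_countries)
```

One difference worth weighing: v1 puts arbitrary values (e.g. country names) into field names, while v2 keeps the set of field names fixed (`key`, `count`).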

Use cases: 1) requests on Zenodo support: from which countries was the record viewed/downloaded 2)

Charts: https://github.com/inveniosoftware/invenio-stats/issues/120

max-moser commented 1 year ago

:warning: As per discussion with @slint on Discord, this feature needs a bit more discussion about which use cases we expect in the future and how to design the data structure so that we don't box ourselves in.

max-moser commented 1 year ago

As we've discussed, we couldn't come up with immediate use cases that would actually depend on this feature being merged. We only found a few nice-to-have future use cases (e.g. showing from which countries the records/files were downloaded), but they would need to be fleshed out some more.

The tricky part here is that aggregations are only an "intermediate" result, because the queries still need some way of aggregating the aggregations (events -> aggregations -> queries). As such, the aggregations need to be aggregatable themselves. :woozy_face:
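
To make "aggregating the aggregations" concrete: for the flat v1 shape, a query could sum the per-key counts across several aggregation documents. The following is only a sketch in plain Python under that assumption; `merge_aggregations` and the sample documents are hypothetical, and it deliberately handles only `count` and the flat bucket fields.

```python
from collections import Counter


def merge_aggregations(aggregations, bucket_fields):
    """Sum `count` and the flat per-key bucket counts over several aggregations.

    Hypothetical helper for illustration only; other fields are ignored.
    """
    total_count = 0
    buckets = {field: Counter() for field in bucket_fields}
    for aggregation in aggregations:
        total_count += aggregation.get("count", 0)
        for field in bucket_fields:
            # Counter.update() adds the counts of a mapping key by key.
            buckets[field].update(aggregation.get(field, {}))
    merged = {"count": total_count}
    for field in bucket_fields:
        merged[field] = dict(buckets[field])
    return merged


# Assumed sample: two daily aggregations for the same record.
daily = [
    {"count": 3, "via_api": {"true": 2, "false": 1}},
    {"count": 2, "via_api": {"false": 2}},
]
merged = merge_aggregations(daily, ["via_api"])
```

Flat counts stay summable this way; nested buckets would need a recursive merge, which is part of the extra complexity mentioned under the limitations.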