Open bshien opened 2 weeks ago
Thanks @bshien. I would even go with splitting the documents at the event level, by adding event_name (coming from https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28) and setting inactive to true or false for each specific event.
The raw event data collected in https://github.com/opensearch-project/opensearch-metrics/issues/76 already has the event name.
By segregating the documents by user (maintainer), repository, and event name, we can obtain more granular metrics for maintainers, allowing us to infer whether they are active or inactive.
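A minimal sketch of that segregation, assuming each raw event has already been reduced to a dict with github_login, repo, event_name, and created_at keys. These key names are hypothetical; the actual field names in the github-activity-events index may differ.

```python
from datetime import datetime


def latest_engagement(events):
    """Return the most recent created_at per (github_login, repo, event_name).

    `events` is an iterable of dicts. The key names used here are
    assumptions made for illustration, not the real index mapping.
    """
    latest = {}
    for event in events:
        key = (event["github_login"], event["repo"], event["event_name"])
        created = datetime.fromisoformat(event["created_at"])
        # Keep only the newest timestamp seen for this key.
        if key not in latest or created > latest[key]:
            latest[key] = created
    return latest


# Made-up sample data, only to show the shape of the result.
sample_events = [
    {"github_login": "alice", "repo": "opensearch-metrics",
     "event_name": "issues", "created_at": "2024-08-01T10:00:00+00:00"},
    {"github_login": "alice", "repo": "opensearch-metrics",
     "event_name": "issues", "created_at": "2024-09-15T08:30:00+00:00"},
    {"github_login": "alice", "repo": "opensearch-metrics",
     "event_name": "pull_request", "created_at": "2024-07-20T14:00:00+00:00"},
]

per_event_latest = latest_engagement(sample_events)
```

Each key in per_event_latest would correspond to one document, so a maintainer can be flagged per event type rather than only as a whole.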
Is your feature request related to a problem?
Coming from https://github.com/opensearch-project/opensearch-metrics/issues/57
As a prerequisite for https://github.com/opensearch-project/opensearch-metrics/issues/73 and https://github.com/opensearch-project/automation-app/issues/8, there needs to be data in the Metrics cluster with information about each maintainer's repo, name, affiliation, the date they were last engaged, and their inactivity status.
What solution would you like?
An index created in the Metrics OpenSearch cluster called maintainer_engagement, which will have documents with this structure:
To create these documents, there should be a lambda that will use the github-activity-events index (from: https://github.com/opensearch-project/opensearch-metrics/issues/76) to collect/calculate the required fields for each document and index these to the maintainer_engagement index.
This lambda should:
- Read MAINTAINERS.md for each repository in the OpenSearch project, and create a mapping between repo and list of maintainers. This will yield the repo, name, github_login, and affiliation fields.
- Query the github-activity-events index for each repo, maintainer, and event type.
- Use the created_at field of each GitHub Event document to get the time_last_engaged.
- Determine whether the maintainer is inactive from time_last_engaged and how active the repo is.
  For the inactivity calculation, we can use a linear equation, y = m*x + b, where:
  x = the total number of events for a repo
  y = the amount of time a maintainer can go without engaging before we flag them as inactive
  We can calculate the slope (m) and the y-intercept (b) from two points:
  (number of events in the repo with the fewest events, upper bound on the time to wait (365 days))
  (number of events in the repo with the most events, lower bound on the time to wait (90 days))
  This gives us an equation for how long to wait for each repo: we wait longer on repos that are less active and shorter on repos that are more active.
- Aggregate all event types into a single document which will definitively say whether a maintainer is inactive.
- For each event type and the aggregate event, index these documents to the maintainer_engagement index.
Do you have any additional context?
https://github.com/opensearch-project/opensearch-metrics/issues/57
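For reference, the inactivity calculation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the fallback when every repo has the same event count is an arbitrary choice, and the aggregate rule (use the most recent engagement across all event types) is one possible reading of "aggregate all event types", not something this issue specifies.

```python
from datetime import datetime, timedelta, timezone

MAX_WAIT_DAYS = 365  # wait time for the repo with the fewest events
MIN_WAIT_DAYS = 90   # wait time for the repo with the most events


def wait_days(repo_events, least_events, most_events):
    """Evaluate y = m*x + b for one repo.

    The line passes through (least_events, 365) and (most_events, 90).
    """
    if most_events == least_events:
        # Assumption: with a single activity level, use the upper bound.
        return float(MAX_WAIT_DAYS)
    m = (MIN_WAIT_DAYS - MAX_WAIT_DAYS) / (most_events - least_events)
    b = MAX_WAIT_DAYS - m * least_events
    return m * repo_events + b


def is_inactive(time_last_engaged, repo_events, least_events, most_events, now):
    """Flag a maintainer whose last engagement is older than the repo's wait time."""
    threshold = timedelta(days=wait_days(repo_events, least_events, most_events))
    return now - time_last_engaged > threshold


def aggregate_inactive(last_engaged_by_event, repo_events, least_events,
                       most_events, now):
    """Assumed aggregate rule: judge by the newest engagement of any event type."""
    newest = max(last_engaged_by_event.values())
    return is_inactive(newest, repo_events, least_events, most_events, now)


# Made-up numbers: the quietest repo has 10 events, the busiest has 1010.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
last_seen = now - timedelta(days=100)
flag_busy_repo = is_inactive(last_seen, 1010, 10, 1010, now)
flag_quiet_repo = is_inactive(last_seen, 10, 10, 1010, now)
```

With these sample numbers, a maintainer last seen 100 days ago is flagged on the busiest repo (90-day wait) but not on the quietest one (365-day wait), which matches the intent of waiting longer on less active repos.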