opensearch-project / opensearch-metrics

OpenSearch Metrics
https://metrics.opensearch.org
Apache License 2.0

[FEATURE] Ingest Maintainer last engaged date into Metrics cluster #75

Open bshien opened 2 weeks ago

bshien commented 2 weeks ago

Is your feature request related to a problem?

Coming from https://github.com/opensearch-project/opensearch-metrics/issues/57

As a prerequisite for https://github.com/opensearch-project/opensearch-metrics/issues/73 and https://github.com/opensearch-project/automation-app/issues/8, there needs to be data in the Metrics cluster with information about each maintainer's repo, name, affiliation, the date they were last engaged, and their inactivity status.

What solution would you like?

An index created in the Metrics OpenSearch cluster called maintainer_engagement, which will have documents with this structure:

{
    "id": "8baa664c-dec0-4201-b4b9-9747c2e7ee45",
    "repository": "opensearch-metrics",
    "name": "Brandon Shien",
    "github_login": "bshien",
    "affiliation": "Amazon",
    "event_type": "issues",
    "event_action": "opened",
    "time_last_engaged": "2024-08-27T00:31:56Z",
    "inactive": false
}

To create these documents, there should be a Lambda function that uses the github-activity-events index (from https://github.com/opensearch-project/opensearch-metrics/issues/76) to collect or calculate the required fields for each document and index the documents into the maintainer_engagement index.

This lambda should:

  1. Scrape the MAINTAINERS.md for each repository in the OpenSearch project, and create a mapping between repo and list of maintainers. This will yield the repo, name, github_login, and affiliation fields.
  2. Iterate through the mappings and run a top hits query against the github-activity-events index to get the latest document for each repo, maintainer, and event type.
  3. Use the created_at field of each GitHub Event document to get the time_last_engaged.
  4. For each event type, calculate whether the maintainer should be considered active or inactive based on time_last_engaged and how active the repo is.

For the inactivity calculation, we can use a linear equation, y = m*x + b, where:

x = the total number of events for a repo
y = the amount of time a maintainer is inactive before we flag them as inactive

And we can calculate the slope (m) and the y-intercept (b) with two points:

(# of events in the repo with the least events, higher bound time to wait (365 days))
(# of events in the repo with the most events, lower bound time to wait (90 days))

This way we have an equation to calculate how long to wait for each repo: we wait longer on repos that are less active and shorter on repos that are more active.
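The linear equation can be sketched in Python as follows. This is only an illustration of the two-point calculation described above: the function names, the event counts in the usage note, and the handling of the case where every repo has the same number of events are assumptions, not part of the proposal.

```python
from datetime import datetime, timedelta, timezone

# Bounds taken from the issue text.
UPPER_BOUND_DAYS = 365  # wait time for the repo with the fewest events
LOWER_BOUND_DAYS = 90   # wait time for the repo with the most events


def inactivity_threshold_days(repo_events, min_events, max_events):
    """Return y = m*x + b, the number of days to wait for a repo with x events.

    The line passes through (min_events, 365) and (max_events, 90).
    """
    if max_events == min_events:
        # Assumption: with a single data point, fall back to the upper bound.
        return float(UPPER_BOUND_DAYS)
    m = (LOWER_BOUND_DAYS - UPPER_BOUND_DAYS) / (max_events - min_events)
    b = UPPER_BOUND_DAYS - m * min_events
    return m * repo_events + b


def is_inactive(time_last_engaged, now, threshold_days):
    """Flag a maintainer whose last engagement is older than the threshold."""
    return now - time_last_engaged > timedelta(days=threshold_days)
```

For example, with a hypothetical least active repo at 100 events and a most active repo at 1200 events, `inactivity_threshold_days(650, 100, 1200)` lands halfway between the two bounds, and a repo at either extreme gets exactly 365 or 90 days.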

  5. Aggregate all event types into a single document that definitively says whether a maintainer is inactive.

  6. For each event type and the aggregate event, index these documents to the maintainer_engagement index.
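The "latest document per repo, maintainer, and event type" lookup could be expressed as a top hits aggregation sorted by created_at. The sketch below only builds the request body as a plain dictionary. Apart from created_at, the field names (`repository`, `sender`, `type`) are placeholders, since the mapping of the github-activity-events index is not described here, and `latest_event_query` is a hypothetical helper name.

```python
def latest_event_query(repo, github_login, event_type):
    """Build a search body returning the newest matching event document.

    Assumption: `repository`, `sender`, and `type` stand in for the real
    field names in the github-activity-events index.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not search hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"repository": repo}},
                    {"term": {"sender": github_login}},
                    {"term": {"type": event_type}},
                ]
            }
        },
        "aggs": {
            "latest_event": {
                "top_hits": {
                    "size": 1,  # keep only the most recent document
                    "sort": [{"created_at": {"order": "desc"}}],
                }
            }
        },
    }
```

The created_at value of the single returned hit would then become time_last_engaged for that repo, maintainer, and event type.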

Do you have any additional context?

https://github.com/opensearch-project/opensearch-metrics/issues/57

prudhvigodithi commented 2 weeks ago

Thanks @bshien. I would even go with splitting the documents at the event level, by adding event_name (coming from https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28) and setting inactive to true or false for each specific event.

The raw event data collected in https://github.com/opensearch-project/opensearch-metrics/issues/76 already has the event name.

By segregating the documents by user (maintainer), repository, and event name, we can obtain more granular metrics for maintainers, allowing us to infer whether they are active or inactive.
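One way the per-event documents might be rolled up into a single verdict is sketched below. The rule used here (a maintainer is inactive only if inactive for every event type) and the `"All"` marker for the aggregate document are assumptions for illustration; the thread does not fix either.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches values like 2024-08-27T00:31:56Z


def aggregate_engagement(event_docs):
    """Combine per-event documents for one maintainer in one repository."""
    if not event_docs:
        raise ValueError("at least one per-event document is required")
    # The most recent engagement across all event types.
    latest = max(
        event_docs,
        key=lambda d: datetime.strptime(d["time_last_engaged"], TS_FORMAT),
    )
    return {
        "repository": latest["repository"],
        "github_login": latest["github_login"],
        "event_type": "All",  # hypothetical marker for the aggregate document
        "time_last_engaged": latest["time_last_engaged"],
        # Assumption: inactive overall only if inactive for every event type.
        "inactive": all(d["inactive"] for d in event_docs),
    }
```

Under this rule, a maintainer who has stopped opening issues but still reviews pull requests would remain active overall, while the per-event documents keep the finer-grained picture.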