opensearch-project / opensearch-metrics

OpenSearch Metrics
https://metrics.opensearch.org
Apache License 2.0

[FEATURE] Index GitHub Events to the Metrics cluster #76

Open bshien opened 3 weeks ago

bshien commented 3 weeks ago

Is your feature request related to a problem?

Coming from https://github.com/opensearch-project/opensearch-metrics/issues/75

In order to index data about maintainer_engagement, GitHub Events first need to be indexed into the Metrics cluster.

What solution would you like?

There should be an index in the Metrics cluster called github-activity-events that has documents representing GitHub Events created in the OpenSearch project.

Using the GitHub Automation App, listen on GitHub Events created by the opensearch-project organization. Index a document for each event with these fields:

{
  id, // Unique identifier for the event.
  org.name, // The name of the organization.
  repo.name, // The name of the repository.
  type, // The type of event.
  action, // The action that was performed (opened, edited, closed, etc.).
  sender.login, // The username of the actor that triggered the event.
  created_at // The date and time the event was triggered.
}

A document will look like this:

{
    "id": "acfc0636-472e-440f-9693-5db93d999fe5",
    "organization": "opensearch-project",
    "repository": "opensearch-metrics",
    "type": "issues",
    "action": "opened",
    "sender": "bshien",
    "created_at": "2024-08-27T00:31:56Z"
}
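
A minimal sketch of what that handler could look like, assuming a Probot-based automation app and the opensearch-js client (the endpoint and index name here are illustrative):

// Hypothetical handler sketch: a Probot-based app that indexes one
// document per received GitHub event into the Metrics cluster.
import { Probot } from 'probot';
import { Client } from '@opensearch-project/opensearch';

// Assumed cluster endpoint; the real app would load this from config.
const client = new Client({ node: 'https://metrics.opensearch.org' });

export default (app: Probot) => {
  // onAny fires for every webhook delivery the app is subscribed to.
  app.onAny(async (event) => {
    const payload = event.payload as any;
    await client.index({
      index: 'github-activity-events',
      body: {
        id: event.id,                              // webhook delivery GUID
        organization: payload.organization?.login, // e.g. "opensearch-project"
        repository: payload.repository?.name,      // e.g. "opensearch-metrics"
        type: event.name,                          // e.g. "issues"
        action: payload.action,                    // e.g. "opened"
        sender: payload.sender?.login,             // e.g. "bshien"
        created_at: new Date().toISOString(),      // time the event was received
      },
    });
  });
};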

What alternatives have you considered?

An alternative is using the GitHub Events API to query past Events, but because it is a pull-based system, it is not trivial to add only new Events that have not already been indexed into the cluster.
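
For reference, a pull-based poll with Octokit would look roughly like the sketch below; keeping the dedup state (the seen set) consistent is exactly the non-trivial part:

// Hypothetical pull-based alternative: poll the org events feed and
// skip events that were already indexed.
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const seen = new Set<string>(); // in reality this state must be persisted somewhere

async function pollOrgEvents(org: string) {
  // The events feed only exposes recent activity, so any polling gap loses data.
  const { data: events } = await octokit.rest.activity.listPublicOrgEvents({
    org,
    per_page: 100,
  });
  for (const event of events) {
    if (seen.has(event.id)) continue; // already indexed, skip
    seen.add(event.id);
    // ...index the new event into the cluster here...
  }
}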

Do you have any additional context?

https://github.com/opensearch-project/opensearch-metrics/issues/57

dblock commented 2 days ago

The data ingestion problem is a very common one. My general feedback is that because the GitHub API is heavily rate-limited, we should be storing raw data coming from GitHub somewhere (e.g. S3) first, then having a process that ingests that data into the metrics cluster as close as possible to its original format, then separately aggregating it for the needs of our applications / dashboards, potentially just using OpenSearch aggregation capabilities. This solves a number of general problems.

  1. If we need to change the data format in the cluster we can replay from the raw storage.
  2. If we need a new aggregation we can produce it from the raw data ingested.
  3. There's a clear separation of concerns between obtaining the data, aggregating it, and rendering it.
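
As an illustration of point 1, a replay from raw storage could be a simple list-and-bulk-index pass (a sketch, assuming raw events land in S3 as JSON objects; bucket, prefix, and index names are made up):

// Hypothetical replay: read raw event objects from S3 and bulk-index
// them into the metrics cluster in (close to) their original format.
import { S3Client, ListObjectsV2Command, GetObjectCommand } from '@aws-sdk/client-s3';
import { Client } from '@opensearch-project/opensearch';

const s3 = new S3Client({});
const os = new Client({ node: 'https://metrics.opensearch.org' }); // assumed endpoint

async function replay(bucket: string, prefix: string, index: string) {
  // Pagination is omitted for brevity.
  const listed = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
  const body: object[] = [];
  for (const obj of listed.Contents ?? []) {
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: obj.Key! }));
    const raw = JSON.parse(await res.Body!.transformToString());
    body.push({ index: { _index: index } }, raw); // bulk action line + document
  }
  if (body.length > 0) await os.bulk({ body });
}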
bshien commented 2 days ago

Thanks for the feedback dB!

Currently the proposed design is to use the automation app, which can listen for incoming events in a push-based way: https://github.com/opensearch-project/automation-app

Then, after an event is received, index it as raw data into an index in the Metrics OpenSearch cluster.

Then, as part of https://github.com/opensearch-project/opensearch-metrics/issues/75, the raw data in that index will be reindexed into another index specifically for the purposes of the Maintainer Dashboard. Finally, OSD is used to render the Maintainer Dashboard.
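
The reindex step itself can lean on the cluster's own Reindex API; a minimal sketch (the destination index name is hypothetical):

import { Client } from '@opensearch-project/opensearch';

const client = new Client({ node: 'https://metrics.opensearch.org' }); // assumed endpoint

// Copy the raw events into a dashboard-specific index; #75 would add
// its own mappings/transforms on top of this.
async function reindexForDashboard() {
  await client.reindex({
    body: {
      source: { index: 'github-activity-events' },
      dest: { index: 'maintainer-dashboard' }, // hypothetical index name
    },
  });
}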

The way the current design differs from your suggestion is that the raw data we index using the automation app is not that close to the original format. It only contains these fields:

{
  "id": "acfc0636-472e-440f-9693-5db93d999fe5",
  "organization": "opensearch-project",
  "repository": "opensearch-metrics",
  "type": "issues",
  "action": "opened",
  "sender": "bshien",
  "created_at": "2024-08-27T00:31:56Z"
}

Also, currently these are the only events that will be listened to: https://github.com/opensearch-project/automation-app/blob/main/configs/operations/github-activity-events-monitor.yml

This makes the raw data fairly specific to the Maintainer Dashboard use case.

Note: We added these limitations to the raw data ingestion because of concerns about whether our OpenSearch cluster can handle all of those events at that volume of data.

Do you suggest we use the automation app to store events with all the available data included, and instead of using an OpenSearch cluster, store them in something like S3?

This would fully separate obtaining the data from aggregating it.

Additionally, a drawback of using the automation app is that if the app goes down, doing a backfill is not trivial. This may be relevant if we are building a generic store for GitHub Events.

bshien commented 1 day ago

After some discussion, it seems like creating a data lake for GitHub Events would be very useful in the future. The proposed design is to use the automation app to upload the events as raw data to a Metrics S3 bucket. Then, we can index portions of the raw data into the Metrics cluster to leverage its search capability for the Maintainer Dashboard use case.

prudhvigodithi commented 1 day ago

Continuing the discussion from https://github.com/opensearch-project/automation-app/pull/24#discussion_r1792350414, based on the list of event names (https://probot.github.io/api/latest/classes/context.Context.html#name), we can go with something like:

s3://opensearch-project-github-events/<event_name>/<date>/repo_name-uuid

Along with it, add the tags event_type, repo_name, and event_date (https://repost.aws/questions/QUxBzMJVu0Sd2uMeBERzXlVA/query-s3-objects-on-tags-values).

The above should allow us to quickly fetch the right set of documents by event type and date. For everything else, we can fetch the documents from S3 and index them into the OpenSearch cluster for more complex filtering.
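
A sketch of what the upload with that key layout and tags could look like (the bucket name matches the prefix above; the helper itself is illustrative):

// Hypothetical upload using the proposed key layout and object tags.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { randomUUID } from 'node:crypto';

const s3 = new S3Client({});

async function uploadEvent(eventName: string, repoName: string, payload: object) {
  const date = new Date().toISOString().slice(0, 10); // e.g. "2024-08-27"
  const key = `${eventName}/${date}/${repoName}-${randomUUID()}`;
  await s3.send(new PutObjectCommand({
    Bucket: 'opensearch-project-github-events', // assumed bucket name
    Key: key,
    Body: JSON.stringify(payload),
    // Tags make objects queryable without parsing keys.
    Tagging: `event_type=${eventName}&repo_name=${repoName}&event_date=${date}`,
  }));
}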

@dblock @bshien @getsaurabh02 @peterzhuamazon WDYT?

dblock commented 1 day ago

I like it. I wouldn't go overboard in treating S3 as a database, though; most importantly, you want the ability to quickly replay events for time windows to (re)ingest them into a cluster where you can actually aggregate, sort, etc.

prudhvigodithi commented 1 day ago

Thanks dB, this flattened structure s3://opensearch-project-github-events/<event_name>/<date>/repo_name-uuid would allow us to quickly get the right set of documents, which we can later index into the cluster; for all other complex queries and operations, the idea is to use the OpenSearch cluster.
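
Under that layout, pulling one event type for one day is a single prefix listing, e.g. (a sketch; bucket name as above):

// Hypothetical: list all object keys for one event type on one day,
// e.g. listEventKeys('issues', '2024-08-27').
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function listEventKeys(eventName: string, date: string): Promise<string[]> {
  const out = await s3.send(new ListObjectsV2Command({
    Bucket: 'opensearch-project-github-events', // assumed bucket name
    Prefix: `${eventName}/${date}/`,
  }));
  return (out.Contents ?? []).map((o) => o.Key!);
}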

rishabh6788 commented 1 day ago

Agree with the S3 approach; this will allow us to have our own data lake of all GitHub events, and consumers can pick and choose how to process it. However, streaming data from the GitHub bot to S3 may not be straightforward. I believe we can stream data from the GitHub automation bot to Kinesis Data Firehose, buffer it until an appropriate size, say 100 MB, and then write it to S3.

The S3 write event can then trigger logic to process the data and index it wherever required. Kinesis Firehose is a powerful service that acquires, transforms, and delivers data streams, and it has direct integration with the OpenSearch service as well.
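
A sketch of that flow from the app side, assuming a pre-created delivery stream (the stream name is illustrative):

// Hypothetical: send one event to a Kinesis Data Firehose delivery
// stream; the service buffers records and writes batches to S3 once
// the configured size/time threshold is reached.
import { FirehoseClient, PutRecordCommand } from '@aws-sdk/client-firehose';

const firehose = new FirehoseClient({});

async function sendEvent(payload: object) {
  await firehose.send(new PutRecordCommand({
    DeliveryStreamName: 'github-events-to-s3', // assumed stream name
    Record: { Data: Buffer.from(JSON.stringify(payload) + '\n') },
  }));
}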

prudhvigodithi commented 1 day ago

Thanks @rishabh6788, @bshien has created a PR https://github.com/opensearch-project/automation-app/pull/24; for every event the app listens to, it will upload the event to the S3 bucket. We can initially start with this flow, and if we are bombarded with too many events or hit API limitations (where uploads fail), then yes, we can use some staging tool in between and push to S3 after a certain threshold.