omsf-eco-infra / gha-runner

A simple GitHub Action for creating cloud-based self-hosted runners.
MIT License

Add opt-in metrics collection #27

Open ethanholz opened 3 weeks ago

ethanholz commented 3 weeks ago

This issue tracks ideas and concerns about metrics collection in this GitHub Action. The main goal is to document the decision process behind the metrics we want to collect and use.

To start, I am considering using the JSON Schema specification to provide a publicly auditable contract with our users. A schema also gives us a known interface to build tooling against for the metrics we collect.

Here is the rough outline of things I am interested in collecting:

IAlibay commented 3 weeks ago

Would there be any interest in collecting maximum CPU, GPU, and RAM usage? As discussed elsewhere, we would be interested in getting that data for our own debugging purposes. If the overhead is small, it might be good to collect that data more broadly to see "how well is an instance type getting used?"

mattwthompson commented 3 weeks ago

I can't currently think of any expectations of privacy with my use of these runners. I'm in favor of my usage being tracked wherever the data might be useful.

At the level of individual runs, I can only think of resource usage, for reasons similar to those @IAlibay described, and presumably this would also be useful for people to review at an organization level every few months. I, for example, have no idea how much VRAM I'm using or need.

ethanholz commented 3 weeks ago

I appreciate the feedback! Getting actual system usage may require some more research to ensure that we can get that data via an API, and it will likely require some role permission changes on the AWS side. If this is the case, I think this data will be considered "optional telemetry" (especially if we can't ensure a similar API on other cloud providers).

tl;dr: I do think system usage is worth looking into and will update on what I find.

ethanholz commented 2 weeks ago

I was able to find a resource from AWS on using a CloudWatch script for GPU monitoring. This would require adding an additional script to these machines (which we should be able to do, since we already do something similar to install GitHub's runner software).

The question is what additional permissions would be needed on the AWS side to query those CloudWatch metrics.
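
As a starting point for that question, here is a rough sketch of what a read-only policy statement might look like, built as a Python dict and rendered to JSON. The action names are existing CloudWatch IAM actions, but whether this exact set is what the action would need is an assumption and has not been tested; `READ_METRICS_POLICY`, `render_policy`, and the `Sid` value are hypothetical names used only for illustration.

```python
import json

# Hypothetical read-only policy for querying CloudWatch metrics.
# Assumption: the action names below are real CloudWatch IAM actions, but
# this set has not been verified against what gha-runner would need.
READ_METRICS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRunnerMetrics",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics",
            ],
            # Assumption: metric reads are granted on all resources here.
            "Resource": "*",
        }
    ],
}


def render_policy(policy):
    """Serialize the policy dict to the JSON text an IAM policy expects."""
    return json.dumps(policy, indent=2)
```

If the monitoring script on the instance publishes custom metrics, the instance role would presumably also need `cloudwatch:PutMetricData`; that is a separate grant from the read side sketched here.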

ethanholz commented 1 week ago

I have the start of a JSON schema (see below). It does not yet cover system metrics; this is the most we can collect without changing the AWS permissions we recommend in our docs.

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://github.com/omsf-eco-infra/gha-runner/blob/main/metrics.schema.json",
    "title": "gha-runner metrics",
    "description": "The metrics schema to be collected when using this action",
    "type": "object",
    "properties": {
        "version": {
            "description": "The version of the schema.",
            "type": "string"
        },
        "repository": {
            "description": "The repository using the action.",
            "type": "string",
            "pattern": ".*\/.*"
        },
        "workflowName": {
            "description": "The name of the workflow that is run",
            "type": "string"
        },
        "trigger": {
            "description": "The GitHub event that triggered the run",
            "type": "string"
        },
        "cloudProvider": {
            "description": "The cloud provider used",
            "type": "string"
        },
        "instanceType": {
            "description": "The cloud instance type",
            "type": "string"
        },
        "runtime": {
            "description": "The total runtime of the action",
            "type": "string",
            "format": "duration"
        }
    },
    "required": ["repository", "workflowName", "trigger", "cloudProvider", "instanceType", "runtime"]
}
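
To make the contract concrete, here is a minimal standard-library sketch of the checks the schema above implies for a payload. It is not a full JSON Schema validator; a real implementation would more likely use a dedicated validation library. The `check_metrics` helper, the `sample` values, and the simplified duration regex are all assumptions for illustration. Note that in draft 2020-12, `format` is an annotation by default, so `"format": "duration"` is only enforced if the validator is configured to assert formats.

```python
import re

# Required keys copied from the schema's "required" list.
REQUIRED = ("repository", "workflowName", "trigger",
            "cloudProvider", "instanceType", "runtime")
# All properties the schema declares (every one is typed as a string).
KNOWN = REQUIRED + ("version",)

# Simplified ISO 8601 duration pattern. Assumption: this covers common
# forms such as "PT12M30S" or "P1DT2H", not every edge case of the format.
DURATION_RE = re.compile(
    r"^P(?!$)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?"
    r"(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?$"
)


def check_metrics(payload):
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    for key in REQUIRED:
        if key not in payload:
            problems.append(f"missing required field: {key}")
    for key in KNOWN:
        if key in payload and not isinstance(payload[key], str):
            problems.append(f"{key} must be a string")
    repo = payload.get("repository")
    # Same unanchored pattern as the schema: any string containing a slash.
    if isinstance(repo, str) and not re.search(r".*/.*", repo):
        problems.append("repository must look like owner/name")
    runtime = payload.get("runtime")
    if isinstance(runtime, str) and not DURATION_RE.match(runtime):
        problems.append("runtime must be an ISO 8601 duration")
    return problems


# Hypothetical payload; the values are illustrative, not collected data.
sample = {
    "repository": "omsf-eco-infra/gha-runner",
    "workflowName": "CI",
    "trigger": "push",
    "cloudProvider": "aws",
    "instanceType": "g4dn.xlarge",
    "runtime": "PT12M30S",
}
```

Calling `check_metrics(sample)` returns an empty list, while a payload with `"runtime": "12 minutes"` is reported as an invalid duration.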