sustainable-computing-io / kepler-model-server

Model Server for Kepler
Apache License 2.0

[CI][brainstorming] make model training as a github action based on tekton #212

Open SamYuan1990 opened 11 months ago

SamYuan1990 commented 11 months ago

As a brainstorming idea: if we make model training a GitHub Action that is simply based on Tekton, we could let others contribute their training results to us, since they can run the GitHub Action on their own self-hosted GitHub runner targeting their own k8s cluster with Tekton.

https://github.com/sustainable-computing-io/kepler-model-server/blob/0609df43064742887703d4a509c7718bb1010ae1/.github/workflows/train-model-self-hosted.yml#L142-L178

sunya-ch commented 10 months ago

We might prepare another GitHub workflow, on a specific branch name, for pushing a PR with the result from their COS to kepler-model-db.

The steps I have in mind are:

  1. The contributor sets the AWS COS secret on their branch.
  2. When train-model-self-hosted or train is called, the updated model will be kept in their COS.
  3. If the branch name contains a keyword such as pr-to-kepler-model-db, the to-be-created step (e.g. pr-to-kepler-model-db) will be applied after the model is updated on the COS. This step will run a script to pull the latest image from kepler-model-db, read the model on COS, and run the export command.
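Step 3 could be sketched as a conditional GitHub Actions job along these lines (the job name, the `train` job it depends on, and the export script path are hypothetical; only the branch keyword comes from the step above):

```yaml
# Sketch: run the export/PR step only on branches whose name
# contains the keyword "pr-to-kepler-model-db".
jobs:
  pr-to-kepler-model-db:
    if: contains(github.ref_name, 'pr-to-kepler-model-db')
    needs: train            # assumed name of the job that updates the model on COS
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Export model from COS to kepler-model-db
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # pull the latest image from kepler-model-db, read the model
          # on COS, and run the export command (script path is hypothetical)
          ./scripts/pr-to-kepler-model-db.sh
```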

@SamYuan1990 Do you want to work on this?

Note:

SamYuan1990 commented 10 months ago

Let's keep collecting requirements and ideas in this ticket. I will update my ideas and break down my plans later.

SamYuan1990 commented 10 months ago
[plan diagram image]

Here is my plan. @rootfs , @sunya-ch , @marceloamaral From a high-level point of view, I would like to raise 3 topics:

  1. Greening CI/CD: use Kepler to green the CI/CD pipeline for Kepler itself.
  2. Our test case on BM/VM.
  3. Tekton based training.

I am open to having these implemented with Tekton.

All 3 of those topics are based on our current deployment stack, which also applies to a self-hosted instance. (@jiere here)

Note: Prometheus/OTel + Kepler + model server can be deployed by any kind of deployment tooling: Helm, the operator, or manifest files.

Hence, to achieve that, we need to build new tooling and enhance our current CI tooling:

  1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
  2. local-dev-cluster to provide and set up the k8s cluster.
  3. kepler-action as a GitHub Action running on BM to set up k8s.
  4. A new GitHub Action based on Tekton to trigger model server training.
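Composed together, the four pieces might look like the workflow sketch below (the action reference and input names are assumptions, not the actual interfaces of those repos):

```yaml
# Sketch: a training workflow composed from the tooling listed above.
jobs:
  tekton-train:
    runs-on: [self-hosted]          # BM runner from aws_ec2_self_hosted_runner
    steps:
      - uses: actions/checkout@v4
      - name: Set up k8s on the BM runner
        uses: sustainable-computing-io/kepler-action@main   # inputs are assumptions
        with:
          cluster_provider: kind    # backed by local-dev-cluster
      - name: Trigger Tekton-based model training
        run: |
          kubectl apply -f model_training/tekton/pipelines/
          # then start the pipeline, e.g. with the tkn CLI
```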

Let's start with Tekton-based training. The training result is a model file (or files), which we can upload to GitHub/Open Data Hub; for self-hosted setups, whether to use a private artifactory owned by the user is open for discussion.

About tests and verification: I suppose we can reuse the kepler model server's training process as traffic load on the k8s cluster, run either just for verification purposes or with some new test cases. IMO, we can't verify Kepler without some workload, hence the workload from the training process can be reused.

Third, a green pipeline. Previously, our community wanted to build a green pipeline based on Kepler. Hence an interesting question comes up:

Can we make Kepler an example of greening the CI/CD pipeline for itself?

We can assume Kepler is a workload or a running job for a greening CI/CD pipeline. Or, from another point of view, running Kepler's benchmark testing is part of the workload, the same as a traffic load running on k8s. What is specific here is that the workload comes from Kepler itself. :-)

sunya-ch commented 10 months ago

Thank you for starting this planning.

There seem to be many points to discuss, but let me first start with the requirements for the power modeling.

CICD Test cases for each environment

(A) Test case for BM

0. setup environment

Agreed with what you planned:

> Hence, to achieve that, we need to build new tooling and enhance our current CI tooling:
>
> 1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
> 2. local-dev-cluster to provide and set up the k8s cluster.
> 3. kepler-action as a GitHub Action running on BM to set up k8s.
> 4. A new GitHub Action based on Tekton to trigger model server training.

Currently, I reuse the code from local-dev-cluster to create a cluster with some modifications to the kind configuration, refer to the Kepler deployment from the main repo, and customize it to patch the model server. However, it would be nice if we could contribute the modifications back to local-dev-cluster and use kepler-operator with the KeplerInternal CR to deploy the model server components.
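For illustration, a KeplerInternal CR enabling the model server might look like this (the apiVersion and field names are assumptions based on kepler-operator, not checked against a specific release):

```yaml
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  name: kepler
spec:
  exporter:
    deployment:
      namespace: kepler-operator
  modelServer:
    enabled: true   # deploy the model server alongside the exporter
```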

https://github.com/sustainable-computing-io/kepler-model-server/blob/055f53711f0545327bad5d17af261976370f9e8e/.github/workflows/train-model-self-hosted.yml#L106-L141

1. verify feature inputs from Kepler (input)

(B) Test case for VM

1. verify feature inputs from Kepler (input)

Integration

trained model delivery

Now we have CI to push the model to the Kepler project's AWS S3 after training:

https://github.com/sustainable-computing-io/kepler-model-server/blob/055f53711f0545327bad5d17af261976370f9e8e/model_training/tekton/pipelines/single-train.yaml#L275-L309

sunya-ch commented 10 months ago

We also have to think about a CI pipeline for notifying about changes that require corresponding changes and support in the other repos.

For example,

  * kepler changes metrics (name, labels, values) --> notify kepler-model-server
  * kepler-model-server changes the model --> notify kepler-model-db to update the model
  * kepler-model-db updates --> notify kepler to sync
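One way to wire these notifications is GitHub's `repository_dispatch` API: the upstream repo fires an event and a workflow in the downstream repo listens for it. A sketch (the event type name and the token secret are assumptions):

```yaml
# In kepler: a step that notifies kepler-model-server about a metrics change.
- name: Notify kepler-model-server
  run: |
    curl -X POST \
      -H "Authorization: Bearer ${{ secrets.DISPATCH_TOKEN }}" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/sustainable-computing-io/kepler-model-server/dispatches \
      -d '{"event_type": "kepler-metrics-changed"}'
```

```yaml
# In kepler-model-server: the receiving workflow trigger.
on:
  repository_dispatch:
    types: [kepler-metrics-changed]
```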

FYI, a simplified communication diagram between the three repos: [image]

It will be added to the README page by https://github.com/sustainable-computing-io/kepler-model-server/pull/223.

sunya-ch commented 10 months ago

Here is my current refactoring design. Now, most components are done except push-pr-to-db. Still, much help is needed.

[ci-plan diagram]

SamYuan1990 commented 10 months ago

@sunya-ch , is your latest comment just for kepler and kepler-model-server? Could you please add other projects such as peaks into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.

sunya-ch commented 10 months ago

I think we also need people from the peaks project to list their requirements.

We can prepare an action to reuse the integration test, with inputs of the kepler image, the model_server image, and the deployment choice. There are multiple ways to install: 1. by operator, 2. by manifests, 3. by helm-chart. We may need to prepare all of them for the integration test.

  * Option 1 (operator) should be included in the operator schedule/push on the related repo.
  * Option 3 (helm-chart) should be in the helm-chart push on the related repo.
  * Option 2 (manifests) should run on the kepler and kepler_model_server repos when either has pushed to main.

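The reusable integration test could be a `workflow_call` workflow taking exactly those three inputs (a sketch; names and defaults are assumptions):

```yaml
# .github/workflows/integration-test.yml (sketch)
on:
  workflow_call:
    inputs:
      kepler_image:
        type: string
        required: true
      model_server_image:
        type: string
        required: true
      deployment:
        type: string        # one of: operator, manifests, helm-chart
        default: manifests
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy and test
        run: |
          echo "deploy via ${{ inputs.deployment }}"
          # deploy ${{ inputs.kepler_image }} + ${{ inputs.model_server_image }},
          # then run the integration test suite
```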
SamYuan1990 commented 9 months ago

Some TODO items after reviewing the kepler CI fix at https://github.com/sustainable-computing-io/kepler/pull/1239

SamYuan1990 commented 9 months ago

Some ideas for the self-hosted-instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster across 3 EC2 instances created by the self-hosted-instance GHA.

Is there any GHA we can reuse to set up a k8s cluster via Ansible or other CI tools, or OCP/container-ready? @rootfs wdyt
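The playbook idea above could start from an inventory along these lines (hostnames, group names, and the playbook name are placeholders):

```yaml
# inventory.yml (sketch): the 3 EC2 instances created by the
# self-hosted-instance GHA, one control plane and two workers
all:
  children:
    control_plane:
      hosts:
        ec2-node-1:
    workers:
      hosts:
        ec2-node-2:
        ec2-node-3:
```

with a setup playbook invoked as `ansible-playbook -i inventory.yml k8s-setup.yml`.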

SamYuan1990 commented 9 months ago

Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster, so that Tekton can support kepler-model-server.

SamYuan1990 commented 9 months ago

@rootfs , @jiere , @sunya-ch wdyt if we have a repo for kepler validation and kepler-model-server validation? The new repo would be such that:

  1. a release of the repo can be used for Kepler's model training and validation on a specific instance.
  2. a release of the repo can be used for investigation for peaks and clever. +@husky-parul , @wangchen615 IMO, when we do investigation for peaks or clever, we need something (a script?) to build a benchmark, and the benchmark may be an implementation of the cloud native sustainable computing benchmark white paper as part of https://github.com/cncf/tag-env-sustainability/issues/327 ?
SamYuan1990 commented 9 months ago

@sunya-ch , @rootfs , @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for kepler model server training or the kepler validation process, instead of kind (k8s in Docker)? wdyt? If yes, are any volume mount settings needed?
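A minikube-based job might look like the sketch below; `medyagh/setup-minikube` is the action in question, but the input names and the mount flags are assumptions to be checked against its docs:

```yaml
steps:
  - uses: medyagh/setup-minikube@latest
    with:
      driver: docker
      # pass minikube start flags to mount a host path into the cluster
      # (flag names follow `minikube start --help`; needed mounts TBD)
      start-args: '--mount --mount-string=/tmp/kepler-models:/mnt/models'
```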

SamYuan1990 commented 9 months ago

@rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here. My question is: as for the model server we use the CPE framework as the workload ... what kind of workload are we going to use for validation?

SamYuan1990 commented 8 months ago

Once https://github.com/sustainable-computing-io/kepler-action/pull/108 has been merged, we will try to use the latest kepler-action to integrate with kepler-model-server.

sunya-ch commented 8 months ago

> @rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here. My question is: as for the model server we use the CPE framework as the workload ... what kind of workload are we going to use for validation?

@SamYuan1990 Now, the CPE is obsolete; we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
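The stress-ng load can be sketched as a Tekton Task step like the following (the image reference and parameter name are placeholders; the actual pipeline lives under model_training/tekton in this repo):

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: stressng-load
spec:
  params:
    - name: TIMEOUT
      type: string
      default: "60s"
  steps:
    - name: stress
      image: ghcr.io/example/stress-ng:latest   # placeholder image
      command: ["stress-ng"]
      # --cpu 0 spawns one stressor per core; --cpu-load 100 drives 100% CPU
      args: ["--cpu", "0", "--cpu-load", "100", "--timeout", "$(params.TIMEOUT)"]
```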

sthaha commented 7 months ago

@SamYuan1990

Based on the discussion about validating the model, here is the setup we want to achieve for validation:

Single Bare Metal

  * Kepler on Bare Metal
  * Kepler on VM

sunya-ch commented 6 months ago

We should break down the tasks in this issue into separate issues to track the progress. I created a project for power model validation here: https://github.com/orgs/sustainable-computing-io/projects/6/views/1