sustainable-computing-io / kepler-model-server

Model Server for Kepler
Apache License 2.0

[CI][brainstorming] make model training as a github action based on tekton #212

Open SamYuan1990 opened 11 months ago

SamYuan1990 commented 11 months ago

As a brainstorming idea: if we make model training a GitHub Action that is simply based on Tekton, we could let others contribute their training results to us, since they can run the GitHub Action on their own self-hosted GitHub runner targeting their own k8s cluster with Tekton.

https://github.com/sustainable-computing-io/kepler-model-server/blob/0609df43064742887703d4a509c7718bb1010ae1/.github/workflows/train-model-self-hosted.yml#L142-L178

sunya-ch commented 10 months ago

We might prepare another GitHub workflow, on a specific branch name, for pushing a PR with the result from their COS to kepler-model-db.

The steps I have in mind are:

  1. The contributor sets the AWS COS secret on their branch.
  2. When train-model-self-hosted or train is called, the updated model will be kept in their COS.
  3. If the branch name contains a keyword such as pr-to-kepler-model-db, the to-be-created step (e.g. pr-to-kepler-model-db) will be applied after the model is updated on the COS. This step will run a script to pull the latest image from kepler-model-db, read the model on COS, and run the export command.
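Step 3 could be sketched as a conditional GitHub Actions job along these lines (the job name, the `train` job it depends on, and the export script path are hypothetical; only the branch keyword comes from the step above):

```yaml
# Sketch: run the export/PR step only on branches whose name
# contains the keyword "pr-to-kepler-model-db".
jobs:
  pr-to-kepler-model-db:
    if: contains(github.ref_name, 'pr-to-kepler-model-db')
    needs: train            # assumed name of the job that updates the model on COS
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Export model from COS to kepler-model-db
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          # pull the latest image from kepler-model-db, read the model
          # on COS, and run the export command (script path is hypothetical)
          ./scripts/pr-to-kepler-model-db.sh
```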

@SamYuan1990 Do you want to work on this?

Note:

SamYuan1990 commented 10 months ago

Let's keep collecting requirements and ideas in this ticket. I will update my ideas and break down my plans later.

SamYuan1990 commented 10 months ago
[plan diagram image]

Here is my plan. @rootfs , @sunya-ch , @marceloamaral From a high-level point of view, I would like to raise 3 topics:

  1. Greening CI/CD: use Kepler to green the CI/CD pipeline for Kepler itself.
  2. Our test case on BM/VM.
  3. Tekton based training.

I am open to having these implemented with Tekton.

All 3 of those topics are based on our current deployment stack, which also applies to a self-hosted instance. (@jiere here)

Note: Prometheus/OTel + Kepler + model server can be deployed by any kind of deployment tooling: Helm, the operator, or manifest files.

Hence, to achieve that, we need to build new tooling and enhance our current CI tooling:

  1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
  2. local-dev-cluster to provide and set up the k8s cluster.
  3. kepler-action as a GitHub Action running on BM to set up k8s.
  4. A new GitHub Action based on Tekton to trigger model server training.
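Composed together, the four pieces might look like the workflow sketch below (the action reference and input names are assumptions, not the actual interfaces of those repos):

```yaml
# Sketch: a training workflow composed from the tooling listed above.
jobs:
  tekton-train:
    runs-on: [self-hosted]          # BM runner from aws_ec2_self_hosted_runner
    steps:
      - uses: actions/checkout@v4
      - name: Set up k8s on the BM runner
        uses: sustainable-computing-io/kepler-action@main   # inputs are assumptions
        with:
          cluster_provider: kind    # backed by local-dev-cluster
      - name: Trigger Tekton-based model training
        run: |
          kubectl apply -f model_training/tekton/pipelines/
          # then start the pipeline, e.g. with the tkn CLI
```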

Let's start with Tekton-based training. The training result is a model file (or files), which we can upload to GitHub/Open Data Hub; for self-hosted setups, whether to use a private artifactory owned by the user is open for discussion.

About tests and verification: I suppose we can reuse the kepler model server's training process as traffic load on the k8s cluster, run either just for verification purposes or with some new test cases. IMO, we can't verify Kepler without some workload, hence the workload from the training process can be reused.

Third, a green pipeline. Previously, our community wanted to build a green pipeline based on Kepler. Hence an interesting question comes up:

Can we make Kepler an example of greening the CI/CD pipeline for itself?

We can assume Kepler is a workload or a running job for a greening CI/CD pipeline. Or, from another point of view, running Kepler's benchmark testing is part of the workload, the same as a traffic load running on k8s. What is specific here is that the workload comes from Kepler itself. :-)

sunya-ch commented 10 months ago

Thank you for starting this planning.

There seem to be many points to discuss, but let me first start with the requirements for the power modeling.

CICD Test cases for each environment

(A) Test case for BM

0. setup environment

Agreed with what you planned:

> Hence, to achieve that, we need to build new tooling and enhance our current CI tooling:
>
> 1. https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner to provide us with BM on AWS.
> 2. local-dev-cluster to provide and set up the k8s cluster.
> 3. kepler-action as a GitHub Action running on BM to set up k8s.
> 4. A new GitHub Action based on Tekton to trigger model server training.

Currently, I reuse the code from local-dev-cluster to create a cluster with some modifications to the kind configuration, refer to the Kepler deployment from the main repo, and customize it to patch the model server. However, it would be nice if we could contribute the modifications back to local-dev-cluster and use kepler-operator with the KeplerInternal CR to deploy the model server components.
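For illustration, a KeplerInternal CR enabling the model server might look like this (the apiVersion and field names are assumptions based on kepler-operator, not checked against a specific release):

```yaml
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  name: kepler
spec:
  exporter:
    deployment:
      namespace: kepler-operator
  modelServer:
    enabled: true   # deploy the model server alongside the exporter
```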

https://github.com/sustainable-computing-io/kepler-model-server/blob/055f53711f0545327bad5d17af261976370f9e8e/.github/workflows/train-model-self-hosted.yml#L106-L141

1. verify feature inputs from Kepler (input)

(B) Test case for VM

1. verify feature inputs from Kepler (input)

Integration

trained model delivery

Now we have CI to push the model to the Kepler project's AWS S3 after training:

https://github.com/sustainable-computing-io/kepler-model-server/blob/055f53711f0545327bad5d17af261976370f9e8e/model_training/tekton/pipelines/single-train.yaml#L275-L309

sunya-ch commented 10 months ago

We also have to think about a CI pipeline for notifying about changes that require corresponding changes and support in the other repos.

For example,

  * kepler changes metrics (name, labels, values) --> notify kepler-model-server
  * kepler-model-server changes the model --> notify kepler-model-db to update the model
  * kepler-model-db updates --> notify kepler to sync
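One way to wire these notifications is GitHub's `repository_dispatch` API: the upstream repo fires an event and a workflow in the downstream repo listens for it. A sketch (the event type name and the token secret are assumptions):

```yaml
# In kepler: a step that notifies kepler-model-server about a metrics change.
- name: Notify kepler-model-server
  run: |
    curl -X POST \
      -H "Authorization: Bearer ${{ secrets.DISPATCH_TOKEN }}" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/sustainable-computing-io/kepler-model-server/dispatches \
      -d '{"event_type": "kepler-metrics-changed"}'
```

```yaml
# In kepler-model-server: the receiving workflow trigger.
on:
  repository_dispatch:
    types: [kepler-metrics-changed]
```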

FYI, a simplified communication diagram between the three repos: [image]

It will be added to the README page by https://github.com/sustainable-computing-io/kepler-model-server/pull/223.

sunya-ch commented 10 months ago

Here is my current refactoring design. Now, most components are done except push-pr-to-db. Still, much help is needed.

[ci-plan diagram]

SamYuan1990 commented 10 months ago

@sunya-ch , is your latest comment just for kepler and kepler-model-server? Could you please add other projects such as peaks into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.

sunya-ch commented 10 months ago

I think we also need people from the peaks project to list their requirements.

We can prepare an action to reuse the integration test, with inputs of the kepler image, the model_server image, and the deployment choice. There are multiple ways to install: 1. by operator, 2. by manifests, 3. by helm-chart. We may need to prepare all of them for the integration test.

  * Option 1 (operator) should be included in the operator schedule/push on the related repo.
  * Option 3 (helm-chart) should be in the helm-chart push on the related repo.
  * Option 2 (manifests) should run on the kepler and kepler_model_server repos when either has pushed to main.

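The reusable integration test could be a `workflow_call` workflow taking exactly those three inputs (a sketch; names and defaults are assumptions):

```yaml
# .github/workflows/integration-test.yml (sketch)
on:
  workflow_call:
    inputs:
      kepler_image:
        type: string
        required: true
      model_server_image:
        type: string
        required: true
      deployment:
        type: string        # one of: operator, manifests, helm-chart
        default: manifests
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy and test
        run: |
          echo "deploy via ${{ inputs.deployment }}"
          # deploy ${{ inputs.kepler_image }} + ${{ inputs.model_server_image }},
          # then run the integration test suite
```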
SamYuan1990 commented 9 months ago

Some TODO items after reviewing the kepler CI fix at https://github.com/sustainable-computing-io/kepler/pull/1239

SamYuan1990 commented 9 months ago

Some ideas for the self-hosted-instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster across 3 EC2 instances created by the self-hosted-instance GHA.

Is there any GHA we can reuse to set up a k8s cluster via Ansible or other CI tools, or OCP/container-ready? @rootfs wdyt
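The playbook idea above could start from an inventory along these lines (hostnames, group names, and the playbook name are placeholders):

```yaml
# inventory.yml (sketch): the 3 EC2 instances created by the
# self-hosted-instance GHA, one control plane and two workers
all:
  children:
    control_plane:
      hosts:
        ec2-node-1:
    workers:
      hosts:
        ec2-node-2:
        ec2-node-3:
```

with a setup playbook invoked as `ansible-playbook -i inventory.yml k8s-setup.yml`.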

SamYuan1990 commented 9 months ago

Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster, so that Tekton can support kepler-model-server.

SamYuan1990 commented 9 months ago

@rootfs , @jiere , @sunya-ch wdyt if we have a repo for kepler validation and kepler-model-server validation? The new repo would be such that:

  1. a release of the repo can be used for Kepler's model training and validation on a specific instance.
  2. a release of the repo can be used for investigation for peaks and clever. +@husky-parul , @wangchen615 IMO, when we do investigation for peaks or clever, we need something (a script?) to build a benchmark, and the benchmark may be an implementation of the cloud native sustainable computing benchmark white paper as part of https://github.com/cncf/tag-env-sustainability/issues/327 ?
SamYuan1990 commented 9 months ago

@sunya-ch , @rootfs , @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for kepler model server training or the kepler validation process, instead of kind (k8s in Docker)? wdyt? If yes, are any volume mount settings needed?
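A minikube-based job might look like the sketch below; `medyagh/setup-minikube` is the action in question, but the input names and the mount flags are assumptions to be checked against its docs:

```yaml
steps:
  - uses: medyagh/setup-minikube@latest
    with:
      driver: docker
      # pass minikube start flags to mount a host path into the cluster
      # (flag names follow `minikube start --help`; needed mounts TBD)
      start-args: '--mount --mount-string=/tmp/kepler-models:/mnt/models'
```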

SamYuan1990 commented 9 months ago

@rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here. My question is: as for the model server we use the CPE framework as the workload ... what kind of workload are we going to use for validation?

SamYuan1990 commented 8 months ago

Once https://github.com/sustainable-computing-io/kepler-action/pull/108 has been merged, we will try to use the latest kepler-action to integrate with kepler-model-server.

sunya-ch commented 8 months ago

> @rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here. My question is: as for the model server we use the CPE framework as the workload ... what kind of workload are we going to use for validation?

@SamYuan1990 Now, the CPE is obsolete; we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
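The stress-ng load can be sketched as a Tekton Task step like the following (the image reference and parameter name are placeholders; the actual pipeline lives under model_training/tekton in this repo):

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: stressng-load
spec:
  params:
    - name: TIMEOUT
      type: string
      default: "60s"
  steps:
    - name: stress
      image: ghcr.io/example/stress-ng:latest   # placeholder image
      command: ["stress-ng"]
      # --cpu 0 spawns one stressor per core; --cpu-load 100 drives 100% CPU
      args: ["--cpu", "0", "--cpu-load", "100", "--timeout", "$(params.TIMEOUT)"]
```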

sthaha commented 7 months ago

@SamYuan1990

Based on the discussion about validating the model, here is the setup we want to achieve for validation:

Single Bare Metal

  * Kepler on Bare Metal
  * Kepler on VM

sunya-ch commented 6 months ago

We should break down the tasks in this issue into separate issues to track the progress. I created a project for power model validation here: https://github.com/orgs/sustainable-computing-io/projects/6/views/1