SamYuan1990 opened 11 months ago
We might prepare another GitHub workflow, triggered on a specific branch name, for pushing a PR with the result from their COS to kepler-model-db.
The steps as I see them:
pr-to-kepler-model-db
will be applied after the model is updated on the COS. This step will run a script to pull the latest image from kepler-model-db, read the model on the COS, and run the export command.

@SamYuan1990 Do you want to work on this?
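A rough sketch of such a workflow, assuming a hypothetical trigger branch, export script, image name, and secrets (none of these are decided in this thread):

```yaml
# Hypothetical sketch of pr-to-kepler-model-db; every name below is an assumption.
name: pr-to-kepler-model-db
on:
  push:
    branches: [model-update]            # assumed trigger branch
jobs:
  export-and-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull the latest kepler-model-db image
        run: docker pull quay.io/sustainable_computing_io/kepler-model-db:latest  # image name is an assumption
      - name: Read the model from COS and run the export command
        run: ./scripts/export_model.sh   # hypothetical script
        env:
          COS_ACCESS_KEY: ${{ secrets.COS_ACCESS_KEY }}   # hypothetical secret
      - name: Open a PR against kepler-model-db
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.MODEL_DB_PAT }}              # hypothetical PAT secret
          title: "Update exported model"
```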
Note:
Let's keep collecting requirements and ideas in this ticket. I will update my ideas and break down my plans later.
Here is my plan. @rootfs , @sunya-ch , @marceloamaral from a high-level point of view, I would like to cover 3 topics.
I am open to implementing things with Tekton.
All 3 topics are based on our current deployment stack, which also applies to a self-hosted instance. (@jiere here)
Note that Prometheus/OTel + Kepler + model server can be deployed by any kind of deployment tooling: Helm, the operator, or manifest files.
Hence, to achieve that, we need to build new CI tooling and enhance our current CI tooling.
1st, Tekton-based training. The training result is a model file (or files), which we can upload to GitHub/Open Data Hub or, for self-hosted setups, to a private artifactory owned by our user; this is open for discussion.
2nd, test and verification. I suppose we can reuse the kepler-model-server's training process as traffic load on the k8s cluster, either just for verification purposes or with some new test cases. IMO, we can't verify Kepler without some workload; hence the workload from the training process can be reused.
3rd, a green pipeline. Previously, our community wanted to build a green pipeline based on Kepler. Hence an interesting question comes up:
Can we make Kepler an example of greening a CI/CD pipeline, using Kepler itself?
Suppose Kepler is a workload, or a running job, in a greening CI/CD pipeline. From another point of view, running Kepler's benchmark tests is part of the workload, the same as a traffic load running on k8s; what is specific here is that the workload comes from Kepler itself. :-)
Thank you for starting this planning.
There seem to be many points to discuss, but let me first start with the requirements for power modeling.
Agree to what you planned:
Hence, to achieve that, we need to build new CI tooling and enhance our current CI tooling.
https://github.com/sustainable-computing-io/aws_ec2_self_hosted_runner provides us with bare metal on AWS. local-dev-cluster provides and sets up the k8s cluster. kepler-action is a GitHub action running on the bare metal to set up k8s. A new GitHub action based on Tekton triggers model server training.
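Chained together, those four pieces could look roughly like the job below; the runner labels, action version pin, and pipeline name are assumptions, not settled choices:

```yaml
# Sketch only: a training job on a bare-metal self-hosted runner.
jobs:
  train:
    runs-on: [self-hosted, aws-bm]   # runner provisioned by aws_ec2_self_hosted_runner (label is an assumption)
    steps:
      - uses: actions/checkout@v4
      - name: Set up a k8s cluster on the bare-metal runner
        uses: sustainable-computing-io/kepler-action@v0.0.5   # version pin is an assumption
      - name: Install Tekton and start model server training
        run: |
          kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
          tkn pipeline start model-server-training            # hypothetical pipeline name
```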
Currently, I reuse the code from local-dev-cluster to create a cluster with some modifications to the kind configuration, referring to the Kepler deployment from the main repo, customized to patch the model server.
However, it would be nice if we could upstream those modifications to local-dev-cluster and use kepler-operator with the KeplerInternal CR to deploy the model server components.
Now, we have CI to push the model to the Kepler project's AWS S3 after training.
I am planning to extend the model server to load the model from S3: https://github.com/sustainable-computing-io/kepler-model-server/issues/213
[Discussion required] We may create a secret that allows only s3:GetObject and keep it inside the Kepler base image, so users can load the model from our S3; or we add CI to push a PR to the kepler-model-db GitHub repo and keep the URL-based approach as it is today.
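For the first option, the credential baked into the base image could be scoped to a single read-only action. A minimal IAM policy sketch (CloudFormation-style YAML; the bucket name is hypothetical):

```yaml
# Read-only policy sketch; "kepler-power-model" is a hypothetical bucket name.
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action: s3:GetObject
      Resource: arn:aws:s3:::kepler-power-model/*
```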
We also need CI to push a PR to Kepler to update its local model data: https://github.com/sustainable-computing-io/kepler/tree/main/data/model_weight
We also have to think about a CI pipeline for notifying about changes that require corresponding changes and support in the other repos.
For example,
kepler changes metrics (name, labels, values) --> notify kepler-model-server
kepler-model-server changes the model --> notify kepler-model-db to update the model
kepler-model-db updates --> notify kepler to sync
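One way to wire these notifications, sketched under the assumption that we use GitHub's repository_dispatch API (the event name and the PAT secret are made up here):

```yaml
# In kepler, after a metrics change is merged (sketch):
- name: Notify kepler-model-server
  run: |
    curl -X POST \
      -H "Authorization: token ${{ secrets.CROSS_REPO_PAT }}" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/sustainable-computing-io/kepler-model-server/dispatches \
      -d '{"event_type": "kepler-metrics-changed"}'

# In kepler-model-server, a workflow listens for the event (sketch):
on:
  repository_dispatch:
    types: [kepler-metrics-changed]
```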
FYI, a simplified communication diagram between the three repos will be added to the README page by https://github.com/sustainable-computing-io/kepler-model-server/pull/223
Here is my current refactoring design. Most components are done now, except push-pr-to-db. Still, a lot of help is needed.
@sunya-ch , are your latest comments just for kepler and kepler-model-server? Could you please add other projects, such as peaks, into consideration? I am interested in what it will look like when we add peaks into consideration, and how many components we can reuse.
I think we also need people from the peaks project to list their requirements.
We can prepare an action to reuse the integration test, with inputs of kepler image, model_server image, and deployment choice. There are multiple ways to install: 1. by operator, 2. by manifests, 3. by Helm chart. We may need to prepare all of them for the integration test.
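Such an action could be sketched as a composite action; the input names and the deploy script are assumptions:

```yaml
# Composite-action sketch; deploy.sh is a hypothetical helper.
name: kepler-integration-test
inputs:
  kepler_image:
    required: true
  model_server_image:
    required: true
  deploy_method:
    description: "one of: operator, manifests, helm-chart"
    default: manifests
runs:
  using: composite
  steps:
    - name: Deploy with the chosen method and run the integration test
      shell: bash
      run: ./deploy.sh "${{ inputs.deploy_method }}" "${{ inputs.kepler_image }}" "${{ inputs.model_server_image }}"
```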
Some TODO items after reviewing the Kepler CI fix at https://github.com/sustainable-computing-io/kepler/pull/1239
Some ideas for the self-hosted instance repo. IMO, the suggestions below aim at using an Ansible playbook to set up a k8s cluster among 3 EC2 instances created by the self-hosted instance GHA.
Is there any GHA to set up a k8s cluster via Ansible or other CI tools we can reuse, or OCP, container-ready? @rootfs wdyt
Extend local-dev-cluster with the Prometheus operator and Tekton, targeting a specific k8s cluster and decoupled from the kind cluster, so that Tekton can support kepler-model-server.
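On the Ansible idea, the GHA side could stay small if the playbook does the heavy lifting; the inventory and playbook paths below are hypothetical (an existing collection such as Kubespray could serve as the playbook):

```yaml
# Sketch: one workflow step that forms a k8s cluster across the 3 EC2 instances.
- name: Set up k8s across the EC2 instances
  run: ansible-playbook -i inventory/ec2_hosts.ini playbooks/k8s-cluster.yml
```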
@rootfs , @jiere , @sunya-ch wdyt if we have a repo for Kepler validation and kepler-model-server validation? The new repo would contain:
@sunya-ch , @rootfs , @marceloamaral can we use https://github.com/medyagh/setup-minikube to set up minikube for the kepler-model-server training or Kepler validation process instead of kind (k8s in Docker), wdyt? If yes, any volume mount settings?
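If setup-minikube works out, the steps could look like this sketch; whether the training actually needs a host mount, and the mount path, are assumptions:

```yaml
# Sketch using medyagh/setup-minikube; the mount step and path are assumptions.
- uses: medyagh/setup-minikube@latest
  with:
    driver: docker
- name: Mount host data into the cluster (runs in the background)
  run: minikube mount "$PWD/data:/mnt/data" &
```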
@rootfs , @marceloamaral let's sync up on the https://github.com/kubevirt/kubevirt solution for validation here. My question is: since for the model server we use the CPE framework as the workload, what kind of workload are we going to use for validation?
Once https://github.com/sustainable-computing-io/kepler-action/pull/108 has been merged, we will try to use the latest kepler-action to integrate with kepler-model-server.
@SamYuan1990 The CPE framework is now obsolete; we use a Tekton task/pipeline to run the stress-ng workload and then collect the data. The stress workload includes stressing the CPU up to 100% on all cores.
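A minimal Tekton Task along those lines might look like this sketch; the container image and the exact stress-ng parameters are assumptions:

```yaml
# Tekton Task sketch; image name and timeout are assumptions.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: stressng-full-load
spec:
  steps:
    - name: stress-all-cores
      image: quay.io/example/stress-ng:latest   # hypothetical image
      script: |
        # --cpu 0 = use all online CPUs; load each to 100%
        stress-ng --cpu 0 --cpu-load 100 --timeout 120s
```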
@SamYuan1990
Based on the discussion about validating the model, the setup we want to achieve for validation is as follows.
We should break down the tasks in this issue into separate issues to track progress. I created a project for power model validation here: https://github.com/orgs/sustainable-computing-io/projects/6/views/1
As a brainstorm: if we make model training a GitHub action that is just based on Tekton, we could let others contribute their training results to us, since they can run the GitHub action on their own self-hosted GitHub runner, targeting their own k8s cluster with Tekton.
https://github.com/sustainable-computing-io/kepler-model-server/blob/0609df43064742887703d4a509c7718bb1010ae1/.github/workflows/train-model-self-hosted.yml#L142-L178