sustainable-computing-io / kepler-operator

Kepler Operator
Apache License 2.0

update model server support #235

Closed sunya-ch closed 9 months ago

sunya-ch commented 1 year ago

This PR updates model server support for the v0.6 release, as discussed in https://github.com/sustainable-computing-io/kepler-operator/issues/232.

API doc: https://github.com/sustainable-computing-io/kepler-operator/blob/c15c77621958cc79d1921d9af378915158abc4ca/docs/api.md

The PR contains changes on:

Note that the placeholder for setting filters and the model name is here in kepler: https://github.com/sustainable-computing-io/kepler/blob/73cb11fb963f425013cf7f03f214c8f8b85c7853/pkg/config/config.go#L390. However, how it should be used has not been decided yet, so it is not supported end to end.

Example configmap change from a full deployment on OpenShift on IBM Cloud (kepler CR: config/samples/kepler_full_deploy.yaml):

> oc get configmap -n openshift-kepler-operator model-server-cm -oyaml
...
  MODEL_CONFIG: |
    NODE_COMPONENTS_ESTIMATOR=true
    NODE_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip
  MODEL_SERVER_ENABLE: "true"
  MODEL_SERVER_URL: http://model-server-svc.openshift-kepler-operator.svc.cluster.local:8100
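
The MODEL_CONFIG entry above is a block of newline-separated KEY=VALUE pairs. As a rough illustration of how a consumer might read it, here is a minimal Python sketch. The function name `parse_model_config` is hypothetical, and this is not taken from the kepler or estimator code; it only assumes the format visible in the configmap output.

```python
from typing import Dict


def parse_model_config(raw: str) -> Dict[str, str]:
    """Parse newline-separated KEY=VALUE pairs (hypothetical helper)."""
    config: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {line!r}")
        config[key.strip()] = value.strip()
    return config


# Values copied from the configmap output above.
MODEL_CONFIG = (
    "NODE_COMPONENTS_ESTIMATOR=true\n"
    "NODE_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/"
    "sustainable-computing-io/kepler-model-db/main/models/"
    "Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/"
    "GradientBoostingRegressorTrainer_1.zip\n"
)

if __name__ == "__main__":
    for key, value in parse_model_config(MODEL_CONFIG).items():
        print(f"{key} -> {value}")
```

The two keys it yields are the same ones the estimator log reports as being set.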

Resources:

> oc get -n openshift-kepler-operator all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-48lff               2/2     Running   0          4m10s
pod/kepler-exporter-ds-4nj8g               2/2     Running   0          4m10s
pod/kepler-exporter-ds-5lj62               2/2     Running   0          4m10s
pod/kepler-exporter-ds-9rqlv               2/2     Running   0          4m10s
pod/kepler-exporter-ds-knnsh               2/2     Running   0          4m10s
pod/kepler-exporter-ds-kskmq               2/2     Running   0          4m11s
pod/kepler-exporter-ds-szl8w               2/2     Running   0          2m49s
pod/model-server-deploy-85fd7b8c6d-wkkrg   1/1     Running   0          4m10s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   4m18s
service/model-server-svc      ClusterIP   None         <none>        8100/TCP   4m17s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   4m19s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/model-server-deploy   1/1     1            1           4m18s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/model-server-deploy-85fd7b8c6d   1         1         1       4m18s

Exporter log:

> oc logs -n openshift-kepler-operator kepler-exporter-ds-szl8w estimator
I0914 08:42:05.585412 3734085 node_component_energy.go:54] Using the EstimatorSidecar/AbsPower Power Model to estimate Node Component Power

Estimator log:

> oc logs -n openshift-kepler-operator kepler-exporter-ds-szl8w estimator
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip.
clean socket
load model from model server:  /mnt/download/rapl/AbsPower

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>

sunya-ch commented 1 year ago

Please feel free to review the design first. I'm working on fixing the code bugs (stringvars flag, missing rbac, ...). I will amend the commit once I have confirmed that all deployment choices work, at least on my cluster.

sunya-ch commented 1 year ago

The critical deployment issue is now fixed.

However, please allow me to leave some issues open (I need help from others to fix them in other PRs):

sthaha commented 1 year ago

@sunya-ch Thanks a lot for adding the feature 🤗 Please allow us some time to go through the feature implementation. My first focus will be on spec.modelServer, to ensure that we expose only the minimal API.

sthaha commented 1 year ago

@sunya-ch, thanks a lot for adding this feature 🙇. You can ignore most of the comments in the review; let's focus on getting the spec.modelServer and spec.estimator parts down to the minimal required configuration. We should be able to make assumptions about the model server that is deployed, and thus may not need all the configuration options currently in place.

We also need e2e tests to validate the most common configurations and scenarios ... The Kepler status update should also take the status of these deployments into account.

Any thoughts on having both model-server and estimator disabled by default? cc: @sunya-ch @rootfs @piparul ?

sunya-ch commented 1 year ago

@sthaha Thank you so much for the review. I made most of the changes according to your review. Where my change differs slightly from your suggestion, I left a comment below the corresponding review comment.

> Any thoughts on having both model-server and estimator disabled by default?

Both should be disabled by default, except when ModelServerSpec is defined. If any value in this section is set, we should assume a local model server by default (that is, enable the model server). We are also open to a remote model server: the user can set it to disabled and provide the target URL and port of the remote server.
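
The defaulting rule described here can be sketched in a few lines of Python. This is only an illustration of the proposal, not the operator's implementation: the `ModelServerSpec` fields (`enabled`, `url`, `port`) and the function `resolve_model_server` are hypothetical names, and treating "disabled with no URL" as fully disabled is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelServerSpec:
    # Hypothetical fields; None means "not set by the user".
    enabled: Optional[bool] = None
    url: Optional[str] = None
    port: Optional[int] = None


def resolve_model_server(spec: Optional[ModelServerSpec]) -> str:
    """Return "disabled", "local", or "remote" for a given spec."""
    if spec is None:
        # No modelServer section at all: disabled by default.
        return "disabled"
    if spec.enabled is False:
        # Explicitly disabled: use a remote server only if a URL is given
        # (assumption: without a URL there is nothing to connect to).
        return "remote" if spec.url else "disabled"
    # Section present and not explicitly disabled: deploy a local server.
    return "local"


if __name__ == "__main__":
    print(resolve_model_server(None))
    print(resolve_model_server(ModelServerSpec()))
    print(resolve_model_server(
        ModelServerSpec(enabled=False, url="http://models.example.com", port=8100)
    ))
```

Using `None` for unset fields keeps "the user did not say" distinct from "the user said false", which is what the rule depends on.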

sunya-ch commented 1 year ago

I updated the code for the review comments that I marked with the icon.

Here are example deployments.

Minimum deployment:

spec:
  exporter:
    deployment:
      port: 9103

> oc get -n openshift-kepler-operator all
NAME                           READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-d4ctn   1/1     Running   0          11s
pod/kepler-exporter-ds-fd5xt   1/1     Running   0          11s
pod/kepler-exporter-ds-fzjk7   1/1     Running   0          11s
pod/kepler-exporter-ds-n46xf   1/1     Running   0          11s
pod/kepler-exporter-ds-nthsc   1/1     Running   0          11s
pod/kepler-exporter-ds-qm7p4   1/1     Running   0          11s
pod/kepler-exporter-ds-s5t48   1/1     Running   0          11s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   11s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   11s

With estimator only:

spec:
  exporter:
    deployment:
      port: 9103
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip

NAME                           READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-5g5kk   2/2     Running   0          16s
pod/kepler-exporter-ds-7tg9j   2/2     Running   0          16s
pod/kepler-exporter-ds-fh4f2   2/2     Running   0          16s
pod/kepler-exporter-ds-fqdnf   2/2     Running   0          16s
pod/kepler-exporter-ds-lgfwx   2/2     Running   0          16s
pod/kepler-exporter-ds-nthhd   2/2     Running   0          16s
pod/kepler-exporter-ds-pgrl6   2/2     Running   0          16s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   16s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   17s

Full deployment:

spec:
  exporter:
    deployment:
      port: 9103
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip
  modelServer:
    enabled: true

> oc get all -n openshift-kepler-operator
NAME                                       READY   STATUS    RESTARTS   AGE
pod/kepler-exporter-ds-4bsnt               2/2     Running   0          4m48s
pod/kepler-exporter-ds-679tv               2/2     Running   0          4m48s
pod/kepler-exporter-ds-6cmkf               2/2     Running   0          4m48s
pod/kepler-exporter-ds-9ltv4               2/2     Running   0          4m49s
pod/kepler-exporter-ds-c6wnl               2/2     Running   0          4m48s
pod/kepler-exporter-ds-f2l9z               2/2     Running   0          4m49s
pod/kepler-exporter-ds-z5wkg               2/2     Running   0          2m55s
pod/model-server-deploy-85fd7b8c6d-9dwz9   1/1     Running   0          4m49s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/kepler-exporter-svc   ClusterIP   None         <none>        9103/TCP   4m50s
service/model-server-svc      ClusterIP   None         <none>        8100/TCP   4m50s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kepler-exporter-ds   7         7         7       7            7           kubernetes.io/os=linux   4m50s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/model-server-deploy   1/1     1            1           4m50s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/model-server-deploy-85fd7b8c6d   1         1         1       4m50s
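
To relate the full deployment spec to the configmap shown earlier in this PR, the following Python sketch renders the configmap entries from a spec dictionary. It is a guess at the mapping based only on the sample spec and the sample configmap output: the function `render_model_config` is hypothetical, and mapping `sidecar: true` to `NODE_COMPONENTS_ESTIMATOR=true` is an inference, not confirmed operator behavior. The service name and port 8100 are taken from the resource listings.

```python
from typing import Any, Dict


def render_model_config(spec: Dict[str, Any], namespace: str) -> Dict[str, str]:
    """Render configmap data from a Kepler CR spec (hypothetical mapping)."""
    data: Dict[str, str] = {}
    components = spec.get("estimator", {}).get("node", {}).get("components", {})

    lines = []
    if components.get("sidecar"):
        lines.append("NODE_COMPONENTS_ESTIMATOR=true")
    if components.get("initUrl"):
        lines.append(f"NODE_COMPONENTS_INIT_URL={components['initUrl']}")
    if lines:
        data["MODEL_CONFIG"] = "\n".join(lines) + "\n"

    if spec.get("modelServer", {}).get("enabled"):
        data["MODEL_SERVER_ENABLE"] = "true"
        # In-cluster DNS name of the model server service, port 8100.
        data["MODEL_SERVER_URL"] = (
            f"http://model-server-svc.{namespace}.svc.cluster.local:8100"
        )
    return data


# The full deployment spec from above.
FULL_SPEC = {
    "exporter": {"deployment": {"port": 9103}},
    "estimator": {"node": {"components": {
        "sidecar": True,
        "initUrl": "https://raw.githubusercontent.com/sustainable-computing-io/"
                   "kepler-model-db/main/models/Linux-4.15.0-213-generic-x86_64_v0.6/"
                   "rapl/AbsPower/KubeletOnly/GradientBoostingRegressorTrainer_1.zip",
    }}},
    "modelServer": {"enabled": True},
}

if __name__ == "__main__":
    for key, value in render_model_config(FULL_SPEC, "openshift-kepler-operator").items():
        print(f"{key}: {value!r}")
```

With the minimum deployment spec (exporter only), the same function returns an empty mapping, matching the case where neither the estimator sidecar nor the model server is deployed.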

sunya-ch commented 9 months ago

Moved to https://github.com/sustainable-computing-io/kepler-operator/pull/322