add model-server/estimator to KeplerInternal

sunya-ch commented 7 months ago

This PR replaces https://github.com/sustainable-computing-io/kepler-operator/pull/235 by moving the integration to Kepler-internal API.

Change summary:

Add Estimator and ModelServer in KeplerInternalSpec and KeplerInternalStatus
Add components/estimator
Implement components/model-server when enabled
Add Model Server Reconcilers to kepler-internal Reconciler
Modify components/exporter to add estimator sidecar if set
Add access Role for deployments and persistentvolumeclaims
Add AddIfNotEmpty and VolumeFromEmptyDir utility function (used for Estimator and ModelServer creation)

Bug fixes (not related to model server):

Correct reference for UP-TO-DATE status
Replace ki.namespace in updateStatus with the deployment namespace from spec.

KeplerInternal

Here is the CR that I used for running in my local cluster.

apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/instance: kepler
    app.kubernetes.io/part-of: kepler-operator
  name: kepler
spec:
  exporter:
    deployment:
      image: quay.io/sustainable_computing_io/kepler:release-0.6.1-libbpf
      namespace: kepler-operator
  openshift:
    enabled: true
    dashboard:
      enabled: true
  modelServer:
    enabled: true
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
      total:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip

KeplerInternal Status

With neither estimator nor modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           83s   <abbr>  NotInstalled   NotInstalled

With only estimator

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE    READY   AVAILABLE   AGE   IMAGE   ESTIMATOR      MODEL-SERVER
kepler   9103   7         7         7             7       7           40m   <abbr>  Running        NotInstalled

With estimator+modelserver

NAME     PORT   DESIRED   CURRENT   UP-TO-DATE     READY   AVAILABLE   AGE   IMAGE    ESTIMATOR     MODEL-SERVER
kepler   9103   7         7         7              7       7           15s   <abbr>   Running        Running

Signed-off-by: Sunyanan Choochotkaew sunyanan.choochotkaew1@ibm.com

sthaha commented 7 months ago

@sunya-ch does this support deployment of multiple model-servers like it currently allows multiple kepler-internals

May be worth considering is .. if it makes sense to have a separate CRD for KeplerModelServer. This would allow for multiple model servers to be deployed (if that is even a case) with different configs. kepler-internal can have a spec.modelserver.ref to refer to the KeplerModelServer

sunya-ch commented 7 months ago

@sunya-ch does this support deployment of multiple model-servers like it currently allows multiple kepler-internals

May be worth considering is .. if it makes sense to have a separate CRD for KeplerModelServer. This would allow for multiple model servers to be deployed (if that is even a case) with different configs. kepler-internal can have a spec.modelserver.ref to refer to the KeplerModelServer

Yes, I generate model server name in the same way of the kepler exporter name (based on CR + suffix of model-server). I think we can keep it together for simplicity of deployment. Each export can connect to different model server. The model server will be created only when it is enabled. But, kepler can specify only model server URL to connect to the other model server (including external).

sthaha commented 7 months ago

@sunya-ch Overall the code looks good to me one missing part is the e2e test. Could you please also add a simple e2e test for validating deployment of model server?

sunya-ch commented 7 months ago

@sunya-ch Overall the code looks good to me one missing part is the e2e test. Could you please also add a simple e2e test for validating deployment of model server?

I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part? We can just check the status of KeplerInternal on .status.modelServer.status and .status.estimator.status.

See the issue open here: https://github.com/sustainable-computing-io/kepler-operator/issues/314

sthaha commented 7 months ago

@sunya-ch

I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?

The current Kepler tests cover almost all the usecase currently supported by kepler-internal, so having a set of tests that replicate what creation of kepler already does gives us only a low ROI.

IMHO, all features should be be accompanied by tests that validate most common usecases. It shouldn't be too hard to add an e2e by making a copy of the existing kepler-e2e.

sthaha commented 7 months ago

@sunya-ch ,

 annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf

that isn't required for kepler-internal it is only a hack enabled for kepler so that stable API users have the ability to deploy libbpf image which are kernel agnostic.

sunya-ch commented 7 months ago

@sunya-ch ,
 annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
that isn't required for kepler-internal it is only a hack enabled for kepler so that stable API users have the ability to deploy libbpf image which are kernel agnostic.

I see. I have it because first I tried to create with specifying the image name but it turns out that it is not allowed for keplerinternal. Just didn't remove it out ;)

sunya-ch commented 7 months ago

@sunya-ch

I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?

The current Kepler tests cover almost all the usecase currently supported by kepler-internal, so having a set of tests that replicate what creation of kepler already does gives us only a low ROI.

IMHO, all features should be be accompanied by tests that validate most common usecases. It shouldn't be too hard to add an e2e by making a copy of the existing kepler-e2e.

Yes, it might be but I still think it should be on another PR for better track. I can rebase this PR from the PR.

sthaha commented 7 months ago

@sunya-ch Hopefull https://github.com/sustainable-computing-io/kepler-operator/pull/325 should help with the model-server testing 👼

sunya-ch commented 7 months ago

convert to draft building on top of the PR https://github.com/sustainable-computing-io/kepler-operator/pull/325. Will rebase with v1alpha1 branch when that PR has merged.

sunya-ch commented 7 months ago

@sthaha Sorry for multiple force-pushed. Most of the main code is not changed the failure comes from my bad code on test case.

Additional change are adding e2e test case and Enabled() function for InternalEstimatorSpec.

https://github.com/sustainable-computing-io/kepler-operator/blob/a4177f9db47817b9a78028e129d37fbbdeb65dfa/pkg/api/v1alpha1/kepler_internal_types.go#L108

sthaha commented 7 months ago

@vprashar2929 could you please help validate this on OpenShift ?.. just ensuring the supported usecase of creating a kepler works is good enough.

vprashar2929 commented 7 months ago

Couple of observations while testing this on OpenShift(4.13):

When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler

oc get keplerinternals.kepler.system.sustainable.computing.io                                                                               
NAME           PORT   DESIRED   CURRENT   UP-TO-DATE   READY   AVAILABLE   AGE    IMAGE                                  ESTIMATOR   MODEL-SERVER
mykepler-101   9103   0         0                      0                   168m   quay.io/rh_ee_vprashar/kepler:latest

When the estimator is deployed along with the model server it checks for the wrong model-server service name: mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local openshift-kepler-operator ns doesn't exist

set NODE_TOTAL_ESTIMATOR to true.
set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip.
clean socket
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/acpi/AbsPower
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/rapl/AbsPower

sunya-ch commented 7 months ago

Couple of observations while testing this on OpenShift(4.13):

When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler

oc get keplerinternals.kepler.system.sustainable.computing.io                                                                               
NAME           PORT   DESIRED   CURRENT   UP-TO-DATE   READY   AVAILABLE   AGE    IMAGE                                  ESTIMATOR   MODEL-SERVER
mykepler-101   9103   0         0                      0                   168m   quay.io/rh_ee_vprashar/kepler:latest

When the estimator is deployed along with the model server it checks for the wrong model-server service name: mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local openshift-kepler-operator ns doesn't exist

set NODE_TOTAL_ESTIMATOR to true.
set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip.
clean socket
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/acpi/AbsPower
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config:  /mnt/download/rapl/AbsPower

@vprashar2929 Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.

For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.

vprashar2929 commented 7 months ago

Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.

Should we use the namespace from keplerinternal spec ?

For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.

So when I created keplerinternals I used a different namespace from kepler-operator AFAIK controller only watches kepler-operator ns. If you create an instance with the different namespace to kepler-operator then you won't be able to see the status. This is a known issue and we are planning to fix it by adding config-map #312

sthaha commented 7 months ago

@sunya-ch

I see the following difference with the install of kepler using kepler CRD from v1alpha branch vs this PR. Ideally we require that optional kepler-internal features do not introduce any difference.

However, I am okay with having this if you could confirm that there is no impact or fixes an issue

configmap

sunya-ch commented 7 months ago

@sunya-ch

I see the following difference with the install of kepler using kepler CRD from v1alpha branch vs this PR. Ideally we require that optional kepler-internal features do not introduce any difference.

However, I am okay with having this if you could confirm that there is no impact or fixes an issue

configmap

Yes, it has no affect because the default is false. Only enable when it is set to true. (reference: https://github.com/sustainable-computing-io/kepler/blob/442bcfe5d3bf26a1285ff0f13d1f5017d28b9e37/pkg/model/model.go#L189)

sunya-ch commented 7 months ago

Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.

Should we use the namespace from keplerinternal spec ?

For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.

So when I created keplerinternals I used a different namespace from kepler-operator AFAIK controller only watches kepler-operator ns. If you create an instance with the different namespace to kepler-operator then you won't be able to see the status. This is a known issue and we are planning to fix it by adding config-map #312

Finally, I guess so, we may allow Kepler to deploy on any namespace for keplerinternal. I put TODO in this PR: https://github.com/sustainable-computing-io/kepler-operator/pull/324/files#diff-435eecb1d2af40ee96747f88ab71eb2758da989c92f76dcb1f714fc1b300c633

We need to have Kepler-operator add the new namespace to the cache. Now, we have to rely on the additional namespace list in the command line. Kepler deployer has to know in advance the namespace list where they are going to allow the keplerinternal to be installed.

https://github.com/sustainable-computing-io/kepler-operator/blob/73505af1859d9f3f097307a2e035d2f2f33e7ff1/cmd/manager/main.go#L90

sustainable-computing-io / kepler-operator