Closed sunya-ch closed 7 months ago
@sunya-ch does this support deployment of multiple model-servers like it currently allows multiple kepler-internals
May be worth considering is .. if it makes sense to have a separate CRD for KeplerModelServer. This would allow for multiple model servers to be deployed (if that is even a case) with different configs.
kepler-internal
can have a spec.modelserver.ref
to refer to the KeplerModelServer
@sunya-ch does this support deployment of multiple model-servers like it currently allows multiple
kepler-internals
May be worth considering is .. if it makes sense to have a separate CRD for KeplerModelServer. This would allow for multiple model servers to be deployed (if that is even a case) with different configs.
kepler-internal
can have aspec.modelserver.ref
to refer to theKeplerModelServer
Yes, I generate model server name in the same way of the kepler exporter name (based on CR + suffix of model-server
). I think we can keep it together for simplicity of deployment. Each export can connect to different model server. The model server will be created only when it is enabled. But, kepler can specify only model server URL to connect to the other model server (including external).
@sunya-ch Overall the code looks good to me one missing part is the e2e test. Could you please also add a simple e2e test for validating deployment of model server?
@sunya-ch Overall the code looks good to me one missing part is the e2e test. Could you please also add a simple e2e test for validating deployment of model server?
I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?
We can just check the status of KeplerInternal on .status.modelServer.status
and .status.estimator.status
.
See the issue open here: https://github.com/sustainable-computing-io/kepler-operator/issues/314
@sunya-ch
I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?
The current Kepler tests cover almost all the usecase currently supported by kepler-internal, so having a set of tests that replicate what creation of kepler already does gives us only a low ROI.
IMHO, all features should be be accompanied by tests that validate most common usecases. It shouldn't be too hard to add an e2e by making a copy of the existing kepler-e2e.
@sunya-ch ,
annotations:
kepler.sustainable.computing.io/bpf-attach-method: libbpf
that isn't required for kepler-internal
it is only a hack enabled for kepler
so that stable API users have the ability to deploy libbpf image which are kernel agnostic.
@sunya-ch ,
annotations: kepler.sustainable.computing.io/bpf-attach-method: libbpf
that isn't required for
kepler-internal
it is only a hack enabled forkepler
so that stable API users have the ability to deploy libbpf image which are kernel agnostic.
I see. I have it because first I tried to create with specifying the image name but it turns out that it is not allowed for keplerinternal. Just didn't remove it out ;)
@sunya-ch
I cannot see the e2e test for the KeplerInternal CR. I think it would become too big change on this PR to add the e2e test for the keplerinternal. Could that be done by the other PR then I could help add the model server part?
The current Kepler tests cover almost all the usecase currently supported by kepler-internal, so having a set of tests that replicate what creation of kepler already does gives us only a low ROI.
IMHO, all features should be be accompanied by tests that validate most common usecases. It shouldn't be too hard to add an e2e by making a copy of the existing kepler-e2e.
Yes, it might be but I still think it should be on another PR for better track. I can rebase this PR from the PR.
@sunya-ch Hopefull https://github.com/sustainable-computing-io/kepler-operator/pull/325 should help with the model-server testing 👼
convert to draft building on top of the PR https://github.com/sustainable-computing-io/kepler-operator/pull/325. Will rebase with v1alpha1 branch when that PR has merged.
@sthaha Sorry for multiple force-pushed. Most of the main code is not changed the failure comes from my bad code on test case.
Additional change are adding e2e test case and Enabled() function for InternalEstimatorSpec.
@vprashar2929 could you please help validate this on OpenShift ?.. just ensuring the supported usecase of creating a kepler
works is good enough.
Couple of observations while testing this on OpenShift(4.13):
oc get keplerinternals.kepler.system.sustainable.computing.io
NAME PORT DESIRED CURRENT UP-TO-DATE READY AVAILABLE AGE IMAGE ESTIMATOR MODEL-SERVER
mykepler-101 9103 0 0 0 168m quay.io/rh_ee_vprashar/kepler:latest
mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local
openshift-kepler-operator
ns doesn't exist
set NODE_TOTAL_ESTIMATOR to true.
set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.
set NODE_COMPONENTS_ESTIMATOR to true.
set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip.
clean socket
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config: /mnt/download/acpi/AbsPower
cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)"))
get archived model
get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip
try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower
<Response [200]>
load model from config: /mnt/download/rapl/AbsPower
Couple of observations while testing this on OpenShift(4.13):
- When deploying KeplerInternals on OpenShift, there is still an issue with showing the appropriate status of Kepler
oc get keplerinternals.kepler.system.sustainable.computing.io NAME PORT DESIRED CURRENT UP-TO-DATE READY AVAILABLE AGE IMAGE ESTIMATOR MODEL-SERVER mykepler-101 9103 0 0 0 168m quay.io/rh_ee_vprashar/kepler:latest
- When the estimator is deployed along with the model server it checks for the wrong model-server service name:
mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local
openshift-kepler-operator
ns doesn't existset NODE_TOTAL_ESTIMATOR to true. set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip. set NODE_COMPONENTS_ESTIMATOR to true. set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip. clean socket cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca34a02430>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)")) get archived model get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower <Response [200]> load model from config: /mnt/download/acpi/AbsPower cannot make request to http://mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local:8100/model: HTTPConnectionPool(host='mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local', port=8100): Max retries exceeded with url: /model (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fca33e17460>: Failed to resolve 'mykepler-101-model-server-svc.openshift-kepler-operator.svc.cluster.local' ([Errno -2] Name or service not known)")) get archived model get init url https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip try getting archieved model from URL: https://raw.githubusercontent.com/sunya-ch/kepler-model-db/css-117/models/v0.6/css-117/core96_v0.6/rapl/AbsPower/BPFOnly/KNeighborsRegressorTrainer_1.zip for AbsPower <Response [200]> load model from config: /mnt/download/rapl/AbsPower
@vprashar2929 Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.
For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.
Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.
Should we use the namespace from keplerinternal
spec ?
For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.
So when I created keplerinternals I used a different namespace from kepler-operator
AFAIK controller only watches kepler-operator
ns. If you create an instance with the different namespace to kepler-operator
then you won't be able to see the status. This is a known issue and we are planning to fix it by adding config-map #312
@sunya-ch
I see the following difference with the install of kepler using kepler
CRD from v1alpha
branch vs this PR. Ideally we require that optional kepler-internal features do not introduce any difference.
However, I am okay with having this if you could confirm that there is no impact or fixes an issue
@sunya-ch
I see the following difference with the install of kepler using
kepler
CRD fromv1alpha
branch vs this PR. Ideally we require that optional kepler-internal features do not introduce any difference.However, I am okay with having this if you could confirm that there is no impact or fixes an issue
configmap
![]()
Yes, it has no affect because the default is false. Only enable when it is set to true. (reference: https://github.com/sustainable-computing-io/kepler/blob/442bcfe5d3bf26a1285ff0f13d1f5017d28b9e37/pkg/model/model.go#L189)
Thank you. I did see the manual namespace set for model server default URL. Now, I updated it to use the namespace from the kepler spec.
Should we use the namespace from
keplerinternal
spec ?For status not updated, could you share describe result? I cannot find the cause of issue since I can see the updated status on my OpenShift cluster (4.12). From the above result, it seems the issue not only from model server but also the status of the deamonset.
So when I created keplerinternals I used a different namespace from
kepler-operator
AFAIK controller only watcheskepler-operator
ns. If you create an instance with the different namespace tokepler-operator
then you won't be able to see the status. This is a known issue and we are planning to fix it by adding config-map #312
Finally, I guess so, we may allow Kepler to deploy on any namespace for keplerinternal. I put TODO in this PR: https://github.com/sustainable-computing-io/kepler-operator/pull/324/files#diff-435eecb1d2af40ee96747f88ab71eb2758da989c92f76dcb1f714fc1b300c633
We need to have Kepler-operator add the new namespace to the cache. Now, we have to rely on the additional namespace list in the command line. Kepler deployer has to know in advance the namespace list where they are going to allow the keplerinternal to be installed.
This PR replaces https://github.com/sustainable-computing-io/kepler-operator/pull/235 by moving the integration to Kepler-internal API.
Change summary:
KeplerInternalSpec
andKeplerInternalStatus
AddIfNotEmpty
andVolumeFromEmptyDir
utility function (used for Estimator and ModelServer creation)Bug fixes (not related to model server):
UP-TO-DATE
statusKeplerInternal
Here is the CR that I used for running in my local cluster.
KeplerInternal Status
With neither estimator nor modelserver
With only estimator
With estimator+modelserver
Signed-off-by: Sunyanan Choochotkaew sunyanan.choochotkaew1@ibm.com