sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

The Deployment "kepler-model-server" is invalid: spec.template.spec.containers[1].volumeMounts[1].name: Not found: "model-data" #1306

Closed Tobias-Pe closed 5 months ago

Tobias-Pe commented 6 months ago

What happened?

kubectl create doesn't work.

What did you expect to happen?

The autogenerated deployment should work

How can we reproduce it (as minimally and precisely as possible)?

make build-manifest OPTS="PROMETHEUS_DEPLOY HIGH_GRANULARITY ESTIMATOR_SIDECAR_DEPLOY MODEL_SERVER_DEPLOY TRAINER_DEPLOY BM_DEPLOY"

microk8s kubectl create -f _output/generated-manifest/deployment.yaml

-->

namespace/kepler created
serviceaccount/kepler-sa created
role.rbac.authorization.k8s.io/prometheus-k8s created
clusterrole.rbac.authorization.k8s.io/kepler-clusterrole created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/kepler-clusterrole-binding created
configmap/kepler-cfm created
configmap/kepler-model-server-cfm created
secret/redfish-4kh9d7bc7m created
service/kepler-exporter created
service/kepler-model-server created
daemonset.apps/kepler-exporter created
prometheusrule.monitoring.coreos.com/kepler-common-rules created
prometheusrule.monitoring.coreos.com/kepler-high-granularity-rules created
servicemonitor.monitoring.coreos.com/kepler-exporter created
The Deployment "kepler-model-server" is invalid: spec.template.spec.containers[1].volumeMounts[1].name: Not found: "model-data"
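
The validation error above means a container mounts a volume name that the pod spec never declares. A minimal sketch of that check, assuming the manifest has already been loaded into a Python dict; the container names, mount paths, and the trimmed manifest are hypothetical and only mirror the shape of the error:

```python
def find_missing_volumes(deployment: dict) -> list[tuple[str, str]]:
    """Return (container name, volumeMount name) pairs with no matching volume."""
    pod_spec = deployment["spec"]["template"]["spec"]
    declared = {volume["name"] for volume in pod_spec.get("volumes", [])}
    missing = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in declared:
                missing.append((container["name"], mount["name"]))
    return missing


# Hypothetical, trimmed-down manifest: the second container mounts
# "model-data", but no volume with that name is declared.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "kepler-model-server"},
    "spec": {"template": {"spec": {
        "containers": [
            {"name": "server",
             "volumeMounts": [{"name": "cfm", "mountPath": "/etc/kepler"}]},
            {"name": "trainer",
             "volumeMounts": [
                 {"name": "cfm", "mountPath": "/etc/kepler"},
                 {"name": "model-data", "mountPath": "/data"},
             ]},
        ],
        "volumes": [
            {"name": "cfm", "configMap": {"name": "kepler-model-server-cfm"}},
        ],
    }}},
}

print(find_missing_volumes(deployment))  # [('trainer', 'model-data')]
```

Running a check like this on the generated manifest before `kubectl create` would point at the container and mount that need a matching `volumes` entry.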

Anything else we need to know?

No response

Kepler image tag

current master

Kubernetes version

```console
$ kubectl version
Client Version: v1.28.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7
```

Cloud provider or bare metal

bare metal

sunya-ch commented 6 months ago

@Tobias-Pe The kepler-model-server has been significantly refactored since v0.6, and the online trainer (TRAINER_DEPLOY) is currently not supported. The instructions need to be updated. Please check what is currently supported in https://github.com/sustainable-computing-io/kepler-model-server.

Tobias-Pe commented 6 months ago

@sunya-ch So currently I should use this repository to deploy the exporter, and then the one you sent me to deploy the model server with the sidecar service.

Did I get that correctly?

I tried to use the operator, but that didn't work out either :/

Tobias-Pe commented 6 months ago

OK, I tried it now without TRAINER_DEPLOY and it launches.

@sunya-ch

The exporter estimator sidecar constantly reports:

│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[642.3333333333334,0,0,83.66666666666667,0,314660758.6666667,314660758.6666667,830713763.6666666,12722920.333333334,9924.666666666666]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}                            │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[624,0,0,95.33333333333333,0.3333333333333333,167663746.66666666,167663746.66666666,101022270.33333333,4303573.333333333,6950]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}                                    │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[516,0,0,98,0,719015324.3333334,719015324.3333334,692642447.3333334,12332859.666666666,3449.3333333333335]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}                                                        │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[624.3333333333334,0,0,86.66666666666667,0.3333333333333333,985240291.3333334,985240291.3333334,565229407.3333334,16432532.666666666,8262]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}                        │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[655.6666666666666,0,0,79.33333333333333,0,265710233.66666666,265710233.66666666,136125651.66666666,5552210.666666667,9290.666666666666]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}                          │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ get archived model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                                                                                                                                                                                                                                                                          │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip   for AbsPower                                                                                                                                                                                                                                                                                                                  │
│ <Response [404]>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[517,0,0,120.33333333333333,0.3333333333333333,647402143.6666666,647402143.6666666,1791776011.3333333,27562080,6423.333333333333]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}

sunya-ch commented 6 months ago

@Tobias-Pe Thank you for your understanding. The problem is a trailing space after the model URL in the configmap. As a quick local fix, remove the space after the model URL in NODE_TOTAL_INIT_URL using the command below. (Please double-check the namespace you deployed to.)

kubectl edit configmap -n kepler kepler-cfm

This should be fixed by this PR: https://github.com/sustainable-computing-io/kepler/pull/1309

change needed: https://github.com/sustainable-computing-io/kepler/compare/9c03ce063404d571072035396456e82a591a8192..ee5e585f0ba47f59cc53f20599593c7adecafcbe
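
To illustrate the trailing-space problem, here is a minimal sketch that scans configmap-style key/value data for values padded with whitespace. The `config` dict and the helper name `find_padded_values` are hypothetical; only the key NODE_TOTAL_INIT_URL and the model URL come from this thread:

```python
def find_padded_values(config: dict[str, str]) -> dict[str, str]:
    """Map each key whose value has leading/trailing whitespace to its stripped value."""
    return {
        key: value.strip()
        for key, value in config.items()
        if value != value.strip()
    }


MODEL_URL = (
    "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/"
    "main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/"
    "GradientBoostingRegressorTrainer_1.zip"
)

# Hypothetical configmap data; the trailing spaces reproduce the reported problem.
config = {
    "NODE_TOTAL_ESTIMATOR": "true",
    "NODE_TOTAL_INIT_URL": MODEL_URL + "  ",
}

fixes = find_padded_values(config)
for key, clean in fixes.items():
    # repr() makes the otherwise invisible trailing spaces visible.
    print(f"{key}: {config[key]!r} -> {clean!r}")
```

Printing the values with `repr()` is the useful part: the padding is invisible in `kubectl edit`, but it is visible in the log as the extra spaces between `.zip` and `for AbsPower`.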

sunya-ch commented 6 months ago

> @sunya-ch So currently I should use this repository to deploy the exporter, and then the one you sent me to deploy the model server with the sidecar service.
>
> Did I get that correctly?
>
> I tried to use the operator, but that didn't work out either :/

Could you share the error you hit when installing with the Kepler operator? That should be the simplest and most up-to-date way to install.

It must be installed with the KeplerInternal CR. You can use the example below as a template. Also, metric names changed significantly between Kepler v0.6 and v0.7. For v0.7, you may find more model choices here.

apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/instance: kepler
    app.kubernetes.io/part-of: kepler-operator
  name: kepler
spec:
  exporter:
    deployment:
      image: quay.io/sustainable_computing_io/kepler:release-0.6.1-libbpf
      namespace: kepler-operator
  openshift:
    enabled: true
    dashboard:
      enabled: true
  modelServer:
    enabled: true
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
      total:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip

Tobias-Pe commented 6 months ago

When I run:

make tools
make run

an error appears:

./hack/tools.sh kustomize
   ✅ kustomize matching v3.8.7 already installed
{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:14:14Z GoOs:linux GoArch:amd64}
./hack/tools.sh controller-gen
   ✅ controller-gen matching Version: v0.12.1 already installed
Version: v0.12.1
/home/pesla/kepler-operator/tmp/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/..."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa0b38f]

goroutine 210 [running]:
go/types.(*Checker).handleBailout(0xc0020ae600, 0xc000787d40)
        /snap/go/10553/src/go/types/check.go:367 +0x88
panic({0xbc51a0?, 0x12afca0?})
        /snap/go/10553/src/runtime/panic.go:770 +0x132
go/types.(*StdSizes).Sizeof(0x0, {0xdc0038, 0x12b8420})
        /snap/go/10553/src/go/types/sizes.go:228 +0x30f
go/types.(*Config).sizeof(...)
        /snap/go/10553/src/go/types/sizes.go:333
go/types.representableConst.func1({0xdc0038?, 0x12b8420?})
        /snap/go/10553/src/go/types/const.go:76 +0x9e
go/types.representableConst({0xdc63d0, 0x1284520}, 0xc0020ae600, 0x12b8420, 0x0)
        /snap/go/10553/src/go/types/const.go:92 +0x192
go/types.(*Checker).arrayLength(0xc0020ae600, {0xdc46e8, 0xc001b5abc0?})
        /snap/go/10553/src/go/types/typexpr.go:510 +0x2d3
go/types.(*Checker).typInternal(0xc0020ae600, {0xdc2d08, 0xc001b55ef0}, 0x0)
        /snap/go/10553/src/go/types/typexpr.go:299 +0x49d
go/types.(*Checker).definedType(0xc0020ae600, {0xdc2d08, 0xc001b55ef0}, 0xc000787328?)
        /snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).varType(0xc0020ae600, {0xdc2d08, 0xc001b55ef0})
        /snap/go/10553/src/go/types/typexpr.go:145 +0x25
go/types.(*Checker).structType(0xc0020ae600, 0xc0020addd0, 0xc0020addd0?)
        /snap/go/10553/src/go/types/struct.go:113 +0x19f
go/types.(*Checker).typInternal(0xc0020ae600, {0xdc2c78, 0xc001b3f2d8}, 0xc0020b2d20)
        /snap/go/10553/src/go/types/typexpr.go:316 +0x1345
go/types.(*Checker).definedType(0xc0020ae600, {0xdc2c78, 0xc001b3f2d8}, 0xc8ed8b?)
        /snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).typeDecl(0xc0020ae600, 0xc0020b2d20, 0xc001b58ac0, 0x0)
        /snap/go/10553/src/go/types/decl.go:615 +0x44d
go/types.(*Checker).objDecl(0xc0020ae600, {0xdcb9e0, 0xc0020b2d20}, 0x0)
        /snap/go/10553/src/go/types/decl.go:197 +0xa7f
go/types.(*Checker).packageObjects(0xc0020ae600)
        /snap/go/10553/src/go/types/resolver.go:681 +0x425
go/types.(*Checker).checkFiles(0xc0020ae600, {0xc001a99ec0, 0x3, 0x3})
        /snap/go/10553/src/go/types/check.go:408 +0x1a5
go/types.(*Checker).Files(...)
        /snap/go/10553/src/go/types/check.go:372
sigs.k8s.io/controller-tools/pkg/loader.(*loader).typeCheck(0xc0003a7380, 0xc00040aaa0)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:286 +0x36a
sigs.k8s.io/controller-tools/pkg/loader.(*Package).NeedTypesInfo(0xc00040aaa0)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:99 +0x39
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check(0xc000a58a50, 0xc00040aaa0)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:268 +0x2b7
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check.func1(0x4b?)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:262 +0x53
created by sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check in goroutine 109
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:260 +0x1c5
make: *** [Makefile:96: generate] Error 2

Tobias-Pe commented 6 months ago

@sunya-ch

make deploy OPERATOR_IMG=quay.io/sustainable_computing_io/kepler-operator:latest

also results in a runtime error

./hack/tools.sh kustomize
   ✅ kustomize matching v3.8.7 already installed
{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:14:14Z GoOs:linux GoArch:amd64}
./hack/tools.sh controller-gen
   ✅ controller-gen matching Version: v0.12.1 already installed
Version: v0.12.1
/home/pesla/kepler-operator/tmp/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/..."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa0b38f]

goroutine 138 [running]:
go/types.(*Checker).handleBailout(0xc001b9f000, 0xc002129d40)
        /snap/go/10553/src/go/types/check.go:367 +0x88
panic({0xbc51a0?, 0x12afca0?})
        /snap/go/10553/src/runtime/panic.go:770 +0x132
go/types.(*StdSizes).Sizeof(0x0, {0xdc0038, 0x12b8420})
        /snap/go/10553/src/go/types/sizes.go:228 +0x30f
go/types.(*Config).sizeof(...)
        /snap/go/10553/src/go/types/sizes.go:333
go/types.representableConst.func1({0xdc0038?, 0x12b8420?})
        /snap/go/10553/src/go/types/const.go:76 +0x9e
go/types.representableConst({0xdc63d0, 0x1284520}, 0xc001b9f000, 0x12b8420, 0x0)
        /snap/go/10553/src/go/types/const.go:92 +0x192
go/types.(*Checker).arrayLength(0xc001b9f000, {0xdc46e8, 0xc001bc1600?})
        /snap/go/10553/src/go/types/typexpr.go:510 +0x2d3
go/types.(*Checker).typInternal(0xc001b9f000, {0xdc2d08, 0xc001bd1890}, 0x0)
        /snap/go/10553/src/go/types/typexpr.go:299 +0x49d
go/types.(*Checker).definedType(0xc001b9f000, {0xdc2d08, 0xc001bd1890}, 0xc002129328?)
        /snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).varType(0xc001b9f000, {0xdc2d08, 0xc001bd1890})
        /snap/go/10553/src/go/types/typexpr.go:145 +0x25
go/types.(*Checker).structType(0xc001b9f000, 0xc0022a6e10, 0xc0022a6e10?)
        /snap/go/10553/src/go/types/struct.go:113 +0x19f
go/types.(*Checker).typInternal(0xc001b9f000, {0xdc2c78, 0xc001bd4498}, 0xc001bea9b0)
        /snap/go/10553/src/go/types/typexpr.go:316 +0x1345
go/types.(*Checker).definedType(0xc001b9f000, {0xdc2c78, 0xc001bd4498}, 0xc8ed8b?)
        /snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).typeDecl(0xc001b9f000, 0xc001bea9b0, 0xc001bd6440, 0x0)
        /snap/go/10553/src/go/types/decl.go:615 +0x44d
go/types.(*Checker).objDecl(0xc001b9f000, {0xdcb9e0, 0xc001bea9b0}, 0x0)
        /snap/go/10553/src/go/types/decl.go:197 +0xa7f
go/types.(*Checker).packageObjects(0xc001b9f000)
        /snap/go/10553/src/go/types/resolver.go:681 +0x425
go/types.(*Checker).checkFiles(0xc001b9f000, {0xc0017e73e0, 0x3, 0x3})
        /snap/go/10553/src/go/types/check.go:408 +0x1a5
go/types.(*Checker).Files(...)
        /snap/go/10553/src/go/types/check.go:372
sigs.k8s.io/controller-tools/pkg/loader.(*loader).typeCheck(0xc00025d380, 0xc00045e020)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:286 +0x36a
sigs.k8s.io/controller-tools/pkg/loader.(*Package).NeedTypesInfo(0xc00045e020)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:99 +0x39
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check(0xc000b44660, 0xc00045e020)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:268 +0x2b7
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check.func1(0x45?)
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:262 +0x53
created by sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check in goroutine 52
        /home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:260 +0x1c5
make: *** [Makefile:96: generate] Error 2

Tobias-Pe commented 6 months ago

Both the default branch and the Release 0.10.0 branch raise this error.

Tobias-Pe commented 6 months ago

> @Tobias-Pe Thank you for your understanding. The problem is a trailing space after the model URL in the configmap. As a quick local fix, remove the space after the model URL in NODE_TOTAL_INIT_URL using the command below. (Please double-check the namespace you deployed to.)
>
> kubectl edit configmap -n kepler kepler-cfm
>
> This should be fixed by this PR: #1309
>
> change needed: https://github.com/sustainable-computing-io/kepler/compare/9c03ce063404d571072035396456e82a591a8192..ee5e585f0ba47f59cc53f20599593c7adecafcbe

@sunya-ch Great, this one did the trick: I now get a 200 response using the kepler repo (not the operator) with your proposed tags and the config tweak!

The output looks like this:

│ set NODE_COMPONENTS_ESTIMATOR to true.                                                                                                                                                                                                                                               │
│ set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.                                                                              │
│ set NODE_TOTAL_ESTIMATOR to true.                                                                                                                                                                                                                                                    │
│ set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip.                                                                                   │
│ clean socket                                                                                                                                                                                                                                                                         │
│ get archived model                                                                                                                                                                                                                                                                   │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip                                                                                                  │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower                                                            │
│ <Response [200]>                                                                                                                                                                                                                                                                     │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[0,0,0,0,0,0,0,0,0,0]],"output_type":"AbsPower","sour │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"                                                                                                                                             │
│                                                                                                                                                                                                                                                                                      │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"

Is it normal that it constantly fails to predict?

The model server output looks like this currently:


│ 2024-03-20T18:56:55.464098822Z 10.1.194.163 - - [20/Mar/2024 18:56:55] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:56:57.530504039Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:57.532526857Z 10.1.28.241 - - [20/Mar/2024 18:56:57] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                   │
│ 2024-03-20T18:56:58.076840641Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:58.078481004Z 10.1.51.77 - - [20/Mar/2024 18:56:58] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:56:58.474284952Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:58.475664337Z 10.1.194.163 - - [20/Mar/2024 18:56:58] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:01.082462471Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:01.083728960Z 10.1.51.77 - - [20/Mar/2024 18:57:01] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:01.472447436Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:01.472915576Z 10.1.194.163 - - [20/Mar/2024 18:57:01] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:04.073526318Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:04.074460195Z 10.1.51.77 - - [20/Mar/2024 18:57:04] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:04.475021501Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:04.476509315Z 10.1.194.163 - - [20/Mar/2024 18:57:04] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:06.381812892Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:06.381849979Z 10.1.223.112 - - [20/Mar/2024 18:57:06] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:07.087097512Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:07.088653211Z 10.1.51.77 - - [20/Mar/2024 18:57:07] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:07.473618960Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:07.474868034Z 10.1.194.163 - - [20/Mar/2024 18:57:07] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:10.121285778Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:10.121316798Z 10.1.51.77 - - [20/Mar/2024 18:57:10] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:10.475058495Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:10.476509505Z 10.1.194.163 - - [20/Mar/2024 18:57:10] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:13.072093731Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:13.072920807Z 10.1.51.77 - - [20/Mar/2024 18:57:13] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:13.468112201Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:13.469282490Z 10.1.194.163 - - [20/Mar/2024 18:57:13] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:15.777616891Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:15.777991780Z 10.1.28.242 - - [20/Mar/2024 18:57:15] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                   │
│ 2024-03-20T18:57:16.071828055Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:16.073108087Z 10.1.51.77 - - [20/Mar/2024 18:57:16] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:16.468917521Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:16.469290678Z 10.1.194.163 - - [20/Mar/2024 18:57:16] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:19.073830740Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:19.075077697Z 10.1.51.77 - - [20/Mar/2024 18:57:19] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:19.465202820Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:19.465789186Z 10.1.194.163 - - [20/Mar/2024 18:57:19] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:22.471949272Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:22.472526374Z 10.1.194.163 - - [20/Mar/2024 18:57:22] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:25.465108094Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:25.465482514Z 10.1.194.163 - - [20/Mar/2024 18:57:25] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:28.066184834Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:28.066621562Z 10.1.221.178 - - [20/Mar/2024 18:57:28] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:28.476831709Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:28.478344292Z 10.1.194.163 - - [20/Mar/2024 18:57:28] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                  │
│ 2024-03-20T18:57:38.246367658Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:38.247878462Z 10.1.51.79 - - [20/Mar/2024 18:57:38] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                    │
│ 2024-03-20T18:57:45.291626589Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:45.292329241Z 10.1.194.162 - - [20/Mar/2024 18:57:45] "POST /model HTTP/1.1" 400 -

sunya-ch commented 5 months ago

bpf_cpu_time_us

@Tobias-Pe There is a version conflict between the power model and the Kepler metric exporter. The metric bpf_cpu_time_us (v0.6) was renamed to bpf_cpu_time_ms (v0.7); see the PRs https://github.com/sustainable-computing-io/kepler/pull/1214 and https://github.com/sustainable-computing-io/kepler-model-server/pull/227. It seems you are installing Kepler version 0.7 (the make deploy in the kepler-model-server deployment uses 0.7), so you must use a power model from kepler-model-db v0.7 (check your node type: https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.7/specpower/README.md).
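To illustrate the mismatch, here is a minimal sketch (not the actual kepler-model-server code; `missing_features` is a hypothetical helper). It compares the feature name from the error message with the metric names from the request in the logs above. The assumption is that the estimator fails because the model asks for a column that the request does not contain.

```python
def missing_features(model_features, request_metrics):
    """Return the features the model expects but the request does not provide."""
    provided = set(request_metrics)
    return [name for name in model_features if name not in provided]


# Feature named in the "fail to predict" error (v0.6 naming, microseconds).
v06_model_features = ["bpf_cpu_time_us"]

# Metrics sent by the v0.7 exporter, copied from the request in the logs.
v07_request_metrics = [
    "bpf_cpu_time_ms", "bpf_page_cache_hit", "bpf_net_tx_irq",
    "bpf_net_rx_irq", "bpf_block_irq", "cpu_cycles", "cpu_ref_cycles",
    "cpu_instructions", "cache_miss", "task_clock_ms",
]

missing = missing_features(v06_model_features, v07_request_metrics)
print(missing)  # ['bpf_cpu_time_us']
```

A model built for v0.7 would ask for bpf_cpu_time_ms instead, and the list of missing features would be empty.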

@sthaha Do you have any idea about the operator deployment error above?

sthaha commented 5 months ago

@Tobias-Pe, from the logs:

@sunya-ch

make deploy OPERATOR_IMG=quay.io/sustainable_computing_io/kepler-operator:latest

also results in a runtime error

./hack/tools.sh kustomize
   ✅ kustomize matching v3.8.7 already installed
{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:14:14Z GoOs:linux GoArch:amd64}
./hack/tools.sh controller-gen
   ✅ controller-gen matching Version: v0.12.1 already installed
Version: v0.12.1
/home/pesla/kepler-operator/tmp/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/..."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa0b38f]

I see that controller-gen (the tool used to generate code) seems to have crashed, and that this happened during the build phase of the operator rather than while the operator was running. So I suspect something is wrong with the setup of the tools or golang.

As for make deploy: unlike in the past, it no longer works since a webhook was added to the operator, because the webhook requires certs to be mounted. On OpenShift (or with OLM installed), these certs are created automatically by OLM and do not require cert-manager.

It would be a great contribution to the operator project if someone has the bandwidth to fix this. Most of the cert-manager config is already checked into the project.

Tobias-Pe commented 5 months ago

@sthaha

I am using a multi-node microk8s cluster, which is pretty close to the vanilla setup, without any certificate management.

Is it some add-on?

sthaha commented 5 months ago

Information about cert-manager can be found here - https://cert-manager.io/docs/installation/kubectl/

You can find information about using cert-manager with Kubernetes webhooks here - https://book.kubebuilder.io/cronjob-tutorial/cert-manager

Enabling cert-manager for kepler-operator is configured here - https://github.com/sustainable-computing-io/kepler-operator/blob/v1alpha1/config/default/kustomization.yaml#L22

I haven't tried this out myself, but I think it may be worth copying config/default to config/k8s and then configuring it to use cert-manager. Running kustomize on config/k8s should then produce all the resources required to install the operator on vanilla k8s with cert-manager installed.

sunya-ch commented 5 months ago

Regarding the model, I'm now working on the inconsistent model versions. You can track the progress in this issue: https://github.com/sustainable-computing-io/kepler-model-server/issues/242
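Until that is resolved, one way to spot such an inconsistency is to compare the version segment of the model URLs that the estimator and model server print at startup. A rough sketch, assuming the kepler-model-db URLs keep the `/models/vX.Y/` layout seen in this thread (`model_db_version` is a hypothetical helper, not part of Kepler):

```python
import re


def model_db_version(url):
    """Extract the 'vX.Y' path segment that follows '/models/' in a URL, or None."""
    match = re.search(r"/models/(v\d+\.\d+)/", url)
    return match.group(1) if match else None


# URLs copied from the logs in this thread.
urls = [
    "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip",
    "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip",
]

versions = {model_db_version(url) for url in urls}
if len(versions) > 1:
    print("inconsistent model versions:", sorted(versions))
```

If the set contains more than one version, some component is still loading a model from a different kepler-model-db release than the others.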

The following PR should fix the compatibility issue:

rootfs commented 5 months ago

Closing this for now. If @Tobias-Pe has any follow-up issues, we will reopen it.

Tobias-Pe commented 5 months ago

@sunya-ch

Estimator:

│ set NODE_COMPONENTS_ESTIMATOR to true.                                                                                                                                                                                                                                               │
│ set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/ec2/intel_rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip.                                                                                  │
│ set NODE_TOTAL_ESTIMATOR to true.                                                                                                                                                                                                                                                    │
│ set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip.                                                                                       │
│ clean socket                                                                                                                                                                                                                                                                         │
│ get archived model                                                                                                                                                                                                                                                                   │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip                                                                                                      │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip for AbsPower                                                                │
│ <Response [200]>                                                                                                                                                                                                                                                                     │
│ load model from config:  /mnt/download/acpi/AbsPower                                                                                                                                                                                                                                 │
│                                                                                                                                                                                                                                                                                      │
│

Model Server:

│ try downloading archieved pipeline from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6.zip                                                                                                                           │
│ <Response [200]>                                                                                                                                                                                                                                                                     │
│ initial pipeline is loaded to /mnt/models/default                                                                                                                                                                                                                                    │
│  * Serving Flask app 'model_server' (lazy loading)                                                                                                                                                                                                                                   │
│  * Environment: production                                                                                                                                                                                                                                                           │
│    WARNING: This is a development server. Do not use it in a production deployment.                                                                                                                                                                                                  │
│    Use a production WSGI server instead.                                                                                                                                                                                                                                             │
│  * Debug mode: off                                                                                                                                                                                                                                                                   │
│ WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.                                                                                                                                                               │
│  * Running on all addresses (0.0.0.0)                                                                                                                                                                                                                                                │
│  * Running on http://127.0.0.1:8100                                                                                                                                                                                                                                                  │
│  * Running on http://10.1.221.171:8100                                                                                                                                                                                                                                               │
│ Press CTRL+C to quit                                                                                                                                                                                                                                                                 │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi',  │
│ 10.1.194.132 - - [08/Apr/2024 06:50:35] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                                                 │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_cycles', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi',  │
│ 10.1.223.75 - - [08/Apr/2024 06:50:36] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                                                  │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi',  │
│ 10.1.51.91 - - [08/Apr/2024 06:50:37] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                                                   │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi',  │
│ 10.1.28.211 - - [08/Apr/2024 06:50:40] "POST /model HTTP/1.1" 400 -                                                                                                                                                                                                                  │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi',  │
│ 10.1.221.170 - - [08/Apr/2024 06:50:51] "POST /model HTTP/1.1" 400 -

If I recall the logs correctly, the model server is still trying to fetch version 0.6.
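
To double-check which version is being requested, the download URL from the log can be parsed directly. This is only a minimal sketch: `model_version_from_url` is a hypothetical helper written for this comment, not part of Kepler or the model server, and it assumes the version is the path segment that follows `/models/`.

```python
import re

# URL copied verbatim from the model server log above.
PIPELINE_URL = (
    "https://raw.githubusercontent.com/sustainable-computing-io/"
    "kepler-model-db/main/models/v0.6/nx12/std_v0.6.zip"
)


def model_version_from_url(url):
    """Return the 'vX.Y' path segment after '/models/', or None if absent.

    Assumption: the model-db layout puts the version directly after '/models/'.
    """
    match = re.search(r"/models/(v\d+(?:\.\d+)*)/", url)
    return match.group(1) if match else None


print(model_version_from_url(PIPELINE_URL))  # prints: v0.6
```

Running the same check against the URL logged after a config change would show whether the model server picked up a different version.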