Closed Tobias-Pe closed 5 months ago
@Tobias-Pe The kepler-model-server has been significantly refactored since v0.6, and the online trainer (TRAINER_DEPLOY) is currently not supported. The instructions need to be updated. Please check the current support in https://github.com/sustainable-computing-io/kepler-model-server.
@sunya-ch So currently I should use this repository to deploy the exporter, and then the one that you sent me to deploy the model server with the sidecar service.
Did I get that correctly?
I tried to use the operator, but that didn't work out either :/
OK, I tried now without TRAINER_DEPLOY and it launches.
@sunya-ch
The exporter estimator sidecar constantly reports:
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[642.3333333333334,0,0,83.66666666666667,0,314660758.6666667,314660758.6666667,830713763.6666666,12722920.333333334,9924.666666666666]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""} │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[624,0,0,95.33333333333333,0.3333333333333333,167663746.66666666,167663746.66666666,101022270.33333333,4303573.333333333,6950]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""} │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[516,0,0,98,0,719015324.3333334,719015324.3333334,692642447.3333334,12332859.666666666,3449.3333333333335]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""} │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[624.3333333333334,0,0,86.66666666666667,0.3333333333333333,985240291.3333334,985240291.3333334,565229407.3333334,16432532.666666666,8262]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""} │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[655.6666666666666,0,0,79.33333333333333,0,265710233.66666666,265710233.66666666,136125651.66666666,5552210.666666667,9290.666666666666]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""} │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [404]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[517,0,0,120.33333333333333,0.3333333333333333,647402143.6666666,647402143.6666666,1791776011.3333333,27562080,6423.333333333333]],"output_type":"AbsPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"","filter":""}
@Tobias-Pe Thank you for your understanding. The issue is a space after the model URL in the configmap. As a quick local fix, remove the space after the model URL in NODE_TOTAL_INIT_URL using the command below (please double-check the namespace you deployed to):
kubectl edit configmap -n kepler kepler-cfm
This should be fixed by this PR: https://github.com/sustainable-computing-io/kepler/pull/1309
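To illustrate why a single space matters here: if the configured value is used as-is, the stray whitespace presumably becomes part of the requested path, which would explain the 404 responses above. A minimal sketch, assuming a hypothetical helper `clean_init_url` (not part of Kepler) that normalizes a configmap value before it is used:

```python
from urllib.parse import urlsplit


def clean_init_url(raw_value: str) -> str:
    """Strip surrounding whitespace from a configured model URL.

    Hypothetical helper for illustration; not part of Kepler.
    """
    url = raw_value.strip()
    parts = urlsplit(url)
    # Reject values that are still not a usable http(s) URL after stripping.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid model URL: {raw_value!r}")
    # Whitespace inside the URL cannot be fixed by stripping; report it.
    if any(ch.isspace() for ch in url):
        raise ValueError(f"model URL contains whitespace: {raw_value!r}")
    return url


configured = (
    "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/"
    "main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/"
    "GradientBoostingRegressorTrainer_1.zip "  # note the trailing space
)
cleaned = clean_init_url(configured)
print(repr(cleaned))
```

The same check could be applied to NODE_COMPONENTS_INIT_URL; whether Kepler itself trims these values depends on the version in use.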
Could you share the error you encountered when installing with the Kepler operator? That should be the simplest and most up-to-date way to install.
It must be installed with the KeplerInternal CR. You can use the example below as a template. Also note that metric names changed significantly between Kepler v0.6 and v0.7. For v0.7, you may find more model choices here.
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: KeplerInternal
metadata:
  annotations:
    kepler.sustainable.computing.io/bpf-attach-method: libbpf
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/instance: kepler
    app.kubernetes.io/part-of: kepler-operator
  name: kepler
spec:
  exporter:
    deployment:
      image: quay.io/sustainable_computing_io/kepler:release-0.6.1-libbpf
      namespace: kepler-operator
  openshift:
    enabled: true
    dashboard:
      enabled: true
  modelServer:
    enabled: true
  estimator:
    node:
      components:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
      total:
        sidecar: true
        initUrl: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip
When I run:
make tools
make run
the following error appears:
./hack/tools.sh kustomize
✅ kustomize matching v3.8.7 already installed
{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:14:14Z GoOs:linux GoArch:amd64}
./hack/tools.sh controller-gen
✅ controller-gen matching Version: v0.12.1 already installed
Version: v0.12.1
/home/pesla/kepler-operator/tmp/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/..."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa0b38f]
goroutine 210 [running]:
go/types.(*Checker).handleBailout(0xc0020ae600, 0xc000787d40)
/snap/go/10553/src/go/types/check.go:367 +0x88
panic({0xbc51a0?, 0x12afca0?})
/snap/go/10553/src/runtime/panic.go:770 +0x132
go/types.(*StdSizes).Sizeof(0x0, {0xdc0038, 0x12b8420})
/snap/go/10553/src/go/types/sizes.go:228 +0x30f
go/types.(*Config).sizeof(...)
/snap/go/10553/src/go/types/sizes.go:333
go/types.representableConst.func1({0xdc0038?, 0x12b8420?})
/snap/go/10553/src/go/types/const.go:76 +0x9e
go/types.representableConst({0xdc63d0, 0x1284520}, 0xc0020ae600, 0x12b8420, 0x0)
/snap/go/10553/src/go/types/const.go:92 +0x192
go/types.(*Checker).arrayLength(0xc0020ae600, {0xdc46e8, 0xc001b5abc0?})
/snap/go/10553/src/go/types/typexpr.go:510 +0x2d3
go/types.(*Checker).typInternal(0xc0020ae600, {0xdc2d08, 0xc001b55ef0}, 0x0)
/snap/go/10553/src/go/types/typexpr.go:299 +0x49d
go/types.(*Checker).definedType(0xc0020ae600, {0xdc2d08, 0xc001b55ef0}, 0xc000787328?)
/snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).varType(0xc0020ae600, {0xdc2d08, 0xc001b55ef0})
/snap/go/10553/src/go/types/typexpr.go:145 +0x25
go/types.(*Checker).structType(0xc0020ae600, 0xc0020addd0, 0xc0020addd0?)
/snap/go/10553/src/go/types/struct.go:113 +0x19f
go/types.(*Checker).typInternal(0xc0020ae600, {0xdc2c78, 0xc001b3f2d8}, 0xc0020b2d20)
/snap/go/10553/src/go/types/typexpr.go:316 +0x1345
go/types.(*Checker).definedType(0xc0020ae600, {0xdc2c78, 0xc001b3f2d8}, 0xc8ed8b?)
/snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).typeDecl(0xc0020ae600, 0xc0020b2d20, 0xc001b58ac0, 0x0)
/snap/go/10553/src/go/types/decl.go:615 +0x44d
go/types.(*Checker).objDecl(0xc0020ae600, {0xdcb9e0, 0xc0020b2d20}, 0x0)
/snap/go/10553/src/go/types/decl.go:197 +0xa7f
go/types.(*Checker).packageObjects(0xc0020ae600)
/snap/go/10553/src/go/types/resolver.go:681 +0x425
go/types.(*Checker).checkFiles(0xc0020ae600, {0xc001a99ec0, 0x3, 0x3})
/snap/go/10553/src/go/types/check.go:408 +0x1a5
go/types.(*Checker).Files(...)
/snap/go/10553/src/go/types/check.go:372
sigs.k8s.io/controller-tools/pkg/loader.(*loader).typeCheck(0xc0003a7380, 0xc00040aaa0)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:286 +0x36a
sigs.k8s.io/controller-tools/pkg/loader.(*Package).NeedTypesInfo(0xc00040aaa0)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:99 +0x39
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check(0xc000a58a50, 0xc00040aaa0)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:268 +0x2b7
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check.func1(0x4b?)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:262 +0x53
created by sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check in goroutine 109
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:260 +0x1c5
make: *** [Makefile:96: generate] Error 2
@sunya-ch
make deploy OPERATOR_IMG=quay.io/sustainable_computing_io/kepler-operator:latest
also results in a runtime error
./hack/tools.sh kustomize
✅ kustomize matching v3.8.7 already installed
{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:14:14Z GoOs:linux GoArch:amd64}
./hack/tools.sh controller-gen
✅ controller-gen matching Version: v0.12.1 already installed
Version: v0.12.1
/home/pesla/kepler-operator/tmp/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/..."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa0b38f]
goroutine 138 [running]:
go/types.(*Checker).handleBailout(0xc001b9f000, 0xc002129d40)
/snap/go/10553/src/go/types/check.go:367 +0x88
panic({0xbc51a0?, 0x12afca0?})
/snap/go/10553/src/runtime/panic.go:770 +0x132
go/types.(*StdSizes).Sizeof(0x0, {0xdc0038, 0x12b8420})
/snap/go/10553/src/go/types/sizes.go:228 +0x30f
go/types.(*Config).sizeof(...)
/snap/go/10553/src/go/types/sizes.go:333
go/types.representableConst.func1({0xdc0038?, 0x12b8420?})
/snap/go/10553/src/go/types/const.go:76 +0x9e
go/types.representableConst({0xdc63d0, 0x1284520}, 0xc001b9f000, 0x12b8420, 0x0)
/snap/go/10553/src/go/types/const.go:92 +0x192
go/types.(*Checker).arrayLength(0xc001b9f000, {0xdc46e8, 0xc001bc1600?})
/snap/go/10553/src/go/types/typexpr.go:510 +0x2d3
go/types.(*Checker).typInternal(0xc001b9f000, {0xdc2d08, 0xc001bd1890}, 0x0)
/snap/go/10553/src/go/types/typexpr.go:299 +0x49d
go/types.(*Checker).definedType(0xc001b9f000, {0xdc2d08, 0xc001bd1890}, 0xc002129328?)
/snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).varType(0xc001b9f000, {0xdc2d08, 0xc001bd1890})
/snap/go/10553/src/go/types/typexpr.go:145 +0x25
go/types.(*Checker).structType(0xc001b9f000, 0xc0022a6e10, 0xc0022a6e10?)
/snap/go/10553/src/go/types/struct.go:113 +0x19f
go/types.(*Checker).typInternal(0xc001b9f000, {0xdc2c78, 0xc001bd4498}, 0xc001bea9b0)
/snap/go/10553/src/go/types/typexpr.go:316 +0x1345
go/types.(*Checker).definedType(0xc001b9f000, {0xdc2c78, 0xc001bd4498}, 0xc8ed8b?)
/snap/go/10553/src/go/types/typexpr.go:180 +0x37
go/types.(*Checker).typeDecl(0xc001b9f000, 0xc001bea9b0, 0xc001bd6440, 0x0)
/snap/go/10553/src/go/types/decl.go:615 +0x44d
go/types.(*Checker).objDecl(0xc001b9f000, {0xdcb9e0, 0xc001bea9b0}, 0x0)
/snap/go/10553/src/go/types/decl.go:197 +0xa7f
go/types.(*Checker).packageObjects(0xc001b9f000)
/snap/go/10553/src/go/types/resolver.go:681 +0x425
go/types.(*Checker).checkFiles(0xc001b9f000, {0xc0017e73e0, 0x3, 0x3})
/snap/go/10553/src/go/types/check.go:408 +0x1a5
go/types.(*Checker).Files(...)
/snap/go/10553/src/go/types/check.go:372
sigs.k8s.io/controller-tools/pkg/loader.(*loader).typeCheck(0xc00025d380, 0xc00045e020)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:286 +0x36a
sigs.k8s.io/controller-tools/pkg/loader.(*Package).NeedTypesInfo(0xc00045e020)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/loader.go:99 +0x39
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check(0xc000b44660, 0xc00045e020)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:268 +0x2b7
sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check.func1(0x45?)
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:262 +0x53
created by sigs.k8s.io/controller-tools/pkg/loader.(*TypeChecker).check in goroutine 52
/home/pesla/go/pkg/mod/sigs.k8s.io/controller-tools@v0.12.1/pkg/loader/refs.go:260 +0x1c5
make: *** [Makefile:96: generate] Error 2
Both the default branch and the release 0.10.0 branch raise this error.
@sunya-ch Great, removing the space from the model URL did the trick: using the kepler repo (not the operator) with your proposed tags and the config tweak, I now get a 200 response!
The output looks like this:
│ set NODE_COMPONENTS_ESTIMATOR to true. │
│ set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip. │
│ set NODE_TOTAL_ESTIMATOR to true. │
│ set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip. │
│ clean socket │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_1.zip for AbsPower │
│ <Response [200]> │
│ failed to get model from request {"metrics":["bpf_cpu_time_ms","bpf_page_cache_hit","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_ref_cycles","cpu_instructions","cache_miss","task_clock_ms"],"values":[[0,0,0,0,0,0,0,0,0,0]],"output_type":"AbsPower","sour │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]" │
│ │
│ GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"
Is it normal that it constantly fails to predict?
The model server output looks like this currently:
│ 2024-03-20T18:56:55.464098822Z 10.1.194.163 - - [20/Mar/2024 18:56:55] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:56:57.530504039Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:57.532526857Z 10.1.28.241 - - [20/Mar/2024 18:56:57] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:56:58.076840641Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:58.078481004Z 10.1.51.77 - - [20/Mar/2024 18:56:58] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:56:58.474284952Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:56:58.475664337Z 10.1.194.163 - - [20/Mar/2024 18:56:58] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:01.082462471Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:01.083728960Z 10.1.51.77 - - [20/Mar/2024 18:57:01] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:01.472447436Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:01.472915576Z 10.1.194.163 - - [20/Mar/2024 18:57:01] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:04.073526318Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:04.074460195Z 10.1.51.77 - - [20/Mar/2024 18:57:04] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:04.475021501Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:04.476509315Z 10.1.194.163 - - [20/Mar/2024 18:57:04] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:06.381812892Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:06.381849979Z 10.1.223.112 - - [20/Mar/2024 18:57:06] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:07.087097512Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:07.088653211Z 10.1.51.77 - - [20/Mar/2024 18:57:07] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:07.473618960Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:07.474868034Z 10.1.194.163 - - [20/Mar/2024 18:57:07] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:10.121285778Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:10.121316798Z 10.1.51.77 - - [20/Mar/2024 18:57:10] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:10.475058495Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:10.476509505Z 10.1.194.163 - - [20/Mar/2024 18:57:10] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:13.072093731Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:13.072920807Z 10.1.51.77 - - [20/Mar/2024 18:57:13] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:13.468112201Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:13.469282490Z 10.1.194.163 - - [20/Mar/2024 18:57:13] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:15.777616891Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:15.777991780Z 10.1.28.242 - - [20/Mar/2024 18:57:15] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:16.071828055Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:16.073108087Z 10.1.51.77 - - [20/Mar/2024 18:57:16] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:16.468917521Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:16.469290678Z 10.1.194.163 - - [20/Mar/2024 18:57:16] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:19.073830740Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:19.075077697Z 10.1.51.77 - - [20/Mar/2024 18:57:19] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:19.465202820Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:19.465789186Z 10.1.194.163 - - [20/Mar/2024 18:57:19] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:22.471949272Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:22.472526374Z 10.1.194.163 - - [20/Mar/2024 18:57:22] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:25.465108094Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:25.465482514Z 10.1.194.163 - - [20/Mar/2024 18:57:25] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:28.066184834Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:28.066621562Z 10.1.221.178 - - [20/Mar/2024 18:57:28] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:28.476831709Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:28.478344292Z 10.1.194.163 - - [20/Mar/2024 18:57:28] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:38.246367658Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:38.247878462Z 10.1.51.79 - - [20/Mar/2024 18:57:38] "POST /model HTTP/1.1" 400 - │
│ 2024-03-20T18:57:45.291626589Z get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': │
│ 2024-03-20T18:57:45.292329241Z 10.1.194.162 - - [20/Mar/2024 18:57:45] "POST /model HTTP/1.1" 400 -
@Tobias-Pe There is a version conflict between the power model and the Kepler metric exporter.
The metric bpf_cpu_time_us (v0.6) was renamed to bpf_cpu_time_ms (v0.7). You may refer to the PRs https://github.com/sustainable-computing-io/kepler/pull/1214 and https://github.com/sustainable-computing-io/kepler-model-server/pull/227.
It seems you are installing Kepler v0.7 (the make deploy in the kepler-model-server deployment uses 0.7), so you must use a power model from kepler-model-db v0.7 (check your node type: https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.7/specpower/README.md).
@sthaha Do you have any idea about the operator deployment error above?
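The "fail to predict" errors in the estimator log are consistent with this rename: the v0.6 model asks for a feature column that a v0.7 exporter no longer sends. A minimal sketch of that mismatch, assuming a hypothetical helper `missing_features` (not part of kepler-model-server) and using the metric names from the logs above:

```python
def missing_features(model_features, exported_metrics):
    """Return the model features that the exporter does not provide.

    Hypothetical helper for illustration; not part of kepler-model-server.
    """
    exported = set(exported_metrics)
    return [name for name in model_features if name not in exported]


# Metric names sent by the exporter, copied from the request in the log.
exported_metrics = [
    "bpf_cpu_time_ms", "bpf_page_cache_hit", "bpf_net_tx_irq",
    "bpf_net_rx_irq", "bpf_block_irq", "cpu_cycles", "cpu_ref_cycles",
    "cpu_instructions", "cache_miss", "task_clock_ms",
]

# Feature required by the v0.6 BPFOnly model, per the error message.
v06_model_features = ["bpf_cpu_time_us"]
# Assumed v0.7 counterpart after the rename.
v07_model_features = ["bpf_cpu_time_ms"]

print(missing_features(v06_model_features, exported_metrics))
print(missing_features(v07_model_features, exported_metrics))
```

The first call prints `['bpf_cpu_time_us']` and the second prints `[]`, which mirrors why the v0.6 model cannot predict from v0.7 exporter data while a v0.7 model can.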
@Tobias-Pe From the logs of your make deploy run above, I see that controller-gen (the tool used to generate code) seems to have crashed. This happened during the build phase of the operator, not while the operator was running, so I suspect something is wrong with the setup of the tools / Golang.
Regarding make deploy: unlike in the past, it no longer works since the webhook was added to the operator, because the webhook requires certs to be mounted. On OpenShift (or with OLM installed), these certs are created automatically by OLM, so cert-manager is not required.
It would be a great contribution to the operator project if someone has the bandwidth to fix this. Most of the cert-manager config is already checked into the project.
@sthaha
I am using a multi-node MicroK8s cluster, which is pretty close to a vanilla setup without any certificate management.
Is it some add-on?
Information about cert-manager can be found here - https://cert-manager.io/docs/installation/kubectl/
You can find information about usage of cert-manager in kubernetes webhooks here - https://book.kubebuilder.io/cronjob-tutorial/cert-manager And enabling cert-manager for kepler-operator can be found here -https://github.com/sustainable-computing-io/kepler-operator/blob/v1alpha1/config/default/kustomization.yaml#L22
I haven't tried this out myself, but I think it may be worth copying config/default to config/k8s and then configuring the copy to use cert-manager. Running kustomize on config/k8s should then produce all the resources required to install the operator on vanilla k8s with cert-manager installed.
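The copy step suggested above could be sketched as follows. This is only an illustration under stated assumptions: copy_overlay is a hypothetical helper (not part of the operator tooling), the default paths assume the script runs from the root of a kepler-operator checkout, and it has not been tried against the operator repository. The cert-manager changes in the copied kustomization.yaml still have to be made by hand.

```python
import shutil
from pathlib import Path


def copy_overlay(src: str = "config/default", dst: str = "config/k8s") -> Path:
    """Copy a kustomize overlay directory so it can be edited independently.

    Hypothetical helper for illustration only; the default paths follow the
    suggestion in this thread. shutil.copytree raises FileExistsError if the
    destination already exists, so an existing overlay is never overwritten.
    """
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"overlay not found: {src_path}")
    shutil.copytree(src_path, dst_path)
    return dst_path
```

After calling copy_overlay() and enabling cert-manager in config/k8s/kustomization.yaml, running kustomize build config/k8s should print the rendered manifests, which can then be reviewed before applying them to the cluster.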
Regarding the model, I'm now working on the model version inconsistency. You can track the progress in this issue: https://github.com/sustainable-computing-io/kepler-model-server/issues/242
The following PR should fix the compatibility issue:
Closing this for now. If @Tobias-Pe has any follow-up issues, we will reopen it.
@sunya-ch
Estimator:
│ set NODE_COMPONENTS_ESTIMATOR to true. │
│ set NODE_COMPONENTS_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/ec2/intel_rapl/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip. │
│ set NODE_TOTAL_ESTIMATOR to true. │
│ set NODE_TOTAL_INIT_URL to https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip. │
│ clean socket │
│ get archived model │
│ get init url https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip │
│ try getting archieved model from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip for AbsPower │
│ <Response [200]> │
│ load model from config: /mnt/download/acpi/AbsPower │
Model Server:
│ try downloading archieved pipeline from URL: https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6.zip │
│ <Response [200]> │
│ initial pipeline is loaded to /mnt/models/default │
│ * Serving Flask app 'model_server' (lazy loading) │
│ * Environment: production │
│ WARNING: This is a development server. Do not use it in a production deployment. │
│ Use a production WSGI server instead. │
│ * Debug mode: off │
│ WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead. │
│ * Running on all addresses (0.0.0.0) │
│ * Running on http://127.0.0.1:8100 │
│ * Running on http://10.1.221.171:8100 │
│ Press CTRL+C to quit │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi', │
│ 10.1.194.132 - - [08/Apr/2024 06:50:35] "POST /model HTTP/1.1" 400 - │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_cycles', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi', │
│ 10.1.223.75 - - [08/Apr/2024 06:50:36] "POST /model HTTP/1.1" 400 - │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi', │
│ 10.1.51.91 - - [08/Apr/2024 06:50:37] "POST /model HTTP/1.1" 400 - │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'task_clock_ms', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi', │
│ 10.1.28.211 - - [08/Apr/2024 06:50:40] "POST /model HTTP/1.1" 400 - │
│ get request /model: {'metrics': ['bpf_cpu_time_ms', 'bpf_page_cache_hit', 'bpf_net_tx_irq', 'bpf_net_rx_irq', 'bpf_block_irq', 'task_clock_ms', 'cpu_cycles', 'cpu_ref_cycles', 'cpu_instructions', 'cache_miss', 'cpu_architecture'], 'output_type': 'AbsPower', 'source': 'acpi', │
│ 10.1.221.170 - - [08/Apr/2024 06:50:51] "POST /model HTTP/1.1" 400 -
If I read the logs correctly, the model server is still trying to get version 0.6.
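A quick way to confirm the mismatch is to pull the version segment out of the URLs that appear in the two logs. The sketch below is not part of Kepler: model_db_version is a hypothetical helper, and it assumes the version is the "vX.Y" path segment directly after "/models/" in a kepler-model-db URL, which matches the URLs shown above.

```python
import re
from typing import Optional

# Assumption: the version is the "vX.Y" path segment right after "/models/".
_VERSION_RE = re.compile(r"/models/(v\d+\.\d+)/")


def model_db_version(url: str) -> Optional[str]:
    """Return the model-db version segment of a URL, or None if absent."""
    match = _VERSION_RE.search(url)
    return match.group(1) if match else None


# URLs copied from the estimator and model server logs in this thread.
urls = {
    "estimator": "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.7/specpower/acpi/AbsPower/BPFOnly/GradientBoostingRegressorTrainer_0.zip",
    "model server": "https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-db/main/models/v0.6/nx12/std_v0.6.zip",
}

versions = {name: model_db_version(url) for name, url in urls.items()}
for name, version in versions.items():
    print(f"{name}: {version}")

# More than one distinct version means the components disagree.
if len(set(versions.values())) > 1:
    print("version mismatch between components")
```

With the URLs from these logs, the estimator resolves to v0.7 and the model server to v0.6, so the mismatch message is printed.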
What happened?
kubectl create doesn't work
What did you expect to happen?
The autogenerated deployment should work
How can we reproduce it (as minimally and precisely as possible)?
make build-manifest OPTS="PROMETHEUS_DEPLOY HIGH_GRANULARITY ESTIMATOR_SIDECAR_DEPLOY MODEL_SERVER_DEPLOY TRAINER_DEPLOY BM_DEPLOY"
microk8s kubectl create -f _output/generated-manifest/deployment.yaml
Anything else we need to know?
No response