Open flpanbin opened 1 month ago
@WangZzzhe Could you take a look?
@flpanbin Could you provide some node information? 1) the node's total requests and load before the test pod was created; 2) the test pod's requests and load.
Node resource info before creating the pod:
apiVersion: v1
kind: Node
metadata:
  annotations:
    katalyst.kubewharf.io/cpu_overcommit_ratio: "2.5"
    katalyst.kubewharf.io/memory_overcommit_ratio: "2.5"
    katalyst.kubewharf.io/original_allocatable_cpu: "16"
    katalyst.kubewharf.io/original_allocatable_memory: 32676068Ki
    katalyst.kubewharf.io/original_capacity_cpu: "16"
    katalyst.kubewharf.io/original_capacity_memory: 32778468Ki
    katalyst.kubewharf.io/overcommit_allocatable_cpu: 27840m
    katalyst.kubewharf.io/overcommit_allocatable_memory: 38479337676800m
    katalyst.kubewharf.io/overcommit_capacity_cpu: 27840m
    katalyst.kubewharf.io/overcommit_capacity_memory: 38599923916800m
    katalyst.kubewharf.io/realtime_cpu_overcommit_ratio: "1.74"
    katalyst.kubewharf.io/realtime_memory_overcommit_ratio: "1.15"
    ...
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    katalyst.kubewharf.io/overcommit_node_pool: overcommit-demo
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: g-master2
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    ...
  name: g-master2
status:
  addresses:
  - address: g-master2
    type: Hostname
  allocatable:
    cpu: 27840m
    ephemeral-storage: "136351265362"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 38479337676800m
    pods: "180"
  capacity:
    cpu: 27840m
    ephemeral-storage: 144483Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 38599923916800m
    pods: "180"
testpod1.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  namespace: katalyst-system
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - g-master2
  containers:
  - name: testcontainer1
    image: polinux/stress:latest
    command: ["stress"]
    args: ["--cpu", "4", "--timeout", "6000"]
    resources:
      limits:
        cpu: 8
        memory: 8Gi
      requests:
        cpu: 4
        memory: 8Gi
  tolerations:
  - effect: NoSchedule
    key: test
    value: test
    operator: Equal
@flpanbin
For memory this is expected: the memory request increased while the load stayed the same.
In theory the CPU ratio can rise right after the pod is created, before the stress workload ramps up, but once it stabilizes it should be lower than before. You could raise the log verbosity to 6 and check whether the collected metrics are accurate.
https://github.com/kubewharf/katalyst-core/blob/main/pkg/agent/sysadvisor/plugin/overcommitmentaware/realtime/realtime.go#L154
https://github.com/kubewharf/katalyst-core/blob/main/pkg/agent/sysadvisor/plugin/overcommitmentaware/realtime/realtime.go#L158
Thanks for the quick reply. I'll keep watching the logs, but I have a few questions about your answer:
@flpanbin With the load unchanged, resource requests increase and the node's allocatable resources shrink, so the node has to overcommit more resources to reach the target load. For the exact rule, see https://github.com/kubewharf/katalyst-core/blob/main/pkg/agent/sysadvisor/plugin/overcommitmentaware/realtime/realtime.go#L286
Thanks, I'll dig into it.
@WangZzzhe After some digging, this looks like a metrics-collection problem: the usage that goes into the overcommit-ratio calculation is 0.
I0609 01:45:03.275865 1 realtime.go:335] resource cpu request: 11964, allocatable: 16000, usage: 0, targetLoad: 0.6, existLoad: 0.4, overcommitRatio: 2.24775
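The numbers in that log line are consistent with a ratio of the form request/allocatable + targetLoad/existLoad, where existLoad appears to be floored at a minimum when usage is zero (a broken usage reading of 0 would otherwise push the ratio toward infinity). A minimal Go sketch of that inferred arithmetic; the function and parameter names are illustrative, not the actual realtime.go code:

```go
package main

import "fmt"

// overcommitRatio is a hypothetical reconstruction of the calculation,
// inferred only from the values printed in the log line above.
func overcommitRatio(request, allocatable, usage, targetLoad, minLoad float64) float64 {
	// existLoad is the observed load, floored at minLoad so a zero or
	// missing usage sample cannot blow up the ratio.
	existLoad := usage / allocatable
	if existLoad < minLoad {
		existLoad = minLoad
	}
	// Overcommit enough capacity that, at the target load, the node can
	// absorb the current requests on top of the observed load.
	return request/allocatable + targetLoad/existLoad
}

func main() {
	// Values from the log: request=11964, allocatable=16000, usage=0,
	// targetLoad=0.6, existLoad floored to 0.4.
	fmt.Printf("%.5f\n", overcommitRatio(11964, 16000, 0, 0.6, 0.4)) // prints 2.24775
}
```

With these inputs, 11964/16000 + 0.6/0.4 = 0.74775 + 1.5 = 2.24775, which matches the overcommitRatio in the log even though usage is 0 — supporting the point that the ratio rose because requests rose while the measured load stayed pinned at the floor.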
overcommit-katalyst-agent logs:
I0609 03:01:06.734172 1 provisioner.go:84] [malachite] heartbeat
E0609 03:01:06.738246 1 provisioner.go:111] [malachite] malachite is unhealthy: invalid http response status code 500, url: http://localhost:9002/api/v1/system/compute
I0609 03:01:06.738555 1 round_trippers.go:553] GET https://10.6.202.113:10250/stats/summary?timeout=10s 403 Forbidden in 3 milliseconds
E0609 03:01:06.739508 1 provisioner.go:65] failed to update stats/summary from kubelet: "failed to get kubelet config for summary api, error: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)"
I0609 03:01:08.043645 1 realtime.go:155] [overcommitment-aware-realtime] sumUpPodsResources, cpu: 1845m, memory: 3715141632
E0609 03:01:08.043814 1 store_util.go:98] failed to get metric pod prometheus-insight-agent-kube-prometh-prometheus-0, container prometheus, metric cpu.usage.container, err: [MetricStore] empty map
E0609 03:01:08.044067 1 store_util.go:98] failed to get metric pod prometheus-insight-agent-kube-prometh-prometheus-0, container config-reloader, metric cpu.usage.container, err: [MetricStore] empty map
The malachite logs show errors, so it does not appear to be working properly:
panbin@panbindeMacBook-Pro ~ % kubectl logs malachite-xk8n9 -n malachite-system -f
2024-06-09T02:03:07.481004862+00:00 - [ERROR] server/src/main.rs:187 [Panic] lib/src/cpu/processor.rs:464: called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }
2024-06-09T02:03:07.489192152+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:11.271581881+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:11.271754576+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
2024-06-09T02:03:16.338537826+00:00 - [ERROR] server/src/main.rs:187 [Panic] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/once_cell-1.17.0/src/lib.rs:1276: Lazy instance has previously been poisoned
2024-06-09T02:03:16.338612068+00:00 - [ERROR] /root/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/rocket-0.5.0-rc.2/src/server.rs:56 Handler compute panicked.
... (the same "Lazy instance has previously been poisoned" panic and "Handler compute panicked." error repeat every ~5 seconds)
It may be related to the Linux kernel version. Environment info:
[root@g-master1 ~]# uname -a
Linux g-master1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@g-master1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Kubernetes and containerd versions:
[root@g-master1 ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:48:26Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:42:11Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
[root@g-master1 ~]# containerd -v
containerd github.com/containerd/containerd v1.7.6 091922f03c2762540fd057fba91260237ff86acb
I set up another environment using kubewharf enhanced kubernetes, and the dynamic overcommitment feature works correctly there. So it seems there are requirements on the Linux kernel version and the containerd environment? Environment info:
root@ubuntu:~/katalyst# uname -a
Linux ubuntu 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@ubuntu:~/katalyst# kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.6.202.170 Ready control-plane 26m v1.24.6-kubewharf.8
root@ubuntu:~/katalyst# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
root@ubuntu:~/katalyst# containerd -v
containerd github.com/containerd/containerd v1.4.12 7b11cfaabd73bb80907dd23182b9347b4245eb5d
@flpanbin malachite depends on eBPF, so a 3.10 kernel probably won't work; 4.19+ should be fine.
What happened?
I tried out the dynamic overcommitment feature following its documentation, but after creating testpod1 to increase CPU consumption, the CPU overcommit ratio cpu_overcommit_ratio went up instead of down.
With no pods running, the KCNR of g-master2:
After creating testpod1, the KCNR of g-master2 again:
katalyst version:
What did you expect to happen?
After testpod1 is created, the node's CPU overcommit ratio katalyst.kubewharf.io/cpu_overcommit_ratio should decrease.
How can we reproduce it (as minimally and precisely as possible)?
Follow this document: https://gokatalyst.io/docs/user-guide/resource-overcommitment/dynamic-overcommitment/
Software version