Open flpanbin opened 4 weeks ago
@flpanbin Which mode are you running in (QRM or ORM)? Also, could you provide the node's NUMA information?
"katalyst.kubewharf.io/memory_enhancement": '{
"numa_binding": "true",
"numa_exclusive": "true"
}'
With numa_exclusive = true, the pod will exclusively occupy the entire NUMA node.
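For reference, a minimal sketch of a dedicated_cores pod manifest carrying these annotations. The pod/container names, image, and resource sizes are assumptions for illustration, not the reporter's actual dedicated_cores_pod.yaml:

```shell
# Write a hypothetical dedicated_cores pod manifest; names and sizes
# are illustrative only.
cat > dedicated-pod-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: numa-dedicated-example
  annotations:
    "katalyst.kubewharf.io/qos_level": dedicated_cores
    "katalyst.kubewharf.io/memory_enhancement": '{
      "numa_binding": "true",
      "numa_exclusive": "true"
    }'
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
EOF
# kubectl apply -f dedicated-pod-example.yaml
```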
Running in QRM mode. The node's NUMA information:
root@ubuntu:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 32145 MB
node 0 free: 30247 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32248 MB
node 1 free: 30001 MB
node distances:
node 0 1
0: 10 20
1: 20 10
@WangZzzhe Looking at the logs on the corresponding node, there is an error: err: rpc error: code = Unknown desc = hint is empty. Not sure whether this is related.
I0606 01:28:02.100303 1 manager.go:417] [ORM] addContainer, pod: numa-dedicated-normal-pod, container: stress
W0606 01:28:02.100337 1 manager.go:488] [ORM] pod: default/numa-dedicated-normal-pod; container: stress allocate resource: cpu without numa nodes affinity
I0606 01:28:02.101041 1 policy.go:683] "[katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).Allocate] called" podNamespace="default" podName="numa-dedicated-normal-pod" containerName="stress" podType="" podRole="" containerType="MAIN" qosLevel="dedicated_cores" numCPUsInt=1 numCPUsFloat64=1 isDebugPod=false
E0606 01:28:02.101136 1 policy_allocation_handlers.go:304] "[katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).dedicatedCoresWithNUMABindingAllocationHandler] unable to allocate CPUs" err="hint is empty" podNamespace="default" podName="numa-dedicated-normal-pod" containerName="stress" numCPUsInt=1 numCPUsFloat64=1
E0606 01:28:02.101727 1 manager.go:501] [ORM] addContainer allocate fail, pod numa-dedicated-normal-pod, container stress, err: rpc error: code = Unknown desc = hint is empty
E0606 01:28:02.101786 1 manager.go:605] [ORM] re addContainer fail, pod numa-dedicated-normal-pod container stress, err: [ORM] addContainer allocate fail, pod numa-dedicated-normal-pod, container stress, err: rpc error: code = Unknown desc = hint is empty
I0606 01:28:02.102672 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/.3396218561\": CREATE"
I0606 01:28:02.102733 1 plugin_watcher.go:174] "Ignoring file (starts with '.')" path=".3396218561"
I0606 01:28:02.105100 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/kubelet_qrm_checkpoint\": CREATE"
I0606 01:28:02.105244 1 plugin_watcher.go:184] "Ignoring non socket file" path="kubelet_qrm_checkpoint"
I0606 01:28:02.216433 1 provisioner.go:84] [malachite] heartbeat
I0606 01:28:02.217117 1 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: katalyst-agent/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://10.6.202.153:10250/stats/summary?timeout=10s'
I0606 01:28:02.218075 1 round_trippers.go:553] GET https://10.6.202.153:10250/stats/summary?timeout=10s 403 Forbidden in 0 milliseconds
I0606 01:28:02.218100 1 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 0 ms Duration 0 ms
I0606 01:28:02.218192 1 round_trippers.go:577] Response Headers:
I0606 01:28:02.218217 1 round_trippers.go:580] Content-Type: text/plain; charset=utf-8
I0606 01:28:02.218231 1 round_trippers.go:580] Content-Length: 114
I0606 01:28:02.218242 1 round_trippers.go:580] Date: Thu, 06 Jun 2024 01:28:02 GMT
I0606 01:28:02.218272 1 request.go:1154] Response Body: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)
E0606 01:28:02.218319 1 provisioner.go:65] failed to update stats/summary from kubelet: "failed to get kubelet config for summary api, error: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)"
I0606 01:28:02.234127 1 pod.go:206] get metric mem.usage.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 3
I0606 01:28:02.234210 1 pod.go:206] get metric cpu.load.1min.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 2
I0606 01:28:02.234227 1 pod.go:206] get metric cpu.usage.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 1
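Separately, the 403 Forbidden on /stats/summary in the log above indicates the katalyst-agent service account lacks RBAC permission for the nodes/stats subresource. A hedged sketch of a ClusterRole and binding that would grant it; the object names are assumptions, and the official chart may already be expected to ship an equivalent:

```shell
# Sketch of an RBAC grant for the kubelet summary API. The ClusterRole
# and binding names are hypothetical; apply only if the chart did not
# already create an equivalent grant.
cat > katalyst-agent-node-stats.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: katalyst-agent-node-stats
rules:
- apiGroups: [""]
  resources: ["nodes/stats", "nodes/proxy"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: katalyst-agent-node-stats
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: katalyst-agent-node-stats
subjects:
- kind: ServiceAccount
  name: katalyst-agent
  namespace: katalyst-system
EOF
# kubectl apply -f katalyst-agent-node-stats.yaml
```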
@WangZzzhe The agent's arguments:
template:
  metadata:
    annotations:
      katalyst.kubewharf.io/qos_level: system_cores
    creationTimestamp: null
    labels:
      app: katalyst-agent
      app.kubernetes.io/instance: katalyst-colocation
      app.kubernetes.io/name: katalyst-agent
  spec:
    containers:
    - args:
      - --plugin-registration-dir=/var/lib/katalyst/plugin-socks
      - --checkpoint-manager-directory=/var/lib/katalyst/plugin-checkpoint
      - --locking-file=/tmp/katalyst_colocation_katalyst_agent_lock
      - --node-name=$(MY_NODE_NAME)
      - --node-address=$(MY_NODE_ADDRESS)
      - --agents=*
      - --cpu-resource-plugin-advisor=true
      - --enable-cpu-pressure-eviction=true
      - --enable-kubelet-secure-port=true
      - --enable-reclaim=true
      - --enable-report-topology-policy=true
      - --eviction-plugins=*
      - --memory-resource-plugin-advisor=true
      - --orm-devices-provider=kubelet
      - --orm-kubelet-pod-resources-endpoints=/var/lib/kubelet/pod-resources/kubelet.sock
      - --orm-resource-names-map=resource.katalyst.kubewharf.io/reclaimed_millicpu=cpu,resource.katalyst.kubewharf.io/reclaimed_memory=memory
      - --pod-resources-server-endpoint=/var/lib/katalyst/pod-resources/kubelet.sock
      - --qrm-socket-dirs=/var/lib/katalyst/plugin-socks
      - --topology-policy-name=none
      - --v=9
      command:
      - katalyst-agent
System version information:
root@ubuntu:~# uname -a
Linux ubuntu 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@flpanbin Change --topology-policy-name=none to --topology-policy-name=best-effort and try again. Under the none policy, no resource allocation is performed.
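The suggested flag change can be scripted against an exported DaemonSet manifest; a minimal sketch, where exporting the manifest with kubectl beforehand and re-applying it afterwards is assumed:

```shell
# Flip the topology policy flag in a saved agent manifest in place.
# The manifest would come from something like:
#   kubectl -n katalyst-system get ds <agent-daemonset> -o yaml > agent-ds.yaml
switch_topology_policy() {
  manifest="$1"
  sed -i 's/--topology-policy-name=none/--topology-policy-name=best-effort/' "$manifest"
}
# Afterwards: kubectl apply -f agent-ds.yaml (and let the agent pods restart).
```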
@WangZzzhe I set --topology-policy-name=best-effort, but it still doesn't work.
root@ubuntu:~/katalyst# ps -ef | grep katalyst-agent
root 2423499 2423425 15 Jun05 ? 01:31:28 katalyst-metric --leader-elect-resource-name=katalyst-colocation-katalyst-metric --leader-elect-resource-namespace=katalyst-system --collector-pod-selector=app=katalyst-agent
root 2604106 2604047 3 02:59 ? 00:00:05 katalyst-agent --plugin-registration-dir=/var/lib/katalyst/plugin-socks --checkpoint-manager-directory=/var/lib/katalyst/plugin-checkpoint --locking-file=/tmp/katalyst_colocation_katalyst_agent_lock --node-name=node1 --node-address=10.6.202.152 --agents=* --cpu-resource-plugin-advisor=true --enable-cpu-pressure-eviction=true --enable-kubelet-secure-port=true --enable-reclaim=true --enable-report-topology-policy=true --eviction-plugins=* --memory-resource-plugin-advisor=true --orm-devices-provider=kubelet --orm-kubelet-pod-resources-endpoints=/var/lib/kubelet/pod-resources/kubelet.sock --orm-resource-names-map=resource.katalyst.kubewharf.io/reclaimed_millicpu=cpu,resource.katalyst.kubewharf.io/reclaimed_memory=memory --pod-resources-server-endpoint=/var/lib/katalyst/pod-resources/kubelet.sock --qrm-socket-dirs=/var/lib/katalyst/plugin-socks --topology-policy-name=best-effort --v=9
@WangZzzhe Any ideas on how to troubleshoot this? What additional information can I provide to help locate the problem?
I installed Kubewharf enhanced kubernetes following the documentation, and installed the related components with the helm command from the docs:
helm install katalyst-colocation -n katalyst-system --create-namespace kubewharf/katalyst-colocation
root@ubuntu:~/katalyst/examples# kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.6.202.151 Ready control-plane 13d v1.24.6-kubewharf.8
node1 Ready <none> 13d v1.24.6-kubewharf.8
node2 Ready <none> 13d v1.24.6-kubewharf.8
root@ubuntu:~/katalyst/examples# containerd -v
containerd github.com/containerd/containerd v1.4.12 7b11cfaabd73bb80907dd23182b9347b4245eb5d
@flpanbin Judging from the katalyst-agent startup arguments, it is running in ORM mode; the qosResourceManager in kubelet is not taking effect. From the KCNR information, this pod has already been allocated successfully and exclusively occupies the 24 cores of one NUMA node.
But then why does the container's cpuset still show 0-47? It should have been allocated 24 cores.
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/cpuset.cpus
0-47
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/80c10c4e45cbe275bafac9cf8b74ef72e6c0b9565d33389fce86f6e7c3737843/cpuset.cpus
0-47
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/604fc985574b2ff117630b699dd4e9ac77c95d316b6ece904385694db0090268/cpuset.cpus
0-47
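A small helper to dump cpuset.cpus for every cgroup under a pod's directory, instead of cat-ing each path by hand (cgroup v1 layout, matching the paths shown above; the pod path is an assumption to substitute):

```shell
# Print cpuset.cpus for each cgroup below a pod's cpuset directory
# (cgroup v1 layout, as in the session above).
print_pod_cpusets() {
  pod_cgroup="$1"   # e.g. /sys/fs/cgroup/cpuset/kubepods/pod<uid>
  find "$pod_cgroup" -name cpuset.cpus | while read -r f; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
  done
}

# Usage (pod UID taken from the session above):
# print_pod_cpusets /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba
```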
@WangZzzhe There are no error logs from syncContainer, but the logs show that the allocation result is 0-47:
cpu for pod: default/dedicated-normal-pod2, container: stress is {CpusetCpus false true 48 0-47 map[] map[] nil {} 0}
I0606 06:28:10.191773 1 manager.go:536] [ORM] reconcile...
I0606 06:28:10.193289 1 policy.go:406] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).GetResourcesAllocation] called
I0606 06:28:10.194287 1 manager.go:550] [ORM] skip getResourceAllocation of resource: resource.katalyst.kubewharf.io/net_bandwidth, because plugin needn't reconciling
I0606 06:28:10.194349 1 manager.go:519] [ORM] syncContainer, pod: katalyst-colocation-katalyst-agent-h6gj8, container: katalyst-agent
I0606 06:28:10.194375 1 manager.go:522] got pod katalyst-colocation-katalyst-agent-h6gj8 container katalyst-agent resources nil
I0606 06:28:10.194450 1 manager.go:519] [ORM] syncContainer, pod: katalyst-colocation-katalyst-metric-85c47ff4bf-7lw4g, container: katalyst-metric
I0606 06:28:10.194472 1 manager.go:522] got pod katalyst-colocation-katalyst-metric-85c47ff4bf-7lw4g container katalyst-metric resources nil
I0606 06:28:10.194550 1 manager.go:630] [ORM] allocation information for resources memory - accompanying resource: memory for pod: default/dedicated-normal-pod2, container: stress is {CpusetMems false true 6.7522060288e+10 0-1 map[] map[] nil {} 0}
I0606 06:28:10.194723 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/dedicated-normal-pod2, container: stress is {CpusetCpus false true 48 0-47 map[] map[] nil {} 0}
I0606 06:28:10.194822 1 manager.go:519] [ORM] syncContainer, pod: dedicated-normal-pod2, container: stress
I0606 06:28:10.195592 1 manager.go:519] [ORM] syncContainer, pod: malachite-fvp5p, container: malachite
I0606 06:28:10.195679 1 manager.go:522] got pod malachite-fvp5p container malachite resources nil
I0606 06:28:10.195707 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_millicpu to cpu
I0606 06:28:10.195784 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_memory to memory
I0606 06:28:10.195804 1 manager.go:630] [ORM] allocation information for resources memory - accompanying resource: memory for pod: default/reclaimed-large-pod-node1, container: stress is {CpusetMems false true 0 0-1 map[] map[] nil {} 0}
I0606 06:28:10.195919 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_millicpu to cpu
I0606 06:28:10.195988 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/reclaimed-large-pod-node1, container: stress is {CpusetCpus false true 40 4-23,28-47 map[] map[] nil {} 0}
I0606 06:28:10.196018 1 manager.go:519] [ORM] syncContainer, pod: reclaimed-large-pod-node1, container: stress
I0606 06:28:10.197415 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/.2613201109\": CREATE"
I0606 06:28:10.197480 1 plugin_watcher.go:174] "Ignoring file (starts with '.')" path=".2613201109"
I0606 06:28:10.199087 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/kubelet_qrm_checkpoint\": CREATE"
I0606 06:28:10.199115 1 plugin_watcher.go:184] "Ignoring non socket file" path="kubelet_qrm_checkpoint"
I0606 06:28:10.287012 1 manager.go:312] genericSync
I0606 06:28:10.288051 1 manager.go:387] "GetReportContent" costs="928.548µs" pluginName="headroom-reporter-plugin"
I0606 06:28:10.288224 1 manager.go:387] "GetReportContent" costs="121.926µs" pluginName="system-reporter-plugin"
I0606 06:28:10.288510 1 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: katalyst-agent/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://10.6.202.152:10250/pods?timeout=10s'
@WangZzzhe /var/lib/katalyst/qrm_advisor/cpu_plugin_state shows that allocation was performed multiple times, and the final allocation result is `"allocation_result": "0-47"`:
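To see at a glance what the state file recorded over time, the allocation results can be pulled out with grep; a sketch, parameterised so it also works on a file copied off the node:

```shell
# List every "allocation_result" recorded in the cpu plugin state file,
# with a count per distinct value.
list_allocation_results() {
  state_file="$1"   # e.g. /var/lib/katalyst/qrm_advisor/cpu_plugin_state
  grep -o '"allocation_result": "[^"]*"' "$state_file" | sort | uniq -c
}

# Usage on the node:
# list_allocation_results /var/lib/katalyst/qrm_advisor/cpu_plugin_state
```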
{
"policyName": "dynamic",
"machineState": {
"0": {
"default_cpuset": "",
"allocated_cpuset": "0-23",
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "0-23",
"original_allocation_result": "0-23",
"topology_aware_assignments": {
"0": "0-23"
},
"original_topology_aware_assignments": {
"0": "0-23"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores",
"numa_binding": "true",
--
"original_topology_aware_assignments": {
"0": "0-19"
},
"init_timestamp": "2024-06-06 05:58:40.67405812 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"qosLevel": "reclaimed_cores",
"request_quantity": 42000
}
}
}
},
"1": {
"default_cpuset": "",
"allocated_cpuset": "24-47",
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "24-47",
"original_allocation_result": "24-47",
"topology_aware_assignments": {
"1": "24-47"
},
"original_topology_aware_assignments": {
"1": "24-47"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores",
"numa_binding": "true",
--
"1": "24-43"
},
"original_topology_aware_assignments": {
"1": "24-43"
},
"init_timestamp": "2024-06-06 05:58:40.67405812 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"qosLevel": "reclaimed_cores",
"request_quantity": 42000
}
}
}
}
},
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "0-47",
"original_allocation_result": "0-47",
"topology_aware_assignments": {
"0": "0-23",
"1": "24-47"
},
"original_topology_aware_assignments": {
"0": "0-23",
"1": "24-47"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
...
What happened?
I created a dedicated_cores pod, but it does not have exclusive CPUs.
dedicated_cores_pod.yaml:
Check the cpuset of the pod:
What did you expect to happen?
The dedicated_cores pod should be allocated exclusive CPU cores.
How can we reproduce it (as minimally and precisely as possible)?
Create a dedicated cores pod, like dedicated_cores_pod.yaml, as mentioned above.
Software version