kubewharf / katalyst-core

Katalyst aims to provide a universal solution to help improve resource utilization and optimize overall costs in the cloud. This repository contains the core components of the Katalyst system, including multiple agents and centralized components.
Apache License 2.0

Katalyst-colocation-orm can be installed on enhanced-k8s cluster but katalyst-colocation cannot be installed #617

Open ozline opened 3 weeks ago

ozline commented 3 weeks ago

What happened?

I followed the "Colocate your application using Katalyst" guide to install Katalyst.

The guide says to install katalyst-colocation if you use Kubewharf enhanced Kubernetes, and to install katalyst-colocation-orm if you use vanilla Kubernetes.

I set up my nodes by following the "Install Kubewharf enhanced-k8s" guide, but only katalyst-colocation-orm installs successfully; katalyst-colocation does not.

When I install katalyst-colocation, katalyst-colocation-agent reports the following error:

I0610 13:10:27.641756       1 state_checkpoint.go:121] "[cpu_plugin] State checkpoint: restored state from checkpoint"
I0610 13:10:27.641777       1 util.go:68] [katalyst-core/pkg/agent/qrm-plugins/cpu/util.GetCoresReservedForSystem] get reservedQuantityInt: 0 from ReservedCPUCores configuration
I0610 13:10:27.641787       1 util.go:77] [katalyst-core/pkg/agent/qrm-plugins/cpu/util.GetCoresReservedForSystem] take reservedCPUs:  by reservedCPUsNum: 0
I0610 13:10:27.641832       1 policy.go:950] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).cleanPools] there is no pool to delete
I0610 13:10:27.641842       1 policy.go:964] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).initReservePool] initReservePool reserve:
I0610 13:10:27.641859       1 state_mem.go:109] "[cpu_plugin] updated cpu plugin pod entries" podUID="reserve" containerName="" allocationInfo="{\"pod_uid\":\"reserve\",\"owner_pool_name\":\"reserve\",\"allocation_result\":\"\",\"original_allocation_result\":\"\",\"topology_aware_assignments\":{},\"original_topology_aware_assignments\":{},\"init_timestamp\":\"\",\"labels\":null,\"annotations\":null,\"qosLevel\":\"\"}"
I0610 13:10:27.644274       1 policy.go:1039] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).initReclaimPool] exist initial reclaim: 0-9
I0610 13:10:27.644300       1 agent.go:102] needToRun "qrm_cpu_plugin"
I0610 13:10:27.644308       1 agent.go:91] initializing "qrm_io_plugin"
I0610 13:10:27.644320       1 agent.go:102] needToRun "qrm_io_plugin"
I0610 13:10:27.644325       1 agent.go:91] initializing "qrm_network_plugin"
W0610 13:10:27.644335       1 util.go:122] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.filterNICsByAvailability] nic: eno1 doesn't have IP address
I0610 13:10:27.644344       1 util.go:302] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.getReservedBandwidth] reservedBanwidth: 0, nicCount: 1, policy: first,
I0610 13:10:27.644361       1 state_net.go:47] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.NewNetworkPluginState] initializing new network plugin in-memory state store"
I0610 13:10:27.644372       1 util.go:37] [GenerateMachineState: katalyst-core/pkg/agent/qrm-plugins/network/state.GenerateMachineState] NIC wlp2s0's speed: -1, capacity: [0/0], reservation: 0
I0610 13:10:27.644511       1 util.go:37] [GenerateMachineState: katalyst-core/pkg/agent/qrm-plugins/network/state.GenerateMachineState] NIC wlp2s0's speed: -1, capacity: [0/0], reservation: 0
I0610 13:10:27.644531       1 state_net.go:121] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*networkPluginState).SetMachineState] updated network plugin machine state" NICMap="{\"wlp2s0\":{\"egress_state\":{\"Capacity\":0,\"SysReservation\":0,\"Reservation\":0,\"Allocatable\":0,\"Allocated\":0,\"Free\":0},\"ingress_state\":{\"Capacity\":0,\"SysReservation\":0,\"Reservation\":0,\"Allocatable\":0,\"Allocated\":0,\"Free\":0},\"pod_entries\":{}}}"
I0610 13:10:27.644543       1 state_net.go:145] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*networkPluginState).SetPodEntries] updated network plugin pod resource entries" podEntries="{}"
I0610 13:10:27.644555       1 state_checkpoint.go:136] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*stateCheckpoint).restoreState] state checkpoint: restored state from checkpoint"
I0610 13:10:27.644572       1 policy.go:177] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.(*StaticPolicy).ApplyConfig] apply configs, qosLevelToNetClassMap: map[dedicated_cores:0 reclaimed_cores:0 shared_cores:0 system_cores:0], podLevelNetClassAnnoKey: katalyst.kubewharf.io/net_class_id, podLevelNetAttributesAnnoKeys: []
I0610 13:10:27.644581       1 agent.go:102] needToRun "qrm_network_plugin"
I0610 13:10:27.644588       1 agent.go:91] initializing "periodical-handler-manager"
I0610 13:10:27.644593       1 agent.go:102] needToRun "periodical-handler-manager"
I0610 13:10:27.644600       1 agent.go:91] initializing "katalyst-agent-orm"
I0610 13:10:27.644631       1 manager.go:86] "Creating topology manager with policy per scope" topologyPolicyName=""
E0610 13:10:27.644640       1 manager.go:129] unknown policy: ""
E0610 13:10:27.644647       1 agent.go:94] Error initializing "katalyst-agent-orm"
I0610 13:10:27.644662       1 file.go:257] [GetUniqueLock] release lock successfully
I0610 13:10:28.396105       1 file.go:90] fsNotify watcher notify "/var/lib/kubelet/resource-plugins/kubelet_qrm_checkpoint": CREATE
I0610 13:10:28.396155       1 topology_adapter.go:281] qrm state file changed, notify to update topology status
I0610 13:10:28.396166       1 kubeletplugin.go:177] send topology change notification to plugin kubelet-reporter-plugin
run command error: failed to init ORM: unknown policy: ""
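
For anyone triaging a similar agent log, the two lines that matter above are the ones with severity `E`. The following is a minimal Python sketch, not part of katalyst-core, that pulls error-level lines out of klog-formatted output. It assumes the standard klog header layout (severity letter, MMDD date, time with microseconds, thread id, `file:line]`); the names `KLOG_HEADER`, `error_lines`, and `sample` are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Assumed klog header: <severity><MMDD> <HH:MM:SS.uuuuuu> <thread id> <file:line>] <message>
KLOG_HEADER = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def error_lines(log_text):
    """Return (source, message) pairs for klog lines with severity E or F."""
    found = []
    for line in log_text.splitlines():
        match = KLOG_HEADER.match(line)
        # Lines without a klog header (e.g. "run command error: ...") are skipped.
        if match and match.group("sev") in ("E", "F"):
            found.append((match.group("src"), match.group("msg")))
    return found


# Four lines taken verbatim from the agent log in this issue.
sample = '''I0610 13:10:27.644631       1 manager.go:86] "Creating topology manager with policy per scope" topologyPolicyName=""
E0610 13:10:27.644640       1 manager.go:129] unknown policy: ""
E0610 13:10:27.644647       1 agent.go:94] Error initializing "katalyst-agent-orm"
I0610 13:10:27.644662       1 file.go:257] [GetUniqueLock] release lock successfully
'''

for source, message in error_lines(sample):
    print(source, message)
```

On the sample, this keeps only the `manager.go:129` and `agent.go:94` entries, which point at the empty topology policy name passed to the ORM component.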

Only the katalyst-agent pods are failing; all other components are running:

root@debian-node-1:~# kubectl get pods -n katalyst-system
NAME                                                       READY   STATUS             RESTARTS      AGE
katalyst-colocation-katalyst-agent-f5glx                   0/1     CrashLoopBackOff   4 (36s ago)   2m32s
katalyst-colocation-katalyst-agent-jzgft                   0/1     CrashLoopBackOff   4 (52s ago)   2m32s
katalyst-colocation-katalyst-controller-59b5c89cd6-jcn9m   1/1     Running            0             2m32s
katalyst-colocation-katalyst-controller-59b5c89cd6-vpjvq   1/1     Running            0             2m32s
katalyst-colocation-katalyst-metric-85c47ff4bf-nl9sf       1/1     Running            0             2m32s
katalyst-colocation-katalyst-scheduler-77cdd9d66f-8mszz    1/1     Running            0             2m32s
katalyst-colocation-katalyst-scheduler-77cdd9d66f-c27qc    1/1     Running            0             2m32s
katalyst-colocation-katalyst-webhook-5f6ccc7cb-ngz2x       1/1     Running            0             2m32s
katalyst-colocation-katalyst-webhook-5f6ccc7cb-vrnzs       1/1     Running            0             2m32s
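
The same check can be scripted when there are many pods. This is a minimal Python sketch, assuming the default `kubectl get pods` column order shown above (NAME, READY, STATUS first); the function name `unhealthy_pods` and the variable `table` are hypothetical, and the rows are a subset copied from the listing above.

```python
def unhealthy_pods(table_text):
    """Return (name, status) for pods that are not Running or not fully ready."""
    rows = table_text.strip().splitlines()[1:]  # skip the header row
    bad = []
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        # Only the first three columns are used; RESTARTS may contain spaces.
        name, ready, status = fields[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            bad.append((name, status))
    return bad


# Subset of the pod listing from this issue.
table = """NAME                                                       READY   STATUS             RESTARTS      AGE
katalyst-colocation-katalyst-agent-f5glx                   0/1     CrashLoopBackOff   4 (36s ago)   2m32s
katalyst-colocation-katalyst-agent-jzgft                   0/1     CrashLoopBackOff   4 (52s ago)   2m32s
katalyst-colocation-katalyst-controller-59b5c89cd6-jcn9m   1/1     Running            0             2m32s
"""

for name, status in unhealthy_pods(table):
    print(name, status)
```

For these rows, only the two katalyst-agent pods are reported.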

However, installing katalyst-colocation-orm on Kubewharf enhanced Kubernetes works fine (the agent pods reach Running status).

What did you expect to happen?

Installing katalyst-colocation on Kubewharf enhanced Kubernetes should work.

How can we reproduce it (as minimally and precisely as possible)?

Install katalyst-colocation with Helm after installing Kubewharf enhanced Kubernetes:

helm install katalyst-colocation -n katalyst-system --create-namespace kubewharf/katalyst-colocation

Software version

No response

pendoragon commented 3 weeks ago

I think there is an issue with the katalyst-colocation Helm chart: it enables ORM by default. We will have to fix it.

BTW, installing Kubewharf enhanced Kubernetes is error-prone, and the installation guide is not general enough to cover every scenario. So, if possible, I would recommend trying katalyst-colocation-orm on vanilla Kubernetes.