kubewharf / katalyst-core

Katalyst aims to provide a universal solution to help improve resource utilization and optimize overall costs in the cloud. This repository contains the core components of the Katalyst system, including multiple agents and centralized components.
Apache License 2.0

[install error] katalyst-agent CrashLoopBackOff #566

Closed: googs1025 closed this issue 2 months ago

googs1025 commented 2 months ago

What happened?

root@VM-0-15-ubuntu:/home/ubuntu# kubectl get pods -nkatalyst-system
NAME                                   READY   STATUS             RESTARTS         AGE
katalyst-agent-4qx2t                   0/1     CrashLoopBackOff   10 (31s ago)     26m
katalyst-agent-jdl97                   0/1     CrashLoopBackOff   10 (22s ago)     26m
katalyst-agent-pwm7l                   0/1     Error              10 (5m11s ago)   26m
katalyst-controller-845ccf946b-ftxgx   1/1     Running            0                26m
katalyst-controller-845ccf946b-lm9bm   1/1     Running            0                26m
katalyst-metric-765c44bbb5-48ws6       1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-swgc4    1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-x2vct    1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-26c8g      1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-7fs78      1/1     Running            0                26m
root@VM-0-15-ubuntu:/home/ubuntu# kubectl logs katalyst-agent-4qx2t -nkatalyst-system
W0502 08:03:20.626350       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024/05/02 08:03:20 <nil>
I0502 08:03:20.626831       1 otel_prom_metrics_mux.go:94] [katalyst-core/pkg/metrics/metrics-pool.(*openTelemetryPrometheusMetricsEmitterPool).GetMetricsEmitter] add path /metrics to metric emitter
W0502 08:03:20.636464       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I0502 08:03:20.636778       1 network_linux.go:80] [katalyst-core/pkg/util/machine.GetExtraNetworkInfo] namespace list: []
W0502 08:03:20.637199       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: eth0 with devPath: /sys/devices/virtual/net/eth0 which isn't pci device
W0502 08:03:20.637248       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: kube-ipvs0 with devPath: /sys/devices/virtual/net/kube-ipvs0 which isn't pci device
W0502 08:03:20.637281       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: lo with devPath: /sys/devices/virtual/net/lo which isn't pci device
W0502 08:03:20.637311       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth064d18ee with devPath: /sys/devices/virtual/net/veth064d18ee which isn't pci device
W0502 08:03:20.637339       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth06d57915 with devPath: /sys/devices/virtual/net/veth06d57915 which isn't pci device
W0502 08:03:20.637365       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth5290716c with devPath: /sys/devices/virtual/net/veth5290716c which isn't pci device
W0502 08:03:20.637396       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth6f37d282 with devPath: /sys/devices/virtual/net/veth6f37d282 which isn't pci device
W0502 08:03:20.637428       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth87922afb with devPath: /sys/devices/virtual/net/veth87922afb which isn't pci device
W0502 08:03:20.637457       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth8dccdf2e with devPath: /sys/devices/virtual/net/veth8dccdf2e which isn't pci device
I0502 08:03:20.638040       1 file.go:239] [GetUniqueLock] get lock successfully
I0502 08:03:20.638069       1 agent.go:85] initializing "katalyst-agent-reporter"
W0502 08:03:20.638121       1 manager.go:400] failed to retrieve checkpoint for "reporter_manager_checkpoint": checkpoint is not found
I0502 08:03:20.638136       1 manager.go:258] registered plugin name system-reporter-plugin
I0502 08:03:20.638153       1 manager.go:239] plugin system-reporter-plugin run success
I0502 08:03:20.638171       1 manager.go:258] registered plugin name kubelet-reporter-plugin
I0502 08:03:20.638210       1 util_unix.go:104] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
F0502 08:03:20.638341       1 kubeletplugin.go:110] run topology status adapter failed

What did you expect to happen?

All pods start normally

How can we reproduce it (as minimally and precisely as possible)?

None

Software version

Environment:
Kubernetes version (use kubectl version): 1.28
OS version: Ubuntu 22.04
Kernel version:
Cgroup driver: cgroupfs/systemd

googs1025 commented 2 months ago

/kind bug

luomingmeng commented 2 months ago

An error probably occurred while running the topology status adapter. We have added the underlying error details to the fatal log message in https://github.com/kubewharf/katalyst-core/pull/573

googs1025 commented 2 months ago

It has been solved now. If there are still problems, I will reopen it.