[BUG] node lifecycle controller in yurt-manager can not update status of node

crazytaxii commented 9 months ago

What happened: A node stays in Ready status after the kubelet on it is stopped, and even after the node itself is shut down. As a result, the Pods on that node cannot be migrated to other nodes.

What you expected to happen: The abnormal node should be updated to NotReady status.

How to reproduce it (as minimally and precisely as possible): Stop the kubelet on a node.

Anything else we need to know?: Error log in yurt-manager's node lifecycle controller:

E0126 07:43:15.444074       1 node_lifecycle_controller.go:975] "Error updating node" err="nodes \"edge\" is forbidden: User \"system:serviceaccount:kube-system:yurt-manager\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope" node="edge"
E0126 07:43:15.452574       1 node_lifecycle_controller.go:715] "Update health of Node from Controller error, Skipping - no pods will be evicted" err="timed out waiting for the condition" node="edge"

nodes/status is a subresource, so permission on nodes does not cover it. It must also be added to the ClusterRole of yurt-manager.
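A minimal sketch of the rule that would address the error above, assuming it is added to the rules list of the yurt-manager ClusterRole. The verbs mirror the nodes/status rule of kube-controller-manager; the exact change that landed in OpenYurt may differ.

```yaml
# Sketch only: grants the verbs named in the "forbidden" error on the
# nodes/status subresource. Verbs are copied from the upstream
# system:controller:node-controller ClusterRole, not from the OpenYurt fix.
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
```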

Environment:

OpenYurt version: v1.4
Kubernetes version (use kubectl version): v1.27.2

/kind bug

rambohe-ch commented 9 months ago

@crazytaxii Thanks for raising this issue. It seems that the RBAC settings for the node lifecycle controller were missed. Would you like to make a pull request to fix it?

crazytaxii commented 9 months ago

/assign @crazytaxii

crazytaxii commented 9 months ago

It has been fixed in #1884.

crazytaxii commented 9 months ago

The entire system:controller:node-controller ClusterRole for kube-controller-manager in Kubernetes cluster v1.27.2 is:

# ...
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - delete
  - get
  - list
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - delete
  - list
- apiGroups:
  - networking.k8s.io
  resources:
  - clustercidrs
  verbs:
  - create
  - get
  - list
  - update
- apiGroups:
  - ""
  - events.k8s.io
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get

Compared with the ClusterRole of yurt-manager (v1.4):

# ...
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
#  - delete # missing one
  - get
  - list
  - patch
  - update
  - watch # extra one
# - apiGroups: # missing one
#  - ""
#  resources:
#  - nodes/status
#  verbs:
#  - patch
#  - update
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create # extra one
  - delete
  - get
  - list
  - patch # extra one
  - update # extra one
  - watch # extra one
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
#  - patch # missing one
  - update
# - apiGroups: # missing one
#  - networking.k8s.io
#  resources:
#  - clustercidrs
#  verbs:
#  - create
#  - get
#  - list
#  - update
# - apiGroups: # missing one
#  - ""
#  - events.k8s.io
#  resources:
#  - events
#  verbs:
#  - create
#  - patch
#  - update
# ...

However, the node lifecycle controller in yurt-manager differs considerably from the one in kube-controller-manager v1.27.2, so the permissions it needs may not be identical.

rambohe-ch commented 9 months ago

@crazytaxii Except for the networking.k8s.io/clustercidrs resource, the other missing RBAC settings should be added to yurt-manager. networking.k8s.io/clustercidrs is used by the node IPAM controller in kube-controller-manager and is not needed by the node lifecycle controller.
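Based on the comparison above, the additions to the yurt-manager ClusterRole could look roughly like the following. This is a sketch derived from the entries marked "missing one", with clustercidrs left out as suggested; it is not the actual patch, and the existing rules (including the extra verbs on pods) are assumed to stay as they are.

```yaml
# Sketch only: rules missing from yurt-manager (v1.4) relative to
# system:controller:node-controller in Kubernetes v1.27.2.
# clustercidrs is intentionally omitted (node IPAM controller only).
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - delete        # missing verb; get/list/patch/update/watch already exist
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch         # missing verb; update already exists
- apiGroups:
  - ""
  - events.k8s.io
  resources:
  - events
  verbs:
  - create
  - patch
  - update
```

RBAC rules are additive, so listing only the missing verbs in separate rules has the same effect as extending the existing rules for the same resources.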

openyurtio / openyurt

[BUG] node lifecycle controller in yurt-manager can not update status of node #1934