Closed: vincent-pli closed this issue 3 years ago.
@Fei-Guo @christopherhein @charleszheng44
This is a known issue and is mentioned in the project README.
The syncer controller manages the lifecycle of the node objects in the tenant control plane, but it does not update the node lease objects, in order to reduce network traffic. As a result, it is recommended to increase the tenant control plane node controller's `--node-monitor-grace-period` parameter to a larger value (>60 seconds; this is already done in the sample clusterversion yaml).
We have applied this workaround in the clusterversion yaml, but not in the CAPI implementation.
@vincent-pli Are you using the CAPN provisioner? Can you double-check the setting of `node-monitor-grace-period`?
In https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/controlplane/nested/component-templates/nested-controllermanager/nested-controllermanager-statefulset-template.yaml, it is set to 200s.
@Fei-Guo Actually I followed the demo document, and my `node-monitor-grace-period` is 200s:
```yaml
- --bind-address=0.0.0.0
- --cluster-cidr=10.200.0.0/16
- --cluster-signing-cert-file=/etc/kubernetes/pki/root/tls.crt
- --cluster-signing-key-file=/etc/kubernetes/pki/root/tls.key
- --kubeconfig=/etc/kubernetes/kubeconfig/controller-manager-kubeconfig
- --authorization-kubeconfig=/etc/kubernetes/kubeconfig/controller-manager-kubeconfig
- --authentication-kubeconfig=/etc/kubernetes/kubeconfig/controller-manager-kubeconfig
- --leader-elect=false
- --root-ca-file=/etc/kubernetes/pki/root/tls.crt
- --service-account-private-key-file=/etc/kubernetes/pki/service-account/tls.key
- --service-cluster-ip-range=10.32.0.0/24
- --use-service-account-credentials=true
- --experimental-cluster-signing-duration=87600h
- --node-monitor-grace-period=200s
- --v=2
```
BTW, I haven't dug too much into the code, but why do we still need the `nodelifecycle` controller in the tenant cluster? The node is not a real one.
Can you double-check whether the vNode in the tenant control plane gets its heartbeat updated every minute? If not, the syncer has a problem. If the heartbeat is updated, maybe 1.20 changed the node lifecycle controller's behavior. My testbed is still on 1.18, and I will take a look at 1.20.
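To make the timing concrete, here is a minimal Python sketch (not syncer code; the timestamp format mirrors the `conditions` blocks pasted below) of the decision the node lifecycle controller effectively makes: if `lastHeartbeatTime` is older than `--node-monitor-grace-period`, the node gets marked `NotReady`.

```python
from datetime import datetime, timedelta, timezone

def is_heartbeat_stale(last_heartbeat: str, grace_period_s: int, now: datetime) -> bool:
    """Return True if the node lifecycle controller would consider the
    heartbeat expired, i.e. older than --node-monitor-grace-period."""
    beat = datetime.strptime(last_heartbeat, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return now - beat > timedelta(seconds=grace_period_s)

now = datetime(2021, 7, 3, 5, 30, 0, tzinfo=timezone.utc)
# A heartbeat from ~74s ago is stale under the kube-controller-manager
# default grace period of 40s, but fine under the 200s set in the nested
# controller-manager template.
print(is_heartbeat_stale("2021-07-03T05:28:46Z", 40, now))   # True
print(is_heartbeat_stale("2021-07-03T05:28:46Z", 200, now))  # False
```

Since the syncer refreshes the vNode heartbeat roughly every minute (rather than via node leases), any grace period comfortably above 60s should avoid spurious `NotReady` transitions.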
We don't need the `nodelifecycle` controller; we just don't have a simple way to disable it. One hacky way is to delete the node-controller serviceaccount from the kube-system namespace, which is not recommended.
I think the `syncer` works, since `NotReady` is corrected back to `Ready` after a while; I think that's the `syncer` doing its job.
On disabling `nodelifecycle`, I'm a little confused: why not use a flag like this?
`--controllers=*,-nodelifecycle`
Also, I'm not sure how to check the heartbeat. Is it in the log of `kube-controller-manager` in the tenant cluster?
@Fei-Guo
Good call. We can try `--controllers=*,-nodelifecycle`.
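For reference, the flag would slot into the nested controller-manager args like this (a sketch against the StatefulSet template linked above; in kube-controller-manager's `--controllers` syntax, `*` enables all default controllers and a `-` prefix opts one out):

```yaml
# kube-controller-manager args (fragment): run all default controllers
# except nodelifecycle, and keep the enlarged grace period as a fallback.
- --controllers=*,-nodelifecycle
- --node-monitor-grace-period=200s
```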
The vNode status should have the heartbeat information:
```yaml
conditions:
- lastHeartbeatTime: "2021-07-03T05:28:46Z"
  lastTransitionTime: "2020-12-16T23:03:59Z"
  message: kubelet is posting ready status
  reason: KubeletReady
  status: "True"
  type: Ready
```
Thanks @Fei-Guo. The heartbeat is good:
```yaml
- lastHeartbeatTime: "2021-07-03T06:53:43Z"
  lastTransitionTime: "2021-07-01T00:17:55Z"
  message: kubelet is posting ready status
  reason: KubeletReady
  status: "True"
  type: Ready
```
**What steps did you take and what happened:**
Created a `deploy` in the tenant cluster and the `pod` works as expected, but I get a message regularly (every 5 min) in the `syncer`'s log. Did some research; it seems:

The `nodelifecycle` controller of `kube-controller-manager` in the tenant cluster cannot contact the virtual node, so it sets the status of the node to `NotReady`, adds a taint to the node, and the `eviction manager` tries to evict the pods on that virtual node, setting the `condition` of the node to `status: false`.

Then the `pod syncer` diffs the status, finds the difference, raises the message, and syncs the status of the pod. After a while the `node syncer` finds the node difference, syncs the node status back to `Ready`, and the process repeats regularly.

So I tried disabling `nodelifecycle` in the `kube-controller-manager` of the tenant cluster, and then everything works as expected. So, could we disable `nodelifecycle`?

/kind bug
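The flapping cycle described above can be sketched as a toy reconcile step (illustrative only; `reconcile_vnode` and its parameters are made-up names, not syncer code): the tenant's `nodelifecycle` controller flips the vNode to `NotReady`, and the node syncer later diffs against the super cluster's view of the real node and flips it back, logging the difference each time.

```python
def reconcile_vnode(tenant_ready: bool, super_ready: bool) -> tuple[bool, bool]:
    """Toy node-syncer step: if the tenant and super cluster disagree on the
    node's Ready condition, overwrite the tenant copy with the super cluster's
    view and report that a diff was seen (the recurring syncer log message)."""
    diff_seen = tenant_ready != super_ready
    return super_ready, diff_seen

# Tenant nodelifecycle controller marked the vNode NotReady; the real node
# (super cluster view) is still Ready, so the syncer corrects it back.
print(reconcile_vnode(tenant_ready=False, super_ready=True))  # (True, True)
# Once corrected, the next pass sees no diff -- until the next grace-period expiry.
print(reconcile_vnode(tenant_ready=True, super_ready=True))   # (True, False)
```

Disabling `nodelifecycle` in the tenant cluster removes the first half of this cycle entirely, which is why the recurring message disappears.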