Closed vincent-pli closed 2 years ago
There are rare cases where the tenant and super clusters can become inconsistent.

For example, deleting Pods requires the kubelet to delete the Pod with grace period 0. In VC, it is the syncer's duty to finally delete the tenant Pod with grace period 0. If the super cluster Pod is deleted too quickly, the syncer may not be able to delete the tenant Pod: the object in the super cluster is already gone, so the syncer cannot find the tenant Pod through the label on the super Pod. Another example: if a tenant user runs a tight loop that alternately updates an object with two different specs A and B, the state may be inconsistent after the loop finishes because of informer cache delay. Beyond that, if the syncer or the super cluster crashes during synchronization, the state can also become inconsistent (e.g., an object in the tenant cluster is updated or deleted while the syncer is offline, so the syncer has to perform a full scan on restart to update or delete the stale objects in the super cluster).

Indeed, we could resolve all of these problems by hardening the component restart code paths, or by making the syncer check object revision numbers. But since the syncer synchronizes the states of two apiservers through two informer caches, no one can guarantee complete coverage of all the subtle timing issues. We chose a conservative periodic check to keep the error handling in the other code paths simple. There are certainly other design choices for handling these problems without a periodic check.

Note that debugging such inconsistencies in a production environment is extremely difficult and a nightmare for SREs, so the periodic check is our last resort. We do have metrics/logs that record inconsistent objects for further investigation.
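To make the idea of the periodic check concrete, here is a minimal, hypothetical Go sketch of a patrol loop that compares the two views and reports drift. This is not the actual VirtualCluster Patroller; the `object`, `cache`, and `patrol` names are made up, and a real implementation would walk the informer listers of both apiservers and enqueue repair work instead of printing.

```go
package main

import (
	"fmt"
	"time"
)

// object is a stand-in for a synced API object (e.g. a Pod); only the
// fields needed to illustrate the consistency check are kept.
type object struct {
	Name string
	Spec string
}

// cache is a stand-in for one cluster's informer cache (lister).
type cache map[string]object

// patrol compares the tenant view against the super-cluster view and
// reports objects that are missing, stale, or orphaned on either side.
func patrol(tenant, super cache) {
	for name, tObj := range tenant {
		sObj, ok := super[name]
		switch {
		case !ok:
			fmt.Printf("missing in super: %s (requeue downward sync)\n", name)
		case sObj.Spec != tObj.Spec:
			fmt.Printf("spec drift for %s: tenant=%q super=%q\n", name, tObj.Spec, sObj.Spec)
		}
	}
	for name := range super {
		if _, ok := tenant[name]; !ok {
			fmt.Printf("orphan in super: %s (delete or requeue upward sync)\n", name)
		}
	}
}

func main() {
	tenant := cache{"pod-a": {Name: "pod-a", Spec: "B"}, "pod-b": {Name: "pod-b", Spec: "A"}}
	super := cache{"pod-a": {Name: "pod-a", Spec: "A"}, "pod-c": {Name: "pod-c", Spec: "A"}}

	// Run once immediately, then on a fixed 60s period, mirroring the
	// Patroller's polling interval mentioned in the question.
	patrol(tenant, super)
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		patrol(tenant, super)
	}
}
```

The key property is that the check is level-triggered: it looks at the current full state of both caches rather than at individual events, so a missed or misordered event is eventually caught on the next pass.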
Thanks @Fei-Guo, very useful. I will keep the issue open; I guess other people may have the same question.
@vincent-pli would you like to turn this into a PR with these notes written up in the virtualcluster readme?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue.
No matter whether it is upward (UWController) or downward (MCController) synchronization, both are based on informers. So why do we still need a 60s polling Patroller that manually compares resources between the super and tenant clusters? Is there some potential risk with the informer? Thanks. @Fei-Guo @charleszheng44 @christopherhein
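The answer above boils down to edge-triggered versus level-triggered reconciliation: informer-driven handlers only act on events they actually receive, and an event missed because of a syncer crash, cache lag, or an object that is already gone is never replayed, whereas the periodic Patroller compares the full current states and eventually repairs any drift. The hypothetical sketch below (not VirtualCluster code; the names are made up) contrasts the two behaviors for a delete that was missed while the syncer was offline.

```go
package main

import "fmt"

// deleteEvent says that an object should be removed from the super cluster.
type deleteEvent struct{ name string }

func main() {
	// Super-cluster state as the restarted syncer finds it, after the
	// tenant user already deleted pod-a.
	super := map[string]bool{"pod-a": true, "pod-b": true}
	// Desired (tenant) state: pod-a is gone.
	tenant := map[string]bool{"pod-b": true}

	// Edge-triggered path: the delete event fired while the syncer was
	// offline, so no handler ever runs and pod-a stays in super.
	var delivered []deleteEvent // empty: the event was missed
	for _, ev := range delivered {
		delete(super, ev.name)
	}
	fmt.Println("after event handlers:", super) // pod-a is still there

	// Level-triggered path: the periodic patrol compares full states and
	// removes anything the tenant no longer wants, regardless of which
	// events were or were not delivered.
	for name := range super {
		if !tenant[name] {
			fmt.Printf("patrol: deleting stale %s from super cluster\n", name)
			delete(super, name)
		}
	}
	fmt.Println("after patrol:", super)
}
```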