kubernetes-retired / cluster-api-provider-nested

Cluster API Provider for Nested Clusters

Why is the Patroller needed? #208

Closed vincent-pli closed 2 years ago

vincent-pli commented 3 years ago

Both upward (UWController) and downward (MCController) synchronization are based on informers, so why do we still need a 60s polling Patroller to manually compare resources between the super and tenant clusters?

Is there some potential risk with informers? Thanks. @Fei-Guo @charleszheng44 @christopherhein

Fei-Guo commented 3 years ago

There are rare cases where tenant and super can be inconsistent.

For example, deleting pods requires kubelet to delete the Pod with graceperiod 0. In VC, it is the syncer's duty to finally delete the tenant Pods with graceperiod 0. If the super cluster Pod is deleted too quickly, the syncer may not be able to delete the tenant Pod (the object in super cluster is gone and syncer cannot find the tenant Pod through the label in super Pod). Another example is that, if tenant user runs a tight loop to update object with different spec A & B in turn, the state may also be inconsistent after the loop is done due to informer cache delay. Besides above, if the syncer or super cluster crashes during synchronization, the state can be inconsistent(e.g., the object in tenant is updated/deleted when syncer is offline, syncer needs to perform full scan when it is restarted to update/delete the stale objecs in super cluster). Indeed, we can resolve all those problems by enhancing the component restart code path or syncer code to even check the object revision number. Since the syncer needs to synchronize the states of two apiservers via two informer caches, no one can guarantee the complete coverage of all subtle timing issues. We choose a conservative periodic check to ease the error handling code in other code paths for simplicity. I am sure there are other design choices of handling these problems without periodic check.

Note that debugging an inconsistency problem in a production environment is extremely difficult and a nightmare for SREs, hence the periodic check is our last resort. We do have metrics/logs that record inconsistent objects for further investigation.
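To make the idea concrete, here is a minimal, illustrative Go sketch of what such a periodic check does conceptually: on a fixed period (the thread mentions 60s), list the synced objects from both clusters, compare them, and report anything that drifted, independent of whether an informer event was missed. This is not the actual Patroller code in this repo; the `Lister` interface, `Object` type, and `staticLister` helper are hypothetical stand-ins for the real informers/clients.

```go
// Illustrative sketch only, not the VirtualCluster Patroller implementation.
package main

import (
	"fmt"
	"time"
)

// Object is a simplified stand-in for a synced resource (e.g. a Pod).
type Object struct {
	Name string
	Spec string
}

// Lister abstracts "list everything in scope" for either cluster.
type Lister interface {
	List() map[string]Object
}

// patrolOnce compares the two views and reports objects that drifted apart,
// e.g. because an informer event was missed or the syncer was offline.
func patrolOnce(tenant, super Lister) []string {
	var inconsistent []string
	tObjs, sObjs := tenant.List(), super.List()

	for name, tObj := range tObjs {
		sObj, ok := sObjs[name]
		if !ok {
			// Exists in tenant but missing in super: a downward sync was lost.
			inconsistent = append(inconsistent, name+" (missing in super)")
			continue
		}
		if tObj.Spec != sObj.Spec {
			// Specs diverged, e.g. after a tight A/B update loop raced the cache.
			inconsistent = append(inconsistent, name+" (spec drift)")
		}
	}
	for name := range sObjs {
		if _, ok := tObjs[name]; !ok {
			// Exists only in super: a stale object left behind while the syncer was down.
			inconsistent = append(inconsistent, name+" (orphan in super)")
		}
	}
	return inconsistent
}

// runPatroller runs the check on a fixed period as a safety net on top of
// the informer-driven controllers.
func runPatroller(tenant, super Lister, period time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, msg := range patrolOnce(tenant, super) {
				// The real syncer surfaces this via metrics/logs and repairs it;
				// here we just print.
				fmt.Println("inconsistent:", msg)
			}
		case <-stop:
			return
		}
	}
}

// staticLister is a fixed in-memory view, used only for this demo.
type staticLister map[string]Object

func (s staticLister) List() map[string]Object { return map[string]Object(s) }

func main() {
	tenant := staticLister{"pod-a": {Name: "pod-a", Spec: "A"}}
	super := staticLister{"pod-a": {Name: "pod-a", Spec: "B"}, "pod-b": {Name: "pod-b", Spec: "A"}}
	for _, msg := range patrolOnce(tenant, super) {
		fmt.Println("inconsistent:", msg)
	}
	// runPatroller(tenant, super, 60*time.Second, make(chan struct{})) // would loop forever
}
```

The point of the sketch is the design trade-off described above: the informer-driven controllers stay simple because the periodic comparison catches whatever slips through the subtle timing windows.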

vincent-pli commented 3 years ago

Thanks @Fei-Guo, very useful. I will keep the issue open; I guess other people may have the same question.

christopherhein commented 3 years ago

@vincent-pli would you like to turn this into a PR with these notes written up in the virtualcluster README?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-nested/issues/208#issuecomment-1030712706):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.