kubernetes-retired / cluster-api-provider-nested

Cluster API Provider for Nested Clusters

Resource already exists and the UID is different should not requeue #293

Closed wondywang closed 1 year ago

wondywang commented 2 years ago

What would you like to be added? When DWS (downward sync) creates a resource in reconcileXXXCreate and finds that the delegated object already exists with a different UID, the request should not re-enter the queue, because the event will never be processed successfully by later retries either.

Currently:

    if errors.IsAlreadyExists(err) {
        if pPod.Annotations[constants.LabelUID] == requestUID {
            klog.Infof("pod %s/%s of cluster %s already exist in super master", targetNamespace, pPod.Name, clusterName)
            return nil
        } else {
            return fmt.Errorf("pPod %s/%s exists but the UID is different from tenant master.", targetNamespace, pPod.Name)
        }
    }

Expect:

    if errors.IsAlreadyExists(err) {
        if pPod.Annotations[constants.LabelUID] == requestUID {
            klog.Infof("pod %s/%s of cluster %s already exist in super master", targetNamespace, pPod.Name, clusterName)
            return nil
        } else {
            klog.Errorf("pPod %s/%s exists but the UID is different from tenant master.", targetNamespace, pPod.Name)
            return nil
        }
    }

Why is this needed? If these requests are not dequeued early, the failing tasks keep accumulating and retrying until they exceed the maximum retry count, MaxReconcileRetryAttempts, and only then are they discarded.
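
For context, a minimal sketch of the usual client-go workqueue retry loop this behavior plays out in, assuming the syncer follows the standard pattern; the constant value and the reconcile stub are illustrative stand-ins, not the syncer's actual code:

    package main

    import (
        "fmt"

        "k8s.io/client-go/util/workqueue"
    )

    // Illustrative stand-in; the syncer defines its own retry limit.
    const MaxReconcileRetryAttempts = 15

    // reconcile stands in for reconcileXXXCreate; here it always fails,
    // mimicking a permanent UID mismatch.
    func reconcile(key string) error {
        return fmt.Errorf("pPod %s exists but the UID is different", key)
    }

    func processNextItem(queue workqueue.RateLimitingInterface) bool {
        key, quit := queue.Get()
        if quit {
            return false
        }
        defer queue.Done(key)

        if err := reconcile(key.(string)); err != nil {
            if queue.NumRequeues(key) < MaxReconcileRetryAttempts {
                // Returning an error requeues the key, so a permanent
                // mismatch burns through every retry before being dropped.
                queue.AddRateLimited(key)
                return true
            }
            // Retry budget exhausted: only now is the key discarded.
            queue.Forget(key)
            return true
        }
        // Success, or an error deliberately swallowed (return nil):
        // the key leaves the queue immediately.
        queue.Forget(key)
        return true
    }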

/kind bug

wondywang commented 2 years ago

/assign @Fei-Guo /assign @christopherhein

Fei-Guo commented 2 years ago

The UID mismatch can happen in certain cases, such as the VC creating and deleting an object under the same name, or inconsistency caused by syncer restarts. The Patroller will fix the mismatch by removing the stale object in the super cluster. You can skip the requeue, since the problem should be fixed by the Patroller, which runs every minute. I implemented the retry for UID mismatches in DWS to target the case where the VC creates/deletes an object under the same name frequently; there the mismatch should be gone quickly.

wondywang commented 2 years ago

Thanks, Fei. Yes, the Patroller will fix the missing resource, but it also ignores the UID differences.

wondywang commented 2 years ago

This comes from a feature gate in Kubernetes 1.20. It auto-generates a ConfigMap named kube-root-ca.crt in every namespace, on both the tenant and the super cluster.

Fei-Guo commented 2 years ago

> Thanks, Fei. Yes, the Patroller will fix the missing resource, but it also ignores the UID differences.

We do clear the mismatch in the super cluster (though the way we handle Pods can be different): https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/6428e489de2c17ebac96c1f41c0b7a33e0f118b9/virtualcluster/pkg/syncer/resources/service/checker.go#L106
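
For readers who don't want to chase the link, here is a rough sketch of what such a checker-style cleanup does, assuming the tenant UID is recorded in an annotation; the annotation key and function name are stand-ins for the syncer's real constants, not the linked code itself:

    package main

    import (
        "context"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // labelUID stands in for constants.LabelUID in the syncer.
    const labelUID = "tenancy.x-k8s.io/uid"

    // checkServiceUID sketches a patroller-style pass: if the object in
    // the super cluster carries a different tenant UID than expected,
    // delete it so the next DWS pass can recreate it correctly.
    func checkServiceUID(ctx context.Context, client kubernetes.Interface,
        namespace, name, expectedUID string) error {

        pSvc, err := client.CoreV1().Services(namespace).Get(ctx, name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return nil // nothing stale to clean up
        }
        if err != nil {
            return err
        }
        if pSvc.Annotations[labelUID] == expectedUID {
            return nil // consistent; nothing to do
        }
        // Stale object (e.g. tenant deleted and recreated the same name):
        // remove it from the super cluster so DWS can recreate it.
        return client.CoreV1().Services(namespace).Delete(ctx, name, metav1.DeleteOptions{})
    }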

Fei-Guo commented 2 years ago

> kube-root-ca.crt

Now I understand the problem. Do you know how this ConfigMap is commonly used? If a user wants to use the ConfigMap from the VC namespace in a pod, we unfortunately have to rename it in the super cluster and change the pod spec. In the meantime, you can change the DWS to skip this specific ConfigMap to avoid the error.
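
A minimal sketch of that interim workaround, assuming the guard can sit at the top of the DWS ConfigMap reconcile path (the placement and function name are hypothetical):

    package main

    const rootCAConfigMapName = "kube-root-ca.crt"

    // shouldSkipConfigMap filters out the control-plane-managed root CA
    // ConfigMap so DWS never tries to create it in the super cluster,
    // avoiding the "already exists / UID mismatch" error entirely.
    func shouldSkipConfigMap(name string) bool {
        return name == rootCAConfigMapName
    }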

wondywang commented 2 years ago

> kube-root-ca.crt

As far as I know, the feature gate is RootCAConfigMap: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#service-account-automation.
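
For reference, this is roughly the shape of the projected service-account volume the control plane injects into pods, which is what makes pods reference kube-root-ca.crt. It is expressed here with client-go types rather than YAML for illustration; the volume name is generated at admission time and is a placeholder below:

    package main

    import corev1 "k8s.io/api/core/v1"

    // saTokenVolume sketches the auto-injected "kube-api-access-<hash>"
    // volume (downward-API namespace source elided for brevity).
    var saTokenVolume = corev1.Volume{
        Name: "kube-api-access-xxxxx", // generated name; placeholder here
        VolumeSource: corev1.VolumeSource{
            Projected: &corev1.ProjectedVolumeSource{
                Sources: []corev1.VolumeProjection{
                    {ServiceAccountToken: &corev1.ServiceAccountTokenProjection{
                        Path: "token",
                    }},
                    {ConfigMap: &corev1.ConfigMapProjection{
                        LocalObjectReference: corev1.LocalObjectReference{
                            Name: "kube-root-ca.crt",
                        },
                        Items: []corev1.KeyToPath{{Key: "ca.crt", Path: "ca.crt"}},
                    }},
                },
            },
        },
    }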

Thanks Fei, I've made a change, please check it out.

christopherhein commented 1 year ago

/reopen

@wondywang @Fei-Guo I'm going to reopen this issue: upstream removed the feature gate for this and locked it on in 1.21, so you can't disable it anymore. This means the syncer gets locked up with tenants or supers at 1.21+, and with default configs on 1.20.

k8s-ci-robot commented 1 year ago

@christopherhein: Reopened this issue.

christopherhein commented 1 year ago

I would like to introduce special handling for this, since it's no longer possible to disable the feature at all. What I am thinking, to keep support across the board: when DWS sees a kube-root-ca.crt ConfigMap, create the resource as tenant-kube-root-ca.crt instead; then, when syncing Pods, rewrite any env var or volume that references it to use tenant-kube-root-ca.crt; lastly, disable any checkers for kube-root-ca.crt in the super cluster to stop them from trying to delete this object.
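
A sketch of the pod-spec rewrite piece of that plan; the tenant-kube-root-ca.crt name comes from the comment above, while the function and its exact hook point in pod DWS are assumptions:

    package main

    import corev1 "k8s.io/api/core/v1"

    const (
        rootCAName       = "kube-root-ca.crt"
        tenantRootCAName = "tenant-kube-root-ca.crt"
    )

    // rewriteRootCARefs renames every reference to kube-root-ca.crt in a
    // pod spec so it points at the tenant-scoped copy instead. It covers
    // plain ConfigMap volumes, projected volume sources, env valueFrom
    // references, and envFrom blocks (init containers elided for brevity).
    func rewriteRootCARefs(pod *corev1.Pod) {
        for i := range pod.Spec.Volumes {
            v := &pod.Spec.Volumes[i]
            if v.ConfigMap != nil && v.ConfigMap.Name == rootCAName {
                v.ConfigMap.Name = tenantRootCAName
            }
            if v.Projected != nil {
                for j := range v.Projected.Sources {
                    if cm := v.Projected.Sources[j].ConfigMap; cm != nil && cm.Name == rootCAName {
                        cm.Name = tenantRootCAName
                    }
                }
            }
        }
        for i := range pod.Spec.Containers {
            c := &pod.Spec.Containers[i]
            for j := range c.Env {
                if ref := c.Env[j].ValueFrom; ref != nil && ref.ConfigMapKeyRef != nil &&
                    ref.ConfigMapKeyRef.Name == rootCAName {
                    ref.ConfigMapKeyRef.Name = tenantRootCAName
                }
            }
            for j := range c.EnvFrom {
                if ref := c.EnvFrom[j].ConfigMapRef; ref != nil && ref.Name == rootCAName {
                    ref.Name = tenantRootCAName
                }
            }
        }
    }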