kubernetes-sigs / cluster-api-operator

Home for Cluster API Operator, a subproject of sig-cluster-lifecycle
https://cluster-api-operator.sigs.k8s.io
Apache License 2.0

Automatic deletion of all providers #562

Open dtzar opened 3 months ago

dtzar commented 3 months ago

What steps did you take and what happened: I install the helm chart using ArgoCD; the install starts, and then for some unknown reason capi-operator-system deletes everything.

What did you expect to happen: capi-operator-system doesn't initiate deletion

I0705 19:26:02.047065       1 phases.go:501] "Installing provider" controller="infrastructureprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="InfrastructureProvider" InfrastructureProvider="azure-infrastructure-system/azure" namespace="azure-infrastructure-system" name="azure" reconcileID="f522a400-3574-49f2-a2f9-f5cd647d1353"
I0705 19:26:08.481831       1 genericprovider_controller.go:62] "Reconciling provider" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="55d807bc-19a8-4fd9-a894-9586199f2dda"
I0705 19:26:08.482636       1 genericprovider_controller.go:190] "Deleting provider resources" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="55d807bc-19a8-4fd9-a894-9586199f2dda"
I0705 19:26:08.482750       1 phases.go:547] "Deleting provider" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="55d807bc-19a8-4fd9-a894-9586199f2dda"
E0705 19:26:09.122056       1 controller.go:329] "Reconciler error" err="failed to patch BootstrapProvider capi-kubeadm-bootstrap-system/kubeadm: bootstrapproviders.operator.cluster.x-k8s.io \"kubeadm\" not found" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="55d807bc-19a8-4fd9-a894-9586199f2dda"
I0705 19:26:09.122162       1 genericprovider_controller.go:62] "Reconciling provider" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="75d274a9-98dd-448b-9c9e-4f2479e1b029"
I0705 19:26:09.128137       1 genericprovider_controller.go:62] "Reconciling provider" controller="bootstrapprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="BootstrapProvider" BootstrapProvider="capi-kubeadm-bootstrap-system/kubeadm" namespace="capi-kubeadm-bootstrap-system" name="kubeadm" reconcileID="8d54cb3f-d8c6-4d4e-a649-64077a1aba06"
I0705 19:26:10.401370       1 genericprovider_controller.go:62] "Reconciling provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="36eb8460-b8a2-4558-b378-14522baecb48"
I0705 19:26:10.402428       1 genericprovider_controller.go:190] "Deleting provider resources" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="36eb8460-b8a2-4558-b378-14522baecb48"
I0705 19:26:10.402484       1 phases.go:547] "Deleting provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="36eb8460-b8a2-4558-b378-14522baecb48"
I0705 19:26:10.845483       1 genericprovider_controller.go:62] "Reconciling provider" controller="addonprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="AddonProvider" AddonProvider="helm-addon-system/helm" namespace="helm-addon-system" name="helm" reconcileID="bab88566-fb6a-4672-bd9a-beedab06d9e1"
I0705 19:26:10.845971       1 genericprovider_controller.go:190] "Deleting provider resources" controller="addonprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="AddonProvider" AddonProvider="helm-addon-system/helm" namespace="helm-addon-system" name="helm" reconcileID="bab88566-fb6a-4672-bd9a-beedab06d9e1"
I0705 19:26:10.846008       1 phases.go:547] "Deleting provider" controller="addonprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="AddonProvider" AddonProvider="helm-addon-system/helm" namespace="helm-addon-system" name="helm" reconcileID="bab88566-fb6a-4672-bd9a-beedab06d9e1"
E0705 19:26:11.045291       1 controller.go:329] "Reconciler error" err="failed to patch ControlPlaneProvider capi-kubeadm-control-plane-system/kubeadm: controlplaneproviders.operator.cluster.x-k8s.io \"kubeadm\" not found" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="36eb8460-b8a2-4558-b378-14522baecb48"
I0705 19:26:11.045364       1 genericprovider_controller.go:62] "Reconciling provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="22fad70b-6283-4c75-a579-bc27750c5d47"
I0705 19:26:11.053119       1 genericprovider_controller.go:62] "Reconciling provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="48e498c2-4bba-4fa2-a9ac-2732d93e5144"
I0705 19:26:11.197563       1 genericprovider_controller.go:62] "Reconciling provider" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="9c377523-7cb2-4890-8a98-ee8fbd842ec7"
I0705 19:26:11.197803       1 genericprovider_controller.go:190] "Deleting provider resources" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="9c377523-7cb2-4890-8a98-ee8fbd842ec7"
I0705 19:26:11.197824       1 phases.go:547] "Deleting provider" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="9c377523-7cb2-4890-8a98-ee8fbd842ec7"

Environment:

/kind bug

k8s-ci-robot commented 3 months ago

This issue is currently awaiting triage.

If CAPI Operator contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
guettli commented 3 months ago

We are seeing something that could be related. The controlplaneproviders CRD got deleted and has now been stuck in that state for hours.

We have no clue why this happened.

@dtzar were you able to solve that?

guettli commented 3 months ago

In our case ArgoCD was OutOfSync because the ca-bundle was injected.
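
(For context: a commonly used mitigation for that particular OutOfSync source, sketched here rather than taken from this thread, is an ignoreDifferences rule so ArgoCD ignores the caBundle field that cert-manager's CA injector rewrites on webhook configurations. The Application name is illustrative.)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: capi-operator   # illustrative name
  namespace: argocd
spec:
  # source/destination omitted
  ignoreDifferences:
    # Ignore the CA bundle that cert-manager injects after apply, so the
    # live object never diffs against the rendered manifests.
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle
    - group: admissionregistration.k8s.io
      kind: MutatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle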

dtzar commented 3 months ago

We install cert-manager separately and have a sleep hook that checks it is available before the install. We've then tried various ways of doing the install (raw manifest YAML and the helm chart), to no avail. The most painful part is that the exact same configuration will work one time and fail the next; some kind of race condition puts things into this state. I've lost many painful hours to this problem and am still not sure of the cause. I haven't been able to reproduce it without ArgoCD, but as far as I can see ArgoCD should NOT automatically delete anything; it should only keep trying to apply the YAML until the cluster matches what it sees in git.
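
(A minimal sketch of the kind of availability-check hook described above; it assumes a hook image that ships kubectl and a ServiceAccount allowed to read deployments in the cert-manager namespace. None of these names come from the actual configuration.)

apiVersion: batch/v1
kind: Job
metadata:
  name: wait-for-cert-manager
  annotations:
    # Run before the main sync; clean the Job up once it succeeds.
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 3
  template:
    spec:
      serviceAccountName: wait-for-cert-manager  # needs RBAC to get deployments
      restartPolicy: Never
      containers:
        - name: wait
          image: bitnami/kubectl:1.30  # any image with kubectl works
          command:
            # Block the sync until the cert-manager webhook is Available.
            - kubectl
            - wait
            - --namespace=cert-manager
            - --for=condition=Available
            - deployment/cert-manager-webhook
            - --timeout=300s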

dtzar commented 3 months ago

It seems that for some reason the namespaces are getting deleted, but I can't tell from where.

From ArgoCD UI:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
    argocd.argoproj.io/tracking-id: 'addon-gitops-aks-capi-operator:/Namespace:capi-operator-system/capi-system'
    helm.sh/hook: post-install
    helm.sh/hook-weight: '1'
  creationTimestamp: '2024-07-08T20:27:35Z'
  deletionTimestamp: '2024-07-08T20:28:18Z'
  labels:
    kubernetes.io/metadata.name: capi-system
  name: capi-system
  resourceVersion: '51476'
  uid: c7b7379d-2368-4ea3-9fe0-fa7d9e42bbac
spec:
  finalizers:
    - kubernetes
status:
  conditions:
    - lastTransitionTime: '2024-07-08T20:28:27Z'
      message: All resources successfully discovered
      reason: ResourcesDiscovered
      status: 'False'
      type: NamespaceDeletionDiscoveryFailure
    - lastTransitionTime: '2024-07-08T20:28:27Z'
      message: All legacy kube types successfully parsed
      reason: ParsedGroupVersions
      status: 'False'
      type: NamespaceDeletionGroupVersionParsingFailure
    - lastTransitionTime: '2024-07-08T20:28:32Z'
      message: >-
        Failed to delete all resource types, 2 remaining: Internal error
        occurred: error resolving resource, Internal error occurred: error
        resolving resource
      reason: ContentDeletionFailed
      status: 'True'
      type: NamespaceDeletionContentFailure
    - lastTransitionTime: '2024-07-08T20:28:32Z'
      message: All content successfully removed
      reason: ContentRemoved
      status: 'False'
      type: NamespaceContentRemaining
    - lastTransitionTime: '2024-07-08T20:28:32Z'
      message: All content-preserving finalizers finished
      reason: ContentHasNoFinalizers
      status: 'False'
      type: NamespaceFinalizersRemaining
  phase: Terminating

From the cluster itself:

NAME                                STATUS        AGE
argo-events                         Active        144m
argo-rollouts                       Active        144m
argo-workflows                      Active        144m
argocd                              Active        145m
azure-infrastructure-system         Terminating   12m
capi-kubeadm-bootstrap-system       Terminating   12m
capi-kubeadm-control-plane-system   Terminating   12m
capi-operator-system                Active        143m
capi-system                         Terminating   12m
cert-manager                        Active        144m
crossplane-system                   Active        145m
default                             Active        150m
helm-addon-system                   Terminating   12m
kube-node-lease                     Active        150m
kube-public                         Active        150m
kube-system                         Active        150m
workload                            Active        144m

From the capi-operator-system log:

E0708 20:28:25.368707       1 controller.go:329] "Reconciler error" err="failed to create config map for provider \"kubeadm\": configmaps \"controlplane-kubeadm-v1.7.3\" is forbidden: unable to create new content in namespace capi-kubeadm-control-plane-system because it is being terminated" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="55fb76de-9194-434b-bd3c-03c8c07445ed"
I0708 20:28:25.368868       1 genericprovider_controller.go:62] "Reconciling provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:25.369043       1 preflight_checks.go:58] "Performing preflight checks" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:25.369288       1 preflight_checks.go:199] "Preflight checks passed" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:25.369509       1 phases.go:240] "No configuration secret was specified" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:26.075335       1 genericprovider_controller.go:62] "Reconciling provider" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="e0c1d4ba-c81a-4a1d-ba87-d6eff3892d91"
I0708 20:28:26.164419       1 genericprovider_controller.go:190] "Deleting provider resources" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="e0c1d4ba-c81a-4a1d-ba87-d6eff3892d91"
I0708 20:28:26.164461       1 phases.go:547] "Deleting provider" controller="coreprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="CoreProvider" CoreProvider="capi-system/cluster-api" namespace="capi-system" name="cluster-api" reconcileID="e0c1d4ba-c81a-4a1d-ba87-d6eff3892d91"
I0708 20:28:26.770877       1 manifests_downloader.go:80] "Downloading provider manifests" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:26.966645       1 healthcheck_controller.go:122] "Checking provider health" controller="deployment" controllerGroup="apps" controllerKind="Deployment" Deployment="capi-system/capi-controller-manager" namespace="capi-system" name="capi-controller-manager" reconcileID="e6a6c932-b3a5-411a-95dc-c62ba148a5ce"
E0708 20:28:26.975151       1 controller.go:329] "Reconciler error" err="failed to create config map for provider \"kubeadm\": configmaps \"controlplane-kubeadm-v1.7.3\" is forbidden: unable to create new content in namespace capi-kubeadm-control-plane-system because it is being terminated" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="e5bc03d9-3c4c-4619-9b2f-7cef46e6e3cf"
I0708 20:28:26.975293       1 genericprovider_controller.go:62] "Reconciling provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="6d512c3a-7fe7-475a-8db7-436f5d876508"
I0708 20:28:26.975740       1 genericprovider_controller.go:190] "Deleting provider resources" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="6d512c3a-7fe7-475a-8db7-436f5d876508"
I0708 20:28:26.975814       1 phases.go:547] "Deleting provider" controller="controlplaneprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="ControlPlaneProvider" ControlPlaneProvider="capi-kubeadm-control-plane-system/kubeadm" namespace="capi-kubeadm-control-plane-system" name="kubeadm" reconcileID="6d512c3a-7fe7-475a-8db7-436f5d876508"
E0708 20:28:27.408965       1 controller.go:329] "Reconciler error" err="failed to create config map for provider \"azure\": configmaps \"infrastructure-azure-v1.15.2\" is forbidden: unable to create new content in namespace azure-infrastructure-system because it is being terminated" controller="infrastructureprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="InfrastructureProvider" InfrastructureProvider="azure-infrastructure-system/azure" namespace="azure-infrastructure-system" name="azure" reconcileID="3eed34cb-ae60-4e31-a2d2-62b3e9aeab40"
I0708 20:28:27.409030       1 genericprovider_controller.go:62] "Reconciling provider" controller="infrastructureprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="InfrastructureProvider" InfrastructureProvider="azure-infrastructure-system/azure" namespace="azure-infrastructure-system" name="azure" reconcileID="950f078b-c43d-4d94-b49e-662eb491a1e3"
I0708 20:28:27.409174       1 genericprovider_controller.go:190] "Deleting provider resources" controller="infrastructureprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="InfrastructureProvider" InfrastructureProvider="azure-infrastructure-system/azure" namespace="azure-infrastructure-system" name="azure" reconcileID="950f078b-c43d-4d94-b49e-662eb491a1e3"
I0708 20:28:27.409192       1 phases.go:547] "Deleting provider" controller="infrastructureprovider" controllerGroup="operator.cluster.x-k8s.io" controllerKind="InfrastructureProvider" InfrastructureProvider="azure-infrastructure-system/azure" namespace="azure-infrastructure-system" name="azure" reconcileID="950f078b-c43d-4d94-b49e-662eb491a1e3"
dtzar commented 2 months ago

When ArgoCD deploys the raw YAML from this:

helm template capi-operator capi-operator/cluster-api-operator --create-namespace -n capi-operator-system \
--set core="cluster-api:v1.7.4" 

It works (doesn't auto-delete), with one minor error message (even though the capi-operator-system namespace DOES exist and has a pod running there):

[screenshot: Screenshot 2024-07-22 151840]

However, as soon as I add the infrastructure provider (just the YAML diff from this command), the behavior comes back.

helm template capi-operator capi-operator/cluster-api-operator --create-namespace -n capi-operator-system \
--set core="cluster-api:v1.7.4" \
--set infrastructure="azure:v1.16.0" 
dtzar commented 2 months ago

I updated the raw YAML output to the 0.12.0 release. It was generated from:

helm template capi-operator capi-operator/cluster-api-operator --create-namespace -n capi-operator-system \
--set infrastructure="azure:v1.16.0" \
--set addon="helm:v0.2.4" \
--set core="cluster-api:v1.7.4" \
--set manager.featureGates.core.MachinePool="true" \
--set manager.featureGates.azure.MachinePool="true"

Good news - it no longer automatically terminates ALL of the namespaces; it now only terminates the azure-infrastructure-system namespace.

Bad news - the only pod persistently running is the capi-operator. I can see it exhibit similar behavior: it creates all the other required pods (helm-addon, kubeadm, etc.) and then they somehow get removed (even though those other namespaces are no longer stuck in Terminating like on the 0.11.0 release). After those pods go away, they are never brought back.

I am attaching the full raw capi-operator log for reference. capioperator.log

You can reproduce this behavior 100% of the time - see instructions in the linked issue.

dtzar commented 3 weeks ago

Just tested this again with the latest CAPI Operator release and the latest version of ArgoCD. The pre-sync hook still stalls indefinitely, so I have to terminate the sync to make the Cluster API Operator deployment move forward. I'm not sure whether that is what causes all the namespaces to get deleted (they don't even exist yet when I terminate the sync job). But from the K8s API server audit logs I can confirm the deletion comes from ArgoCD. Open to ideas on how to make ArgoCD NOT delete things and have the install work, as well as how to avoid the indefinite sleep hook problem; see the linked issue on ArgoCD.

{"username":"system:serviceaccount:argocd:argocd-application-controller","uid":"8ddf813e-5af5-4b02-bb20-ddd9044fc2f1","groups":["system:serviceaccounts","system:serviceaccounts:argocd","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["argo-cd-argocd-application-controller-0"],"authentication.kubernetes.io/pod-uid":["e20203fe-af97-489f-a64f-23233e4c0097"]}}

{"kind":"DeleteOptions","apiVersion":"meta.k8s.io/__internal","propagationPolicy":"Foreground"}

{"kind":"Namespace","apiVersion":"v1","metadata":{"name":"capi-system","uid":"acd052a8-6211-4b02-b0c2-a468ae37bdf1","resourceVersion":"447914","creationTimestamp":"2024-09-13T19:06:10.0000000Z","deletionTimestamp":"2024-09-13T19:06:47.0000000Z","labels":{"kubernetes.io/metadata.name":"capi-system"},"annotations":{"argocd.argoproj.io/tracking-id":"addon-gitops-aks-capi-operator:/Namespace:capi-operator-system/capi-system","helm.sh/hook":"post-install","helm.sh/hook-weight":"1"},"finalizers":["foregroundDeletion"],"managedFields":[{"manager":"argocd-controller","operation":"Apply","apiVersion":"v1","time":"2024-09-13T19:06:10.0000000Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{"f:argocd.argoproj.io/tracking-id":{},"f:helm.sh/hook":{},"f:helm.sh/hook-weight":{}}}}}]},"spec":{"finalizers":["kubernetes"]},"status":{"phase":"Terminating"}}

{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"argo-cd-argocd-application-controller\" of ClusterRole \"argo-cd-argocd-application-controller\" to ServiceAccount \"argocd-application-controller/argocd\""}

K8s-API-Logs-ArgoCDRepro.xlsx
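
(For anyone repeating the audit-log check: a sketch of a jq filter over newline-delimited audit.k8s.io/v1 events; the audit.log path is an assumption about where the events are collected.)

# Show who deleted the capi-system namespace, and when.
jq -c 'select(.verb == "delete"
              and .objectRef.resource == "namespaces"
              and .objectRef.name == "capi-system")
       | {user: .user.username, stage: .stage, time: .requestReceivedTimestamp}' audit.log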

dtzar commented 3 weeks ago

Removed the sleep sync hook, tweaked cert-manager to come before all the other apps, and added the delete=false annotation to the namespaces; see changes here.
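
(A sketch of what that annotation change looks like on one of the hook-created namespaces; combining it with the existing Prune=false in one comma-separated list is my assumption, not copied from the linked change.)

apiVersion: v1
kind: Namespace
metadata:
  name: capi-system
  annotations:
    # Delete=false tells ArgoCD never to delete this resource, even when
    # the Application is deleted; sync options are comma-separated.
    argocd.argoproj.io/sync-options: Prune=false,Delete=false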

CAPI-Operator unfortunately fails due to not having the proper certs. I did see that the 60 second delay took effect AND that CAPI-Operator didn't even start to try the install until the cert-manager pods reported ready. When I manually applied the cert-manager related YAML from the CAPI-Operator chart and deleted the capi-operator-controller pod, capi-operator started successfully, and ArgoCD then did its job (using sync waves) of creating the namespaces next. The namespaces get created, then you can see ArgoCD trying to create some additional things, and then BAM: namespaces terminated. I set up ArgoCD debug logging before the deletion and captured the logs through the deletion process. Unfortunately, there is no obvious culprit that I can see. Logs attached if you're curious.

appset-controller.log app-controller.log

guettli commented 2 weeks ago

Have you tried PrunePropagationPolicy "background"?
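
(For reference, that option goes under the Application's syncPolicy; the Application name and namespace below are illustrative.)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: capi-operator   # illustrative name
  namespace: argocd
spec:
  # source/destination omitted
  syncPolicy:
    syncOptions:
      # background avoids blocking the sync on foreground cascading deletion
      - PrunePropagationPolicy=background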