projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0
5.97k stars 1.33k forks source link

Tigera Operator helm chart not working with argo cd when specifying a private registry #7121

Open snakebyte91 opened 1 year ago

snakebyte91 commented 1 year ago

Expected Behavior

Helm chart resources are not modified after deployment.

Current Behavior

Tigera operator modifies the Installation resource and argo cd will immediately revert the changes. This results in a not healthy apiservice.

Steps to Reproduce (for bugs)

  1. Deploy tigera-operator helm chart with argo cd and add the attribute installation.registry

Context

Use calico images from a private registry

Your Environment

Helm chart version: 3.24.5 Kubernetes version: 1.22

installation:
  enabled: true
  kubernetesProvider: EKS
  installation:
    registry: <private-registry>
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
....
caseydavenport commented 1 year ago

Tigera operator modifies the Installation resource and argo cd will immediately revert the changes

Do you have visibility into what types of modifications are being made?

The operator will default some fields if they are unset and write them back, which is what I suspect is happening.

Josh-Tigera commented 1 year ago

What version of the operator are you running?

snakebyte91 commented 1 year ago

I think I found the problem. I did not add the "/" at the end of the registry. So the tigera-operator changed the value for registry to <account_id>.dkr.ecr.eu-central-1.amazonaws.com/ and Argo changed it back to <account_id>.dkr.ecr.eu-central-1.amazonaws.com. This happened in an endless loop.

The following values file is working with Argo CD now:

installation:
  enabled: true
  kubernetesProvider: EKS
  registry: <account_id>.dkr.ecr.eu-central-1.amazonaws.com/

Desired Manifest in Argo CD:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  labels:
    argocd.argoproj.io/instance: dev-helm-tigera-operator
  name: default
spec:
  imagePullSecrets: []
  kubernetesProvider: EKS
  registry: <account_id>.dkr.ecr.eu-central-1.amazonaws.com/

Live Manifest in Argo CD:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"operator.tigera.io/v1","kind":"Installation","metadata":{"annotations":{},"labels":{"argocd.argoproj.io/instance":"dev-helm-tigera-operator"},"name":"default"},"spec":{"imagePullSecrets":[],"kubernetesProvider":"EKS","registry":"838509663246.dkr.ecr.eu-central-1.amazonaws.com/"}}
  creationTimestamp: '2022-12-22T09:11:58Z'
  finalizers:
    - tigera.io/operator-cleanup
  generation: 4043
  labels:
    argocd.argoproj.io/instance: dev-helm-tigera-operator
  name: default
  resourceVersion: '81619299'
  uid: d90cad43-8a65-48ce-ab13-ba76b8cc4755
spec:
  calicoNetwork:
    bgp: Disabled
    linuxDataplane: Iptables
  cni:
    ipam:
      type: AmazonVPC
    type: AmazonVPC
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  imagePullSecrets: []
  kubeletVolumePluginPath: /var/lib/kubelet
  kubernetesProvider: EKS
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  registry: 838509663246.dkr.ecr.eu-central-1.amazonaws.com/
  variant: Calico
status:
  computed:
    calicoNetwork:
      bgp: Disabled
      linuxDataplane: Iptables
    cni:
      ipam:
        type: AmazonVPC
      type: AmazonVPC
    controlPlaneReplicas: 2
    flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
    kubeletVolumePluginPath: /var/lib/kubelet
    kubernetesProvider: EKS
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
    nonPrivileged: Disabled
    registry: <account_id>.dkr.ecr.eu-central-1.amazonaws.com/
    variant: Calico
  conditions:
    - lastTransitionTime: '2023-01-12T16:18:51Z'
      message: All Objects Available
      observedGeneration: 4043
      reason: AllObjectsAvailable
      status: 'False'
      type: Degraded
    - lastTransitionTime: '2023-01-12T16:18:51Z'
      message: All objects available
      observedGeneration: 4043
      reason: AllObjectsAvailable
      status: 'True'
      type: Ready
    - lastTransitionTime: '2023-01-12T16:18:51Z'
      message: All Objects Available
      observedGeneration: 4043
      reason: AllObjectsAvailable
      status: 'False'
      type: Progressing
  mtu: 9001
  variant: Calico

I think the tigera-operator adds attributes like spec.calicoNetwork and spec.cni

Helm chart version 3.24.5 is using the image tigera/operator:v1.28.5

caseydavenport commented 1 year ago

So the tigera-operator changed the value for registry to .dkr.ecr.eu-central-1.amazonaws.com/ and Argo changed it back to .dkr.ecr.eu-central-1.amazonaws.com. This happened in an endless loop.

This isn't the first time I've seen this - we should probably adjust the operator logic so that we don't write the / back to that field.

We still need to add it internally if needed, but the problem is that we try to "correct" the input config, causing this loop.

Mohsen51 commented 1 year ago

Hi ! I have got the same issue here. I didn't mention the final "/" in my values file for a calico deployment using helm. The exact issue is Calico-apiserver pods (only them) were endlessly terminating and re-starting. Half of the time, they were using the right image path and the other half of the time were trying to use an image from my registry with a path without the final "/". The necessity of adding the final "/" in the values.yaml file is not at all mention in the documentation (eg: https://github.com/projectcalico/calico/blob/master/charts/tigera-operator/values.yaml#L45). At least while this issue stays open and unfixed it could be nice to mention it somewhere in the documentation to avoid having to debug it.

ZorkMaster commented 1 year ago

@Mohsen51 The "/" requirement is mentioned in the Calico docs, albeit somewhat buried in the API documentation.

I agree though, this rather unique requirement should be added as a comment to the values.yaml file.