tigera / operator

Kubernetes operator for installing Calico and Calico Enterprise
Apache License 2.0

Tigera Operator installation causing significant growth in kube-apiserver-audit and operator workload logs #3298

Open · play-io opened this issue 6 months ago

play-io commented 6 months ago

After installing the Tigera Operator in my EKS cluster on AWS with kube-apiserver-audit logs enabled, I noticed a significant increase in log volume. These logs are pushed to CloudWatch, leading to an unexpected increase in billing. My primary concerns are twofold:

  1. The operator's continual operations on Kubernetes resources appear to be inflating the kube-apiserver-audit logs substantially.
  2. The operator's ongoing reconciliation is generating a high volume of workload logs of its own, further adding to the total.

I would like to know whether this behavior is expected or whether it indicates a problem with the Tigera Operator that is driving the observed increase in log volume and billing.

Expected Behavior

The Tigera Operator should perform operations on Kubernetes resources (GET, UPDATE, etc.) only as needed, and should reconcile (logged as "msg":"Reconciling Installation.operator.tigera.io") only when necessary.

Current Behavior

After installing the Tigera Operator, I observe extensive logging and frequent access to Kubernetes resources: GET and UPDATE calls against Kubernetes resources, plus near-continuous reconciliations performed by the operator. It is unclear whether this level of activity is expected or indicates a problem with the installation. I am seeking clarification on whether this matches the operator's intended behavior or is an anomaly requiring further investigation.

Possible Solution

n/a

Steps to Reproduce (for bugs)

  1. Install Tigera Operator v1.32.3
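A minimal install sketch (the manifest URL and Calico version are illustrative; use the release that actually bundles operator v1.32.3):

# Install the Tigera Operator and CRDs from the Calico release manifests
# (v3.27.0 is illustrative; pick the release matching operator v1.32.3)
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml

# Create the Installation and APIServer custom resources to kick off the install
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml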

Context

AWS CloudWatch is updated at high frequency with messages like:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"RequestResponse","auditID":"a4f1f1fd-fd58-4431-a833-6a8b0e0bc133","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/calico-system/deployments/calico-kube-controllers","verb":"update","user":{"username":"system:serviceaccount:tigera-operator:tigera-ope
...

[screenshot: CloudWatch audit log entries]
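One way to quantify which of the operator's calls dominate the audit volume (a sketch: the log group name is a placeholder, and the service account is assumed to be the default tigera-operator):

# Tally the operator's API calls by verb and resource over the last hour
# (GNU date; adjust the log group name to your cluster)
aws logs start-query \
  --log-group-name "/aws/eks/my-cluster/cluster" \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'filter user.username = "system:serviceaccount:tigera-operator:tigera-operator" | stats count(*) as calls by verb, objectRef.resource | sort calls desc'

# Fetch the results once the query completes
aws logs get-query-results --query-id <id-returned-above>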

The operator workload's log is likewise updated at high frequency with messages like:

{"level":"info","ts":"2024-04-05T11:34:36Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-04-05T11:34:36Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-typha"}
{"level":"info","ts":"2024-04-05T11:34:37Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"csi-node-driver"}
{"level":"info","ts":"2024-04-05T11:34:37Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-cni-plugin"}
{"level":"info","ts":"2024-04-05T11:34:38Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-node"}
{"level":"info","ts":"2024-04-05T11:34:38Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-04-05T11:34:38Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-typha"}
{"level":"info","ts":"2024-04-05T11:34:39Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"csi-node-driver"}

Your Environment

AWS EKS v1.24

tmjd commented 6 months ago

Can you clarify how often you see the Operator reconciling? Is it reconciling constantly? I'm looking at a cluster with operator v1.33.0 and see reconciles every 5 minutes with a request name periodic-5m0s-reconcile-event. I doubt there are update calls that are triggered from that but I haven't specifically checked.
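A quick way to separate the periodic reconciles from resource-triggered ones (a sketch, assuming the default deployment name):

# Group reconcile log lines by Request.Name; the periodic ones
# show up with the name periodic-5m0s-reconcile-event
kubectl logs -n tigera-operator deployment/tigera-operator \
  | grep 'Reconciling Installation' \
  | grep -o '"Request.Name":"[^"]*"' \
  | sort | uniq -c | sort -rn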

If you are seeing reconciliation consistently, I'd wonder if you are using AddonManager (or something similar) that manages a resource the operator watches, and whether that resource is constantly being updated, triggering changes. For example, the operator writes some default values into the Installation CR; perhaps you have something that manages the Installation CR and is removing those default values (i.e., updating the CR), which would trigger the operator to reconcile again. Another possibility is that something is watching and updating the Deployments/DaemonSets, and the operator keeps reconciling those changes away, fighting with whatever is modifying the Calico resources.
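One way to spot a competing writer is to inspect the field managers recorded on one of the objects the operator keeps reconciling (a sketch; any manager other than the operator repeatedly touching the spec is a candidate):

# List every field manager that has written to calico-kube-controllers,
# with its last operation and timestamp
kubectl get deployment calico-kube-controllers -n calico-system --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'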

I'd suggest sharing a larger snippet of the operator logs if my previous comments don't help identify the issue.

LaikaN57 commented 1 week ago

@tmjd Below are graphs of read and write events for namespaces. I am still digging into which event caused the increase we see.

cc: @diranged @scohen-nd
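To pin down when the increase started (a sketch; the log group name is a placeholder), a time-binned Logs Insights query over the namespace audit events can help:

# Namespace audit events per hour, split by verb, to locate the inflection point
aws logs start-query \
  --log-group-name "/aws/eks/my-cluster/cluster" \
  --start-time "$(date -d '7 days ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'filter objectRef.resource = "namespaces" | stats count(*) as events by bin(1h) as hour, verb | sort hour asc'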

Reads:

[graph: namespace read events]

Writes:

[graph: namespace write events]

The tigera-operator Deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "14"
    kubectl.kubernetes.io/last-applied-configuration: |
      [REDACTED]
    policies.kyverno.io/patches: |
      require-app-label.set-default-app-labels.kyverno.io: added /spec/template/metadata/labels/app
  creationTimestamp: "2021-05-04T19:21:22Z"
  generation: 24
  labels:
    cfn_version: "1.27"
    k8s-app: tigera-operator
  name: tigera-operator
  namespace: tigera-operator
  resourceVersion: "13011510649"
  uid: 4aaf443d-e297-4f30-9d60-d2a853ded0f4
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: tigera-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2022-08-26T14:11:09-07:00"
      creationTimestamp: null
      labels:
        app: tigera-operator
        k8s-app: tigera-operator
        name: tigera-operator
    spec:
      containers:
      - command:
        - operator
        env:
        - name: WATCH_NAMESPACE
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: tigera-operator
        - name: TIGERA_OPERATOR_INIT_IMAGE_VERSION
          value: v3.26.4
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: [REDACTED].dkr.ecr.us-west-2.amazonaws.com/quay-io/tigera/operator:v1.30.10
        imagePullPolicy: IfNotPresent
        name: tigera-operator
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 384Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/calico
          name: var-lib-calico
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
        [REDACTED].group-name: kube-system
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: tigera-operator
      serviceAccountName: tigera-operator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: [REDACTED].group-name
        operator: Equal
        value: kube-system
      volumes:
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-08-16T23:40:35Z"
    lastUpdateTime: "2024-07-21T14:39:14Z"
    message: ReplicaSet "tigera-operator-78b6d57c44" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-10-04T16:48:31Z"
    lastUpdateTime: "2024-10-04T16:48:31Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 24
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

The calico-system Namespace manifest:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    created_by: user
    group: system:masters
    kubectl.kubernetes.io/last-applied-configuration: |
      [REDACTED]
    owner: kubernetes-admin
    policies.kyverno.io/patches: |
      add-ns-owner-annotations.annotate-namespaces-with-owner.kyverno.io: added /metadata/annotations/owner
  creationTimestamp: "2021-08-16T23:40:22Z"
  labels:
    cfn_version: "1.27"
    kubernetes.io/metadata.name: calico-system
    name: calico-system
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: privileged
  name: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: 5956e270-2844-4c79-830d-ebcc9258268e
  resourceVersion: "9451738645"
  uid: b1a4c8e3-b3a2-4ef0-9b12-c99513f45622
spec:
  finalizers:
  - kubernetes
status:
  phase: Active