open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Otel collector created using otel operator not setting hpa memory utilization config correctly #3283

Closed. shine17 closed this issue 1 month ago

shine17 commented 1 month ago

Component(s)

collector

What happened?

Description

The OTel Collector created by the OTel Operator does not apply the HPA memory utilization configuration correctly.

Steps to Reproduce

Deploy the OTel Operator, then create an OTel Collector deployment object with a minimum of 3 replicas and a maximum of 6 replicas:

replicas: {{ .Values.minReplicaCount }}
resources:
    limits:
      cpu: 100m
      memory: 1024Mi
      # ephemeral-storage: 50Mi
    requests:
      cpu: 100m
      memory: 64Mi
autoscaler:
    minReplicas: {{ .Values.minReplicaCount }}
    maxReplicas: {{ .Values.maxReplicaCount }}
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 2
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60

The targetMemoryUtilization is not honored: the HPA always scales the collector pods up, even though memory utilization is less than 30 percent of the limit for each collector pod.

pod memory data:

NAME                                      CPU(cores)   MEMORY(bytes)
otel-gateway-collector-7898f79fdd-27l9j   1m           55Mi

hpa data:

NAME                     REFERENCE                             TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
otel-gateway-collector   OpenTelemetryCollector/otel-gateway   112%/65%, 4%/80%   3         6         6          106m
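For context, a back-of-envelope calculation with the numbers above (this assumes the HPA compares usage against the 64Mi request, as its status message further down says "percentage of request"):

# HPA Utilization targets are evaluated against requests, not limits (assumption stated above)
# memory usage / request = 55Mi / 64Mi   ≈ 86%  (already above the 65% target)
# memory usage / limit   = 55Mi / 1024Mi ≈  5%  (this ratio is not what the HPA evaluates)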

Expected Result

Scaling should happen only when utilization exceeds the targetMemoryUtilization percentage.

Actual Result

Scaling happens because the memory utilization is calculated incorrectly relative to targetMemoryUtilization.

Also, please add test cases for targetMemoryUtilization to the repo; I could not find any under https://github.com/open-telemetry/opentelemetry-operator/tree/main/tests/e2e-autoscale/autoscale

Kubernetes Version

1.29.7

Operator version

0.108.0

Collector version

0.109.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

Log output

No response

Additional context

No response

jaronoff97 commented 1 month ago

can you share the generated HPA resource?
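Something like the following should dump it (using the collector name and namespace shown in your issue; -o yaml prints the full resource):

kubectl get hpa otel-gateway-collector -n monitoringapps -o yaml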

shine17 commented 1 month ago

> can you share the generated HPA resource?

@jaronoff97 this is the HPA configuration in my deployment YAML:

  replicas: {{ .Values.minReplicaCount }}
  autoscaler:
    minReplicas: {{ .Values.minReplicaCount }}
    maxReplicas: {{ .Values.maxReplicaCount }}
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 2
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60
  securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
  resources:
    limits:
      cpu: 1000m
      memory: 1024Mi
    requests:
      cpu: 50m
      memory: 64Mi

Below is the generated HPA YAML:

kubectl get hpa otel-gateway-collector -n monitoringapps -o yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: otel-gateway-deployment
    meta.helm.sh/release-namespace: monitoringapps
  creationTimestamp: "2024-09-14T06:55:59Z"
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: monitoringapps.otel-gateway
    app.kubernetes.io/managed-by: opentelemetry-operator
    app.kubernetes.io/name: otel-gateway-collector
    app.kubernetes.io/part-of: opentelemetry
    app.kubernetes.io/version: latest
  name: otel-gateway-collector
  namespace: monitoringapps
  ownerReferences:
  - apiVersion: opentelemetry.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: OpenTelemetryCollector
    name: otel-gateway
    uid: 8766a8asfadadadad
  resourceVersion: "549018"
  uid: f06sfaffffffff
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 600
        type: Pods
        value: 1
      selectPolicy: Min
      stabilizationWindowSeconds: 900
    scaleUp:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 2
      - periodSeconds: 60
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 60
  maxReplicas: 6
  metrics:
  - resource:
      name: memory
      target:
        averageUtilization: 65
        type: Utilization
    type: Resource
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    name: otel-gateway
status:
  conditions:
  - lastTransitionTime: "2024-09-14T06:56:14Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2024-09-14T09:35:19Z"
    message: the HPA was able to successfully calculate a replica count from memory
      resource utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2024-09-15T06:39:18Z"
    message: the desired replica count is more than the maximum replica count
    reason: TooManyReplicas
    status: "True"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 108
        averageValue: 72924501333m
      name: memory
    type: Resource
  - resource:
      current:
        averageUtilization: 3
        averageValue: 1m
      name: cpu
    type: Resource
  currentReplicas: 6
  desiredReplicas: 6
  lastScaleTime: "2024-09-15T06:39:18Z"

shine17 commented 1 month ago

@jaronoff97 Could you add a test case for memory-based autoscaling here, so that you can reproduce it?

https://github.com/open-telemetry/opentelemetry-operator/tree/main/tests/e2e-autoscale/autoscale

jaronoff97 commented 1 month ago

@shine17 we already have a test case for CPU; I copied it for memory and was unable to reproduce your bug. Is this an issue with the operator? You mentioned deployment.yaml; where is that coming from for you? Is it possible your Helm chart is misconfigured?

https://github.com/open-telemetry/opentelemetry-operator/pull/3293

If you are able to reproduce this locally, can you please provide a full working example?
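For reference, a minimal self-contained CR along these lines would be ideal (a sketch only: it reuses the v1beta1 fields and resource values already shown in this issue, and the OTLP-to-debug pipeline is just an illustrative placeholder):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: monitoringapps
spec:
  mode: deployment
  replicas: 3
  autoscaler:
    minReplicas: 3
    maxReplicas: 6
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 1000m
      memory: 1024Mi
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]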