open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/k8scluster] Use newer v2 HorizontalPodAutoscaler for Kubernetes 1.26 #20480

Closed jvoravong closed 1 year ago

jvoravong commented 1 year ago

Component(s)

receiver/k8scluster

What happened?

Description

Right now the receiver only supports the v2beta2 HorizontalPodAutoscaler API. To support Kubernetes v1.26, which no longer serves v2beta2, we need to add support for the v2 API. Kubernetes v1.26 was released in December 2022. This version is still new, and distributions like AKS, EKS, OpenShift, and GKE will start offering it soon (if they have not already).

Related Startup Log Warning Message: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
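
For illustration, a minimal client-go sketch (not the receiver's actual code) of reading HPA objects through the replacement autoscaling/v2 API; the in-cluster config and the default namespace are assumptions made for the example:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes in-cluster service account credentials, as the receiver uses.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// AutoscalingV2() targets autoscaling/v2, the API that replaces the
	// v2beta2 client removed in Kubernetes 1.26.
	hpas, err := client.AutoscalingV2().HorizontalPodAutoscalers("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, hpa := range hpas.Items {
		fmt.Println(hpa.Name, hpa.Status.CurrentReplicas)
	}
}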

Steps to Reproduce

Spin up a Kubernetes 1.25 cluster and deploy the k8scluster receiver to it. Follow the collector's startup logs and you will notice the warning mentioned above.

Expected Result

The k8scluster receiver can monitor v2 HorizontalPodAutoscaler objects.

Actual Result

In Kubernetes 1.25, you get a warning in the collector logs. In Kubernetes 1.26, you get an error in the logs, and users might notice that HPA metrics they were expecting are missing.

Collector version

v0.72.0

Environment information

Environment

This will affect all Kubernetes 1.26 clusters. I tested and found the related log warnings in ROSA 4.12 (OpenShift 4.12, Kubernetes 1.25).

OpenTelemetry Collector configuration

---
# Source: https://github.com/signalfx/splunk-otel-collector-chart/blob/main/examples/collector-cluster-receiver-only/rendered_manifests/configmap-cluster-receiver.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-splunk-otel-collector-otel-k8s-cluster-receiver
  labels:
    app.kubernetes.io/name: splunk-otel-collector
    helm.sh/chart: splunk-otel-collector-0.72.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: default
    app.kubernetes.io/version: "0.72.0"
    app: splunk-otel-collector
    chart: splunk-otel-collector-0.72.0
    release: default
    heritage: Helm
data:
  relay: |
    exporters:
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://api.CHANGEME.signalfx.com
        ingest_url: https://ingest.CHANGEME.signalfx.com
        timeout: 10s
      splunk_hec/o11y:
        disable_compression: true
        endpoint: https://ingest.CHANGEME.signalfx.com/v1/log
        log_data_enabled: true
        profiling_data_enabled: false
        token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
    extensions:
      health_check: null
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    processors:
      batch: null
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: metric_source
          value: kubernetes
        - action: upsert
          key: k8s.cluster.name
          value: CHANGEME
      resource/add_collector_k8s:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/k8s_cluster:
        attributes:
        - action: insert
          key: receiver
          value: k8scluster
      resourcedetection:
        detectors:
        - env
        - system
        override: true
        timeout: 10s
      transform/add_sourcetype:
        log_statements:
        - context: log
          statements:
          - set(resource.attributes["com.splunk.sourcetype"], Concat(["kube:object:",
            attributes["k8s.resource.name"]], ""))
    receivers:
      k8s_cluster:
        auth_type: serviceAccount
        metadata_exporters:
        - signalfx
      k8sobjects:
        auth_type: serviceAccount
        objects:
        - field_selector: status.phase=Running
          interval: 15m
          label_selector: environment in (production),tier in (frontend)
          mode: pull
          name: pods
        - group: events.k8s.io
          mode: watch
          name: events
          namespaces:
          - default
      prometheus/k8s_cluster_receiver:
        config:
          scrape_configs:
          - job_name: otel-k8s-cluster-receiver
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${K8S_POD_IP}:8889
    service:
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        logs/objects:
          exporters:
          - splunk_hec/o11y
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          - transform/add_sourcetype
          receivers:
          - k8sobjects
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource
          - resource/k8s_cluster
          receivers:
          - k8s_cluster
        metrics/collector:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_collector_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/k8s_cluster_receiver
      telemetry:
        metrics:
          address: 0.0.0.0:8889

Log output

W0329 15:21:31.802913       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
W0329 15:29:19.805634       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler

Additional context

Related to: https://github.com/signalfx/splunk-otel-collector/issues/2457

github-actions[bot] commented 1 year ago

Pinging code owners for receiver/k8scluster: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

AchimGrolimund commented 1 year ago

This also happens on Collector version v0.73.0.

And it is not only the HPA; it is also related to v1beta1.CronJob.

See the example from my log file: splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log

jvoravong commented 1 year ago

@AchimGrolimund can you please provide more details about your Kubernetes environment?

I didn't see this issue in my Kops-created Kubernetes 1.25 cluster. We already have support for batchv1.CronJob, so I'm wondering how this is happening.

AchimGrolimund commented 1 year ago

Hello @jvoravong, we are using ROSA 4.12.

https://docs.openshift.com/container-platform/4.12/release_notes/ocp-4-12-release-notes.html

Next week I can provide more information.

We are using the splunk-otel-collector v0.72.0


iblancasa commented 1 year ago

I can help with supporting HorizontalPodAutoscaler v2.

AchimGrolimund commented 1 year ago

@jvoravong Sorry for my late reply.

We are currently using the following version: https://github.com/signalfx/splunk-otel-collector/releases/tag/v0.76.0

$ oc version
Client Version: 4.12.0-202303081116.p0.g846602e.assembly.stream-846602e
Kustomize Version: v4.5.7
Server Version: 4.12.11
Kubernetes Version: v1.25.7+eab9cc9

And here are the logs:

...
2023-05-03T10:45:44.563Z info service/service.go:129 Starting otelcol... {"Version": "v0.76.0", "NumCPU": 16}
....
W0503 10:45:48.056292 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:48.056337 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:49.019103 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:49.019186 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:53.008856 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:53.008902 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:53.133807 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:53.133863 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.810228 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:59.810287 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.818576 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:59.818624 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:46:16.106509 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:46:16.106555 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource

Can we expect a solution soon?

salapatt commented 1 year ago

batchv1.CronJob is supported, but the question is whether v1beta1.CronJob and v2beta1.HorizontalPodAutoscaler are taken care of in the code.

Please provide an ETA.

AchimGrolimund commented 1 year ago

Here is some additional information:

$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}' | sort
1.25    cronjobs.v1beta1.batch
1.25    horizontalpodautoscalers.v2beta1.autoscaling
1.26    horizontalpodautoscalers.v2beta2.autoscaling

jvoravong commented 1 year ago

Looking into this, will get back here soon.

salapatt commented 1 year ago

Thanks @jvoravong, I am the support engineer on CASE 3182925; I appreciate your help on this.

jvoravong commented 1 year ago

I did miss adding a watcher for the HPA v2 code; I have a fix started for it. I verified that k8s.hpa.* and k8s.job.* metrics are exported in Kubernetes 1.25 and 1.26. I couldn't get the HPA warnings to stop on 1.25 even with this last fix, though; I think it's due to how we watch for both versions of HPA.
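
An illustrative sketch of what watching both HPA API versions looks like with client-go shared informers (this is not the code from the fix; the clientset wiring and the resync interval are assumptions):

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Informer for the current autoscaling/v2 API, served since Kubernetes 1.23.
	v2 := factory.Autoscaling().V2().HorizontalPodAutoscalers().Informer()

	// Informer for the legacy autoscaling/v2beta2 API. On a 1.25 cluster both
	// versions are still served, so this watch is what keeps emitting the
	// deprecation warning even after v2 support is added.
	v2beta2 := factory.Autoscaling().V2beta2().HorizontalPodAutoscalers().Informer()

	stop := make(chan struct{})
	go v2.Run(stop)
	go v2beta2.Run(stop)
	<-stop
}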

dmitryax commented 1 year ago

I couldn't get the HPA warnings to stop on 1.25 even with this last fix, though; I think it's due to how we watch for both versions of HPA.

That's fine. We have the same behavior for jobs when both versions are supported by the k8s API.

dmitryax commented 1 year ago

Closing as resolved by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/21497

dmitryax commented 1 year ago

@AchimGrolimund, looking at the log output splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log, it seems like the errors are coming from smartagent/openshift-cluster, not from the k8scluster receiver. Do you have the k8scluster receiver enabled in the collector pipelines?

AchimGrolimund commented 1 year ago

Hey @dmitryax, here is our ConfigMap:

---
# Source: splunk-otel-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: splunk-otel-collector-agent-configmap
  namespace: xxxxxxxx-splunk-otel-collector
  labels:
    app: splunk-otel-collector-agent
data:
  relay: |
    exporters:
      sapm:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        endpoint: https://xxxxxx:443/ingest/v2/trace
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://xxxxxxx:443/api/
        correlation: null
        ingest_url: https://xxxxxxx:443/ingest/
        sync_host_metadata: true
    extensions:
      health_check: null
      k8s_observer:
        auth_type: serviceAccount
        node: ${K8S_NODE_NAME}
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
      zpages: null
    processors:
      batch: null
      filter/logs:
        logs:
          exclude:
            match_type: strict
            resource_attributes:
            - key: splunk.com/exclude
              value: "true"
      groupbyattrs/logs:
        keys:
        - com.splunk.source
        - com.splunk.sourcetype
        - container.id
        - fluent.tag
        - istio_service_name
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
      k8sattributes:
        extract:
          annotations:
          - from: pod
            key: splunk.com/sourcetype
          - from: namespace
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: pod
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: namespace
            key: splunk.com/index
            tag_name: com.splunk.index
          - from: pod
            key: splunk.com/index
            tag_name: com.splunk.index
          labels:
          - key: app
          metadata:
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - container.id
          - container.image.name
          - container.image.tag
        filter:
          node_from_env_var: K8S_NODE_NAME
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: ip
        - sources:
          - from: connection
        - sources:
          - from: resource_attribute
            name: host.name
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: upsert
          key: k8s.cluster.name
          value: HCP-ROSA-PROD1
      resource/add_agent_k8s:
        attributes:
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/logs:
        attributes:
        - action: upsert
          from_attribute: k8s.pod.annotations.splunk.com/sourcetype
          key: com.splunk.sourcetype
        - action: delete
          key: k8s.pod.annotations.splunk.com/sourcetype
        - action: delete
          key: splunk.com/exclude
      resourcedetection:
        detectors:
        - env
        - ec2
        - system
        override: true
        timeout: 10s
    receivers:
      smartagent/openshift-cluster:
        type: openshift-cluster
        alwaysClusterReporter: true
        kubernetesAPI:
          authType: serviceAccount
        datapointsToExclude:
        - dimensions:
          metricNames:
            - '*appliedclusterquota*'
            - '*clusterquota*'
        extraMetrics:
          - kubernetes.container_cpu_request
          - kubernetes.container_memory_request
          - kubernetes.job.completions
          - kubernetes.job.active
          - kubernetes.job.succeeded
          - kubernetes.job.failed
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
          paging: null
          processes: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 10s
        endpoint: ${K8S_NODE_IP}:10250
        extra_metadata_labels:
        - container.id
        metric_groups:
        - container
        - pod
        - node
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus/agent:
        config:
          scrape_configs:
          - job_name: otel-agent
            scrape_interval: 10s
            static_configs:
            - targets:
              - 127.0.0.1:8889
      receiver_creator:
        receivers:
          smartagent/coredns:
            config:
              extraDimensions:
                metric_source: k8s-coredns
              port: 9154
              skipVerify: true
              type: coredns
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && namespace == "openshift-dns" && name contains "dns"
          smartagent/kube-controller-manager:
            config:
              extraDimensions:
                metric_source: kubernetes-controller-manager
              port: 10257
              skipVerify: true
              type: kube-controller-manager
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "kube-controller-manager" && labels["kube-controller-manager"]
              == "true"
          smartagent/kubernetes-apiserver:
            config:
              extraDimensions:
                metric_source: kubernetes-apiserver
              skipVerify: true
              type: kubernetes-apiserver
              useHTTPS: true
              useServiceAccount: true
            rule: type == "port" && port == 6443 && pod.labels["app"] == "openshift-kube-apiserver"
              && pod.labels["apiserver"] == "true"
          smartagent/kubernetes-proxy:
            config:
              extraDimensions:
                metric_source: kubernetes-proxy
              #port: 29101
              port: 9101
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
              type: kubernetes-proxy
            rule: type == "pod" && labels["app"] == "sdn"
          smartagent/kubernetes-scheduler:
            config:
              extraDimensions:
                metric_source: kubernetes-scheduler
              # port: 10251
              port: 10259
              type: kubernetes-scheduler
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "openshift-kube-scheduler" && labels["scheduler"]
              == "true"
        watch_observers:
        - k8s_observer
      signalfx:
        endpoint: 0.0.0.0:9943
      smartagent/signalfx-forwarder:
        listenAddress: 0.0.0.0:9080
        type: signalfx-forwarder
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - health_check
      - k8s_observer
      - memory_ballast
      - zpages
      pipelines:
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          receivers:
          - hostmetrics
          - kubeletstats
          - otlp
          - receiver_creator
          - signalfx
          - smartagent/openshift-cluster
        metrics/agent:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_agent_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/agent
        traces:
          exporters:
          - sapm
          - signalfx
          processors:
          - memory_limiter
          - k8sattributes
          - batch
          - resourcedetection
          - resource
          receivers:
          - otlp
          - jaeger
          - smartagent/signalfx-forwarder
          - zipkin
      telemetry:
        metrics:
          address: 127.0.0.1:8889

Best Regards Achim

dmitryax commented 1 year ago

@AchimGrolimund Thank you. This is coming from smartagent/openshift-cluster, so it's unrelated to this issue and has to be solved separately. @jvoravong, can you please follow up on this? I'm not sure if we have an OTel-native receiver to replace it with.

dmitryax commented 1 year ago

Looks like the k8scluster receiver supports scraping additional OpenShift metrics (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sclusterreceiver#openshift), but it should be run separately as a 1-replica deployment. @AchimGrolimund, did you try it by chance?

Borrelworst commented 1 year ago

Just to add: in the case of Azure you will not be able to upgrade from 1.25 to 1.26 while the agent is still querying the v2beta2 autoscaler API. Because Azure prevents upgrading when deprecated APIs are still in use, the upgrade fails. You either have to force the upgrade, or remove the SignalFx agent, wait for 12 hours, and then try again.

It would be nice if the agent checked the Kubernetes version: if it is higher than 1.25, it should not monitor the /apis/autoscaling/v2beta2/horizontalpodautoscalers API endpoint (one way to do such a check is sketched below).
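
A minimal sketch of that kind of check, assuming a discovery client built from the same rest.Config the collector already uses: ask the API server which autoscaling versions it serves and only fall back to v2beta2 when v2 is absent.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List every API group and version the server advertises.
	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}

	useV2 := false
	for _, g := range groups.Groups {
		if g.Name != "autoscaling" {
			continue
		}
		for _, v := range g.Versions {
			if v.Version == "v2" {
				useV2 = true
			}
		}
	}

	if useV2 {
		fmt.Println("watch autoscaling/v2 HorizontalPodAutoscaler")
	} else {
		fmt.Println("fall back to autoscaling/v2beta2 HorizontalPodAutoscaler")
	}
}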

salapatt commented 1 year ago

The customer xxx updated the Splunk OTC agent to version 0.77.0 and still gets the same error messages.

W0522 06:11:24.226426 1 reflector.go:533] k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0522 06:11:24.226454 1 reflector.go:148] k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource

jvoravong commented 1 year ago

Update on Deprecated Endpoint Removal:

Additional Context: