open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/k8scluster] Use newer v2 HorizontalPodAutoscaler for Kubernetes 1.26 #20480

Closed jvoravong closed 1 year ago

jvoravong commented 1 year ago

Component(s)

receiver/k8scluster

What happened?

Description

Right now the receiver only supports the v2beta2 HorizontalPodAutoscaler API. To support Kubernetes v1.26, which no longer serves v2beta2, we need to add support for the v2 API. Kubernetes v1.26 was released in December 2022. This version is still new, and distributions like AKS, EKS, OpenShift, and GKE will start offering it soon (if they have not already).

Related Startup Log Warning Message: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
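
For illustration, a minimal client-go sketch (not the receiver's actual code) of reading HPA objects through the replacement autoscaling/v2 API; the in-cluster config and the default namespace are assumptions made for the example:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes in-cluster service account credentials, as the receiver uses.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// AutoscalingV2() targets autoscaling/v2, the API that replaces the
	// v2beta2 client removed in Kubernetes 1.26.
	hpas, err := client.AutoscalingV2().HorizontalPodAutoscalers("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, hpa := range hpas.Items {
		fmt.Println(hpa.Name, hpa.Status.CurrentReplicas)
	}
}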

Steps to Reproduce

Spin up a Kubernetes 1.25 cluster and deploy the k8scluster receiver to it. Follow the collector's startup logs and you will notice the warning mentioned above.

Expected Result

The k8scluster receiver can monitor v2 HorizontalPodAutoscaler objects.

Actual Result

In Kubernetes 1.25, you get a warning in the collector logs. In Kubernetes 1.26, you get an error in the logs, and users might notice that HPA metrics they were expecting are missing.

Collector version

v0.72.0

Environment information

Environment

This will affect all Kubernetes 1.26 clusters. I tested and found the related log warnings in ROSA 4.12 (OpenShift 4.12, Kubernetes 1.25).

OpenTelemetry Collector configuration

---
# Source: https://github.com/signalfx/splunk-otel-collector-chart/blob/main/examples/collector-cluster-receiver-only/rendered_manifests/configmap-cluster-receiver.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-splunk-otel-collector-otel-k8s-cluster-receiver
  labels:
    app.kubernetes.io/name: splunk-otel-collector
    helm.sh/chart: splunk-otel-collector-0.72.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: default
    app.kubernetes.io/version: "0.72.0"
    app: splunk-otel-collector
    chart: splunk-otel-collector-0.72.0
    release: default
    heritage: Helm
data:
  relay: |
    exporters:
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://api.CHANGEME.signalfx.com
        ingest_url: https://ingest.CHANGEME.signalfx.com
        timeout: 10s
      splunk_hec/o11y:
        disable_compression: true
        endpoint: https://ingest.CHANGEME.signalfx.com/v1/log
        log_data_enabled: true
        profiling_data_enabled: false
        token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
    extensions:
      health_check: null
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    processors:
      batch: null
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: metric_source
          value: kubernetes
        - action: upsert
          key: k8s.cluster.name
          value: CHANGEME
      resource/add_collector_k8s:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/k8s_cluster:
        attributes:
        - action: insert
          key: receiver
          value: k8scluster
      resourcedetection:
        detectors:
        - env
        - system
        override: true
        timeout: 10s
      transform/add_sourcetype:
        log_statements:
        - context: log
          statements:
          - set(resource.attributes["com.splunk.sourcetype"], Concat(["kube:object:",
            attributes["k8s.resource.name"]], ""))
    receivers:
      k8s_cluster:
        auth_type: serviceAccount
        metadata_exporters:
        - signalfx
      k8sobjects:
        auth_type: serviceAccount
        objects:
        - field_selector: status.phase=Running
          interval: 15m
          label_selector: environment in (production),tier in (frontend)
          mode: pull
          name: pods
        - group: events.k8s.io
          mode: watch
          name: events
          namespaces:
          - default
      prometheus/k8s_cluster_receiver:
        config:
          scrape_configs:
          - job_name: otel-k8s-cluster-receiver
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${K8S_POD_IP}:8889
    service:
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        logs/objects:
          exporters:
          - splunk_hec/o11y
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          - transform/add_sourcetype
          receivers:
          - k8sobjects
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource
          - resource/k8s_cluster
          receivers:
          - k8s_cluster
        metrics/collector:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_collector_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/k8s_cluster_receiver
      telemetry:
        metrics:
          address: 0.0.0.0:8889

Log output

W0329 15:21:31.802913       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
W0329 15:29:19.805634       1 warnings.go:70] autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler

Additional context

Related to: https://github.com/signalfx/splunk-otel-collector/issues/2457

github-actions[bot] commented 1 year ago

Pinging code owners for receiver/k8scluster: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

AchimGrolimund commented 1 year ago

This also happens on Collector version v0.73.0.

And it is not only the HPA; it is also related to v1beta1.CronJob.

See the example from my log file: splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log

jvoravong commented 1 year ago

@AchimGrolimund can you please provide more details about your Kubernetes environment?

I didn't see this issue in my Kops-created Kubernetes 1.25 cluster. We already have support for batchv1.CronJob, so I'm wondering how this is happening.

AchimGrolimund commented 1 year ago

Hello @jvoravong, we are using ROSA 4.12.

https://docs.openshift.com/container-platform/4.12/release_notes/ocp-4-12-release-notes.html

Next week I can provide more information.

We are using the splunk-otel-collector v0.72.0


iblancasa commented 1 year ago

I can help with supporting HorizontalPodAutoscaler v2.

AchimGrolimund commented 1 year ago

@jvoravong Sorry for my late reply.

We are currently using the following version: https://github.com/signalfx/splunk-otel-collector/releases/tag/v0.76.0

$ oc version
Client Version: 4.12.0-202303081116.p0.g846602e.assembly.stream-846602e
Kustomize Version: v4.5.7
Server Version: 4.12.11
Kubernetes Version: v1.25.7+eab9cc9

And here are the logs:

...
2023-05-03T10:45:44.563Z info service/service.go:129 Starting otelcol... {"Version": "v0.76.0", "NumCPU": 16}
....
W0503 10:45:48.056292 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:48.056337 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:49.019103 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:49.019186 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:53.008856 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:53.008902 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:45:53.133807 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:53.133863 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.810228 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:45:59.810287 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource
W0503 10:45:59.818576 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0503 10:45:59.818624 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
W0503 10:46:16.106509 1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.CronJob: the server could not find the requested resource
E0503 10:46:16.106555 1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.CronJob: failed to list *v1beta1.CronJob: the server could not find the requested resource

Can we expect a solution soon?

salapatt commented 1 year ago

batchv1.CronJob is supported, but the question is whether v1beta1.CronJob and v2beta1.HorizontalPodAutoscaler are taken care of in the code.

Please provide an ETA.

AchimGrolimund commented 1 year ago

Here is some additional information:

$ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}' | sort
1.25    cronjobs.v1beta1.batch
1.25    horizontalpodautoscalers.v2beta1.autoscaling
1.26    horizontalpodautoscalers.v2beta2.autoscaling

jvoravong commented 1 year ago

Looking into this, will get back here soon.

salapatt commented 1 year ago

Thanks @jvoravong, I am the support engineer on CASE 3182925; I appreciate your help on this.

jvoravong commented 1 year ago

I did miss adding a watcher for the HPA v2 code; I have a fix started for it. I verified that k8s.hpa.* and k8s.job.* metrics are exported in Kubernetes 1.25 and 1.26. I couldn't get the HPA warnings to stop on 1.25 even with this last fix, though; I think it's due to how we watch for both versions of HPA.
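
An illustrative sketch of what watching both HPA API versions looks like with client-go shared informers (this is not the code from the fix; the clientset wiring and the resync interval are assumptions):

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Informer for the current autoscaling/v2 API, served since Kubernetes 1.23.
	v2 := factory.Autoscaling().V2().HorizontalPodAutoscalers().Informer()

	// Informer for the legacy autoscaling/v2beta2 API. On a 1.25 cluster both
	// versions are still served, so this watch is what keeps emitting the
	// deprecation warning even after v2 support is added.
	v2beta2 := factory.Autoscaling().V2beta2().HorizontalPodAutoscalers().Informer()

	stop := make(chan struct{})
	go v2.Run(stop)
	go v2beta2.Run(stop)
	<-stop
}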

dmitryax commented 1 year ago

I couldn't get the HPA warnings to stop on 1.25 even with this last fix, though; I think it's due to how we watch for both versions of HPA.

That's fine. We have the same behavior for jobs when both versions are supported by the k8s API.

dmitryax commented 1 year ago

Closing as resolved by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/21497

dmitryax commented 1 year ago

@AchimGrolimund, looking at the log output splunk-otel-collector-agent-96r7z-splunk-otel-collector-agent.log, it seems like the errors are coming from smartagent/openshift-cluster, not from the k8scluster receiver. Do you have the k8scluster receiver enabled in the collector pipelines?

AchimGrolimund commented 1 year ago

Hey @dmitryax, here is our ConfigMap:

---
# Source: splunk-otel-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: splunk-otel-collector-agent-configmap
  namespace: xxxxxxxx-splunk-otel-collector
  labels:
    app: splunk-otel-collector-agent
data:
  relay: |
    exporters:
      sapm:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        endpoint: https://xxxxxx:443/ingest/v2/trace
      signalfx:
        access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
        api_url: https://xxxxxxx:443/api/
        correlation: null
        ingest_url: https://xxxxxxx:443/ingest/
        sync_host_metadata: true
    extensions:
      health_check: null
      k8s_observer:
        auth_type: serviceAccount
        node: ${K8S_NODE_NAME}
      memory_ballast:
        size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
      zpages: null
    processors:
      batch: null
      filter/logs:
        logs:
          exclude:
            match_type: strict
            resource_attributes:
            - key: splunk.com/exclude
              value: "true"
      groupbyattrs/logs:
        keys:
        - com.splunk.source
        - com.splunk.sourcetype
        - container.id
        - fluent.tag
        - istio_service_name
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
      k8sattributes:
        extract:
          annotations:
          - from: pod
            key: splunk.com/sourcetype
          - from: namespace
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: pod
            key: splunk.com/exclude
            tag_name: splunk.com/exclude
          - from: namespace
            key: splunk.com/index
            tag_name: com.splunk.index
          - from: pod
            key: splunk.com/index
            tag_name: com.splunk.index
          labels:
          - key: app
          metadata:
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - container.id
          - container.image.name
          - container.image.tag
        filter:
          node_from_env_var: K8S_NODE_NAME
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: ip
        - sources:
          - from: connection
        - sources:
          - from: resource_attribute
            name: host.name
      memory_limiter:
        check_interval: 2s
        limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
      resource:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
        - action: upsert
          key: k8s.cluster.name
          value: HCP-ROSA-PROD1
      resource/add_agent_k8s:
        attributes:
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/logs:
        attributes:
        - action: upsert
          from_attribute: k8s.pod.annotations.splunk.com/sourcetype
          key: com.splunk.sourcetype
        - action: delete
          key: k8s.pod.annotations.splunk.com/sourcetype
        - action: delete
          key: splunk.com/exclude
      resourcedetection:
        detectors:
        - env
        - ec2
        - system
        override: true
        timeout: 10s
    receivers:
      smartagent/openshift-cluster:
        type: openshift-cluster
        alwaysClusterReporter: true
        kubernetesAPI:
          authType: serviceAccount
        datapointsToExclude:
        - dimensions:
          metricNames:
            - '*appliedclusterquota*'
            - '*clusterquota*'
        extraMetrics:
          - kubernetes.container_cpu_request
          - kubernetes.container_memory_request
          - kubernetes.job.completions
          - kubernetes.job.active
          - kubernetes.job.succeeded
          - kubernetes.job.failed
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
          paging: null
          processes: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 10s
        endpoint: ${K8S_NODE_IP}:10250
        extra_metadata_labels:
        - container.id
        metric_groups:
        - container
        - pod
        - node
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus/agent:
        config:
          scrape_configs:
          - job_name: otel-agent
            scrape_interval: 10s
            static_configs:
            - targets:
              - 127.0.0.1:8889
      receiver_creator:
        receivers:
          smartagent/coredns:
            config:
              extraDimensions:
                metric_source: k8s-coredns
              port: 9154
              skipVerify: true
              type: coredns
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && namespace == "openshift-dns" && name contains "dns"
          smartagent/kube-controller-manager:
            config:
              extraDimensions:
                metric_source: kubernetes-controller-manager
              port: 10257
              skipVerify: true
              type: kube-controller-manager
              useHTTPS: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "kube-controller-manager" && labels["kube-controller-manager"]
              == "true"
          smartagent/kubernetes-apiserver:
            config:
              extraDimensions:
                metric_source: kubernetes-apiserver
              skipVerify: true
              type: kubernetes-apiserver
              useHTTPS: true
              useServiceAccount: true
            rule: type == "port" && port == 6443 && pod.labels["app"] == "openshift-kube-apiserver"
              && pod.labels["apiserver"] == "true"
          smartagent/kubernetes-proxy:
            config:
              extraDimensions:
                metric_source: kubernetes-proxy
              #port: 29101
              port: 9101
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
              type: kubernetes-proxy
            rule: type == "pod" && labels["app"] == "sdn"
          smartagent/kubernetes-scheduler:
            config:
              extraDimensions:
                metric_source: kubernetes-scheduler
              # port: 10251
              port: 10259
              type: kubernetes-scheduler
              useHTTPS: true
              skipVerify: true
              useServiceAccount: true
            rule: type == "pod" && labels["app"] == "openshift-kube-scheduler" && labels["scheduler"]
              == "true"
        watch_observers:
        - k8s_observer
      signalfx:
        endpoint: 0.0.0.0:9943
      smartagent/signalfx-forwarder:
        listenAddress: 0.0.0.0:9080
        type: signalfx-forwarder
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - health_check
      - k8s_observer
      - memory_ballast
      - zpages
      pipelines:
        metrics:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resourcedetection
          - resource
          receivers:
          - hostmetrics
          - kubeletstats
          - otlp
          - receiver_creator
          - signalfx
          - smartagent/openshift-cluster
        metrics/agent:
          exporters:
          - signalfx
          processors:
          - memory_limiter
          - batch
          - resource/add_agent_k8s
          - resourcedetection
          - resource
          receivers:
          - prometheus/agent
        traces:
          exporters:
          - sapm
          - signalfx
          processors:
          - memory_limiter
          - k8sattributes
          - batch
          - resourcedetection
          - resource
          receivers:
          - otlp
          - jaeger
          - smartagent/signalfx-forwarder
          - zipkin
      telemetry:
        metrics:
          address: 127.0.0.1:8889

Best Regards Achim

dmitryax commented 1 year ago

@AchimGrolimund Thank you. This is coming from smartagent/openshift-cluster, so it's unrelated to this issue and has to be solved separately. @jvoravong, can you please follow up on this? I'm not sure if we have an OTel-native receiver to replace it with.

dmitryax commented 1 year ago

Looks like the k8scluster receiver supports scraping additional OpenShift metrics (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sclusterreceiver#openshift), but it should be run separately as a 1-replica deployment. @AchimGrolimund, did you try it by chance?

Borrelworst commented 1 year ago

Just to add: in the case of Azure you will not be able to upgrade from 1.25 to 1.26 while the agent is still querying the v2beta2 autoscaler API. Because Azure prevents upgrading when deprecated APIs are still in use, the upgrade fails. You either have to force the upgrade, or remove the SignalFx agent, wait for 12 hours, and then try again.

It would be nice if the agent checked the Kubernetes version: if it is higher than 1.25, it should not monitor the /apis/autoscaling/v2beta2/horizontalpodautoscalers API endpoint (one way to do such a check is sketched below).
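
A minimal sketch of that kind of check, assuming a discovery client built from the same rest.Config the collector already uses: ask the API server which autoscaling versions it serves and only fall back to v2beta2 when v2 is absent.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List every API group and version the server advertises.
	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}

	useV2 := false
	for _, g := range groups.Groups {
		if g.Name != "autoscaling" {
			continue
		}
		for _, v := range g.Versions {
			if v.Version == "v2" {
				useV2 = true
			}
		}
	}

	if useV2 {
		fmt.Println("watch autoscaling/v2 HorizontalPodAutoscaler")
	} else {
		fmt.Println("fall back to autoscaling/v2beta2 HorizontalPodAutoscaler")
	}
}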

salapatt commented 1 year ago

The customer xxx updated the Splunk OTC agent to version 0.77.0 and still gets the same error messages.

W0522 06:11:24.226426 1 reflector.go:533] k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource
E0522 06:11:24.226454 1 reflector.go:148] k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: Failed to watch *v2beta1.HorizontalPodAutoscaler: failed to list *v2beta1.HorizontalPodAutoscaler: the server could not find the requested resource

jvoravong commented 1 year ago

Update on Deprecated Endpoint Removal:

Additional Context: