kubernetes-sigs / prometheus-adapter

An implementation of the custom.metrics.k8s.io API using Prometheus
Apache License 2.0

Correct Configuration Fails to Provide Expected Custom Metrics in EKS #663

Open wuyudian1 opened 3 months ago

wuyudian1 commented 3 months ago

What happened?: We deployed identical Prometheus and Prometheus-Adapter charts to an Alibaba Cloud ACK cluster and an AWS EKS cluster. The Prometheus and Prometheus-Adapter configurations are the same in both Kubernetes clusters. The Prometheus scrape configuration is as follows:

job_name: basicai-business-queue-wait
metrics_path: /metrics/prometheus
scheme: http
scrape_interval: 30s
honor_labels: true
kubernetes_sd_configs:
  - role: service
    namespaces:
      names:
        - basicai-backend
        - basicai-stage-backend
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    regex: dataset
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: 'kubernetes_namespace'
    action: replace
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    target_label: 'kubernetes_deployment'
    action: replace
  - source_labels: [__meta_kubernetes_service_port_number]
    regex: 80
    action: keep
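
To make it easier to check which target labels this scrape job should produce, here is a minimal Python sketch of the five relabel rules above. It is a simplified model (anchored regex match, `;` separator, default `$1` replacement only), not the Prometheus implementation, and the example service labels are assumptions chosen to satisfy the `keep` rules.

```python
import re

# The relabel_configs above as plain dicts; regex defaults to "(.*)".
RULES = [
    {"action": "labelmap", "regex": r"__meta_kubernetes_service_label_(.+)"},
    {"action": "keep", "regex": "dataset",
     "source_labels": ["__meta_kubernetes_service_label_app_kubernetes_io_component"]},
    {"action": "replace", "target_label": "kubernetes_namespace",
     "source_labels": ["__meta_kubernetes_namespace"]},
    {"action": "replace", "target_label": "kubernetes_deployment",
     "source_labels": ["__meta_kubernetes_service_label_app_kubernetes_io_component"]},
    {"action": "keep", "regex": "80",
     "source_labels": ["__meta_kubernetes_service_port_number"]},
]


def relabel(labels, rules=RULES):
    """Return the target's final labels, or None if a keep rule drops it."""
    labels = dict(labels)
    for rule in rules:
        regex = re.compile(rule.get("regex", "(.*)"))
        if rule["action"] == "labelmap":
            # Copy every label whose name matches to the captured name.
            for name, value in list(labels.items()):
                m = regex.fullmatch(name)
                if m:
                    labels[m.group(1)] = value
            continue
        joined = ";".join(labels.get(n, "") for n in rule["source_labels"])
        m = regex.fullmatch(joined)
        if rule["action"] == "keep" and m is None:
            return None
        if rule["action"] == "replace" and m is not None:
            if m.group(1):
                labels[rule["target_label"]] = m.group(1)
            else:
                labels.pop(rule["target_label"], None)
    # Labels starting with "__" are dropped after relabeling.
    return {k: v for k, v in labels.items() if not k.startswith("__")}


# Hypothetical discovered service (assumed values, not taken from either cluster).
example_service = {
    "__meta_kubernetes_namespace": "basicai-backend",
    "__meta_kubernetes_service_label_app_kubernetes_io_component": "dataset",
    "__meta_kubernetes_service_port_number": "80",
}
final_labels = relabel(example_service)
```

Under these assumptions the target keeps `kubernetes_namespace` and `kubernetes_deployment`, both of which the adapter rule relies on. None of the relabel rules set a `type` label, so `type="dataset-upload"` has to come from the scraped application itself.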

The values.yaml for the Prometheus-Adapter chart is as follows:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
   requests:
     cpu: 100m
     memory: 128Mi
   limits:
     cpu: 100m
     memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])
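
One way to compare the two clusters is to run the rule's `seriesQuery` selector directly against each Prometheus server through the `/api/v1/series` HTTP endpoint. The sketch below only builds the request URL. The base URL is assumed from the `prometheus.url` and `prometheus.port` values above and may differ in your environment.

```python
from urllib.parse import urlencode

# seriesQuery copied from the adapter rule above.
SERIES_QUERY = (
    '{__name__=~"basicai_job_replica_scale_percent",container!="POD",'
    'kubernetes_namespace!="",type="dataset-upload"}'
)


def series_url(base_url, selector=SERIES_QUERY):
    """Build a GET URL for the Prometheus /api/v1/series endpoint."""
    query = urlencode({"match[]": selector})
    return f"{base_url.rstrip('/')}/api/v1/series?{query}"


# Assumed in-cluster address, taken from the values.yaml above.
url = series_url("http://prometheus-server:80")
```

Fetching that URL from inside each cluster (for example with `curl`) and comparing the returned `data` arrays may show whether the EKS Prometheus has the series with all the labels the selector requires.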

In the Alibaba Cloud ACK cluster, the Prometheus-Adapter correctly provides custom metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "deployments.apps/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "jobs.batch/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
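
For long discovery responses, a small helper can report which resources expose a given metric instead of reading the JSON by eye. This is a sketch: the function name is hypothetical, and the sample document is a trimmed copy of the ACK output above.

```python
def resources_for_metric(discovery, metric):
    """List the resources that expose `metric` in an APIResourceList dict."""
    found = []
    for entry in discovery.get("resources", []):
        # Names have the form "<resource>/<metric>".
        resource, _, name = entry["name"].rpartition("/")
        if name == metric:
            found.append(resource)
    return found


# Trimmed copy of the ACK discovery output shown above.
ack_discovery = {
    "kind": "APIResourceList",
    "groupVersion": "custom.metrics.k8s.io/v1beta1",
    "resources": [
        {"name": "deployments.apps/upload_job_replica_scale_percent_dataset"},
        {"name": "namespaces/upload_job_replica_scale_percent_dataset"},
        {"name": "jobs.batch/upload_job_replica_scale_percent_dataset"},
    ],
}

ack_resources = resources_for_metric(
    ack_discovery, "upload_job_replica_scale_percent_dataset"
)
```

To use it against a live cluster, the output of the `kubectl get --raw` command can be parsed with `json.load` and passed in as `discovery`. An empty list means the metric is not registered.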

However, in the EKS cluster, the Prometheus-Adapter provides a large number of default metrics, but does not include the expected 'upload_job_replica_scale_percent_dataset':

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | head -n 50
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/authentication_duration_seconds_sum",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    .....
    .....
    .....

What did you expect to happen?: Prometheus-Adapter should provide the same custom metrics in the AWS EKS cluster as it does in the Alibaba Cloud ACK cluster.

Please provide the prometheus-adapter config:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
   requests:
     cpu: 100m
     memory: 128Mi
   limits:
     cpu: 100m
     memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])

dgrisonnet commented 2 months ago

/triage accepted
/help

k8s-ci-robot commented 2 months ago

@dgrisonnet: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/prometheus-adapter/issues/663):

>/triage accepted
>/help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.