kubernetes-sigs / prometheus-adapter

An implementation of the custom.metrics.k8s.io API using Prometheus
Apache License 2.0

Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API #644

Open saad946 opened 6 months ago

saad946 commented 6 months ago

What happened?: I am constantly hitting this with one of my services whose HPA is configured to scale based on a custom metric. Sometimes the HPA shows AbleToScale as True and can fetch the custom metric, but most of the time it cannot. Because of that, the HPA is not able to scale the pods down.

This is our HPA description for one of the affected services.

Affected service HPA description:

  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    SucceededGetScale    the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetPodsMetric  the HPA was unable to compute the replica count: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range

The other service, which uses the same HPA configuration, does not show this error when its HPA is described. This is the HPA description from that service.

Working service HPA: the behaviour is random; in both services we observed that sometimes the adapter is able to collect the custom metric and sometimes it is not.

  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric DCGM_FI_DEV_FB_USED_AVG

What did you expect to happen?: I expected prometheus-adapter and the HPA to behave the same for both services, since both use the same configuration.

Please provide the prometheus-adapter config:

**prometheus-adapter config:**

```yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local
  port: 9090

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

rules:
  default: false
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_FB_USED_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MIN"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: min by (exported_namespace, exported_container) (round(min_over_time(<<.Series>>[1m])))
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MAX"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: max by (exported_namespace, exported_container) (round(max_over_time(<<.Series>>[1m])))
```

**When checking whether the metric exists, I got this response:**

```sh
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_FB_USED_AVG
      "name": "pods/DCGM_FI_DEV_FB_USED_AVG",
      "name": "namespaces/DCGM_FI_DEV_FB_USED_AVG",
```
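Since the metric shows up in the discovery list, a further spot check that may help is asking the adapter for the actual per-pod values in the namespace of the failing HPA. A sketch of that check, using the `development` namespace from the failing HPA below (adjust as needed):

```sh
# List the per-pod values the adapter would hand to the HPA for this metric.
# An empty "items" array here lines up with the
# "no metrics returned from custom metrics API" error seen in the HPA events.
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/development/pods/*/DCGM_FI_DEV_FB_USED_AVG" | jq .
```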

Please provide the HPA resource used for autoscaling:

**HPA yaml for both services:**

**Not working one:**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceA-memory-utilization-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: serviceA
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_FB_USED_AVG
        target:
          type: AverageValue
          averageValue: 20000
```

**Working one:**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceB-memory-utilization-hpa
  namespace: common-service-development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: serviceB
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_FB_USED_AVG
        target:
          type: AverageValue
          averageValue: 20000
```
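Another check that could narrow this down is running the adapter's metricsQuery by hand against Prometheus at the moment the failing HPA reports the error. A sketch, using the in-cluster Prometheus service from the adapter config above through a local port-forward, with an extra `exported_namespace` matcher added here just to restrict the result to the failing namespace:

```sh
# Forward the Prometheus service locally, then run the same aggregation the
# adapter uses for DCGM_FI_DEV_FB_USED_AVG, restricted to the failing namespace.
# No samples for serviceA's pods here would explain the HPA error above.
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &

curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_namespace="development",exported_pod!="",exported_container!=""}[1m])))' \
  | jq '.data.result'
```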

Please provide the HPA status:

We observe these events in both services from time to time; most of the time the adapter is able to collect the metric for serviceB but not for serviceA.

Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Warning  FailedComputeMetricsReplicas  26m (x12 over 30m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  Warning  FailedGetPodsMetric           22s (x74 over 30m)  horizontal-pod-autoscaler  unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API

And this is the HPA status: it appears both HPAs can read the memory utilization value, but when we describe them we see the issues stated earlier, where the HPA can neither collect the metric nor trigger any scaling activity.

serviceA-memory-utilization-hpa      Deployment/serviceA            19675/20k   1         2         1          14m
serviceB-memory-utilization-hpa      Deployment/serviceB            19675/20k   1         2         2          11m

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

prometheus-adapter logs
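For anyone trying to capture these, one way is to raise the adapter's klog verbosity and tail the logs while the HPA is failing. A minimal sketch, assuming the adapter runs as a Deployment named `prometheus-adapter` in the `monitoring` namespace (names vary by install):

```sh
# Append --v=6 to the adapter container's args (the deployment and namespace
# names here are assumptions; adjust them to your install), then follow the logs.
kubectl -n monitoring patch deployment prometheus-adapter --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--v=6"}]'
kubectl -n monitoring logs deploy/prometheus-adapter -f
```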

Anything else we need to know?:

Environment:

dashpole commented 6 months ago

/cc @CatherineF-dev
/assign @dgrisonnet
/triage accepted

aurifolia commented 3 months ago

(screenshot) Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.

saad946 commented 3 months ago

> (screenshot) Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.

That is not the only issue: prometheus-adapter still fails to get the GPU metric, so the HPA cannot scale the Kubernetes deployment up or down, and `kubectl describe hpa` reports `Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API`, with the metric value shown as unknown or ScalingActive becoming False.

dvp34 commented 2 months ago

Just curious, what does the raw data for DCGM_FI_DEV_GPU_UTIL{} from Prometheus look like?

mayyyyying commented 2 months ago

Something like this, @dvp34:


```json
{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia1",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia1",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen-1gpu-75455d6c96-7jcxq"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia0",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            }
        ]
    }
}
```
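One thing that stands out in this output: each GPU appears twice, once from the `nvidia-dcgm-exporter` job carrying the `exported_namespace`/`exported_pod`/`exported_container` labels, and once from the `gpu-metrics` job without them. Only the first set matches a selector like `exported_pod!=""`, so it may be worth counting how many series the adapter's seriesQuery actually returns at the moment the HPA fails. A sketch, assuming Prometheus is reachable on localhost:9090 (for example via `kubectl port-forward`):

```sh
# Evaluate the adapter's seriesQuery selector directly and count the matching
# series; zero results while the HPA is failing would line up with the
# "no metrics returned from custom metrics API" error.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_FB_USED{exported_namespace!="", exported_container!="", exported_pod!=""}' \
  | jq '.data.result | length'
```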