kubernetes-sigs / prometheus-adapter

An implementation of the custom.metrics.k8s.io API using Prometheus
Apache License 2.0

Kubectl --raw reporting an unknown metric even though it shows up in the list of known metrics #641

Open evin-bz opened 6 months ago

evin-bz commented 6 months ago

What happened?: The HPA reports status <unknown> for one metric, while the other metrics work fine:

  "error_rate_metric" on Ingress/my-ingress (target value):             <unknown> / 1
...
  Warning  FailedGetObjectMetric  83s (x95 over 25m)    horizontal-pod-autoscaler  unable to get metric error_rate_metric: Ingress on my-namespace my-ingress/unable to fetch metrics from custom metrics API: the server could not find the metric error_rate_metric for ingresses.networking.k8s.io my-ingress

What did you expect to happen?: The custom metric should report back with at least 1, given the clamp_min in the query being used.

Please provide the prometheus-adapter config:

The config for this metric is fairly simple and, thanks to the clamp_min, should in theory always return SOME value:

  - metricsQuery: clamp_min(round(sum(rate(<<.Series>>{<<.LabelMatchers>>,status=~"^5.."}[1m])) or vector(0.00001) / sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m]) ), 0.01) * 100, 1)
    resources:
      template: <<.Resource>>
    name:
      as: error_rate_metric
    seriesFilters: []
    seriesQuery: '{__name__="nginx_ingress_controller_requests",ingress="my-ingress",namespace!=""}'
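For reference, here is a rough sketch (in Python, as a stand-in for the adapter's actual Go templating) of how the placeholders in that metricsQuery get expanded for a request against this ingress: `<<.Series>>` becomes the matched series name and `<<.LabelMatchers>>` the matchers scoping the query to the requested object. The exact matcher set the adapter would emit is an assumption here; note also that this metricsQuery never uses `<<.GroupBy>>`.

```python
# Illustrative substitution only -- the adapter really uses Go templates.
metrics_query = (
    'clamp_min(round(sum(rate(<<.Series>>{<<.LabelMatchers>>,status=~"^5.."}[1m]))'
    ' or vector(0.00001)'
    ' / sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])), 0.01) * 100, 1)'
)

substitutions = {
    "<<.Series>>": "nginx_ingress_controller_requests",
    # Assumed matchers for an Object metric scoped to my-ingress:
    "<<.LabelMatchers>>": 'namespace="my-namespace",ingress="my-ingress"',
}

query = metrics_query
for placeholder, value in substitutions.items():
    query = query.replace(placeholder, value)

print(query)
```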

Please provide the HPA resource used for autoscaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  maxReplicas: 1
  metrics:
  - object:
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: my-ingress
      metric:
        name: nginx_ingress_controller_requests_rate_my_ingress_ingress
      target:
        averageValue: "75"
        type: AverageValue
        value: "0"
    type: Object
  - object:
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: my-ingress
      metric:
        name: nginx_ingress_controller_response_duration_p95_my_ingress_ingress
      target:
        type: Value
        value: "7"
    type: Object
  - object:
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: my-ingress
      metric:
        name: error_rate_metric
      target:
        type: Value
        value: "1"
    type: Object
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment

Please provide the HPA status:

Conditions:
  Type            Status  Reason                 Message
  ----            ------  ------                 -------
  AbleToScale     True    SucceededGetScale      the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetObjectMetric  the HPA was unable to compute the replica count: unable to get metric error_rate_metric: Ingress on my-namespace my-ingress/unable to fetch metrics from custom metrics API: the server could not find the metric error_rate_metric for ingresses.networking.k8s.io my-ingress
  ScalingLimited  False   DesiredWithinRange     the desired count is within the acceptable range

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

Verbose logging in the adapter shows the following when the HPA requests the data:

I0215 22:22:39.231327       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/custom.metrics.k8s.io/v1beta1/namespaces/my-namespace/ingresses.networking.k8s.io/my-ingress/error_rate_metric" latency="1.923735ms" userAgent="kube-controller-manager/v1.25.11 (linux/arm64) kubernetes/8cfcba0/system:serviceaccount:kube-system:horizontal-pod-autoscaler" audit-ID="a-b-c-d-e" srcIP="172.1.1.1:47017" resp=404

Other logs were present but not relevant to this error-rate metric failing.

Anything else we need to know?:

Querying Prometheus with what I expect the adapter translates this to does return data; to be clear, though, the result has no labels:

# query: 

clamp_min(
    round(
        sum(
            rate(nginx_ingress_controller_requests{ingress="my-ingress",namespace!="",status=~"^5.."}[1m]) 
            ) or vector(0.00001)
    /
        sum(
            rate(nginx_ingress_controller_requests{ingress="my-ingress",namespace!=""}[1m]) 
            )
    , 0.01) * 100, 
1) 

# result:
{}   - 1
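That empty label set may be significant: assuming the adapter associates each returned sample with a Kubernetes object via the resource labels on the sample (which is what `<<.GroupBy>>` normally preserves), a `{}` result cannot be matched back to my-ingress, and the lookup would fall through to NotFound. A toy sketch of that hypothesis, not the adapter's real code:

```python
# Toy model of the assumed association step: look for a sample whose labels
# identify the requested object.
def find_sample_for_object(samples, namespace, resource_label, name):
    """Return the value of the sample matching the requested object, or None."""
    for labels, value in samples:
        if labels.get("namespace") == namespace and labels.get(resource_label) == name:
            return value
    return None

# A query grouped `by (namespace, ingress)` would yield labelled samples:
labelled = [({"namespace": "my-namespace", "ingress": "my-ingress"}, 1.0)]
# The query above aggregates every label away, yielding {} => 1:
unlabelled = [({}, 1.0)]

print(find_sample_for_object(labelled, "my-namespace", "ingress", "my-ingress"))    # 1.0
print(find_sample_for_object(unlabelled, "my-namespace", "ingress", "my-ingress"))  # None
```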

When querying via kubectl's raw API paths, I can see that this named metric does exist:

❯ kubectl --context=cluster-context get --raw '/apis/custom.metrics.k8s.io/v1beta1' | jq . | grep error_rate_metric
      "name": "jobs.batch/error_rate_metric",
      "name": "prometheuses.monitoring.coreos.com/error_rate_metric",
      "name": "pods/error_rate_metric",
      "name": "services/error_rate_metric",
      "name": "ingresses.networking.k8s.io/error_rate_metric",
      "name": "namespaces/error_rate_metric",

However, when I attempt to query it, I get a NotFound:

❯ kubectl --context=cluster-context get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/my-namespace/Ingress/my-ingress/error_rate_metric" | jq .
Error from server (NotFound): the server could not find the metric error_rate_metric for Ingress my-ingress
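One thing worth noting about the raw request itself: the custom metrics API addresses objects by the lowercase plural (optionally group-qualified) resource, as in the adapter's httplog line above, rather than by the kind `Ingress`. A small sketch of the path shape, assuming v1beta1 of the API:

```python
# Builds the object-scoped custom-metrics API path, matching the URI visible
# in the adapter's -v=6 log (lowercase plural resource, not the kind).
def custom_metric_path(namespace, resource, name, metric):
    return (f"/apis/custom.metrics.k8s.io/v1beta1/namespaces/{namespace}"
            f"/{resource}/{name}/{metric}")

print(custom_metric_path("my-namespace", "ingresses.networking.k8s.io",
                         "my-ingress", "error_rate_metric"))
```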

I expect this to at the very least show 1.

This issue may be related to https://github.com/kubernetes-sigs/prometheus-adapter/issues/150; however, the fixes there do not seem to have helped.

Environment:

dashpole commented 6 months ago

/assign @dgrisonnet
/triage accepted