kubernetes-sigs / prometheus-adapter

An implementation of the custom.metrics.k8s.io API using Prometheus

custom metrics rule with cloudwatch prometheus exporter #392

Closed: oussama-mechlaoui closed this issue 2 years ago

oussama-mechlaoui commented 3 years ago

I want to use CloudWatch metrics scraped into Prometheus as custom metrics. I have defined the custom rule as follows:

- metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)'
  name:
    as: '${1}_per_second'
    matches: '^(.*)_sum'
  resources:
    overrides:
      namespace:
        resource: 'namespace'
  seriesQuery: 'aws_applicationelb_request_count_sum{namespace!="",load_balancer="app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4"}'
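For reference, a rule like the one above should register the metric as aws_applicationelb_request_count_per_second (the matches/as pair strips the _sum suffix), and a request for it on a namespace expands the metricsQuery template into roughly the following PromQL (a sketch; the namespace value is illustrative):

# Expanded metricsQuery for a request scoped to the "monitoring" namespace (illustrative)
sum(rate(aws_applicationelb_request_count_sum{namespace="monitoring"}[5m])) by (namespace)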

The list of custom metrics is empty.

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": []
}

I can query the cloudwatch metric from prometheus server directly.

curl -s 'http://prometheus-k8s.monitoring:9090/api/v1/series?match[]=aws_applicationelb_request_count_sum{load_balancer="app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4",namespace!=""}' | jq .
{
  "status": "success",
  "data": [
    {
      "__name__": "aws_applicationelb_request_count_sum",
      "container": "prometheus-cloudwatch-exporter",
      "endpoint": "http",
      "exported_job": "aws_applicationelb",
      "instance": "192.168.2.127:9106",
      "job": "prometheus-cloudwatch-exporter",
      "load_balancer": "app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4",
      "namespace": "monitoring",
      "pod": "cw-prometheus-exporter-prometheus-cloudwatch-exporter-9fb7dc9ct",
      "service": "cw-prometheus-exporter-prometheus-cloudwatch-exporter"
    }
  ]
}

Can you point out the issue?

oussama-mechlaoui commented 3 years ago

I have tried the following:

- metricsQuery: '{namespace!="",__name__=~"^aws_applicationelb_request_.*",load_balancer="app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4"}'
  name:
    as: ''
    matches: '^aws_(.*)_sum$'
  resources:
    overrides:
      namespace:
        resource: 'namespace'
      pod:
        resource: 'pod'
  seriesQuery: '{namespace!="",__name__=~"^aws_applicationelb_request_.*",load_balancer="app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4"}'

The list of custom metrics is empty.

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": []
}

I have enabled debug logging for prometheus-adapter (v=10) and can see the following in the logs:

I0412 15:35:51.303854 1 api.go:74] GET http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/series?match%5B%5D=%7Bnamespace%21%3D%22%22%2C__name__%3D~%22%5Eaws_applicationelb_request_.%2A%22%2Cload_balancer%3D%22app%2Fk8s-eksmonitoring-1bb56c3370%2F16ae9f70d2e5fad4%22%7D&start=1618241691.302 200 OK
I0412 15:35:51.303924 1 api.go:93] Response Body: {"status":"success","data":[]}
I0412 15:35:51.303973 1 provider.go:279] Set available metric list from Prometheus to: [[]]

When I curl the Prometheus server directly with the same query URL generated by the adapter, I get the following results:

curl -s http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/series?match%5B%5D=%7Bnamespace%21%3D%22%22%2C__name__%3D~%22%5Eaws_applicationelb_request_.%2A%22%2Cload_balancer%3D%22app%2Fk8s-eksmonitoring-1bb56c3370%2F16ae9f70d2e5fad4%22%7D | jq .
{ "status": "success", "data": [ { "name": "aws_applicationelb_request_count_sum", "container": "prometheus-cloudwatch-exporter", "endpoint": "http", "exported_job": "aws_applicationelb", "instance": "192.168.2.127:9106", "job": "prometheus-cloudwatch-exporter", "load_balancer": "app/k8s-eksmonitoring-1bb56c3370/16ae9f70d2e5fad4", "namespace": "monitoring", "pod": "cw-prometheus-exporter-prometheus-cloudwatch-exporter-9fb7dc9ct", "service": "cw-prometheus-exporter-prometheus-cloudwatch-exporter" } ] }

Any insights on how I can use CloudWatch metrics with prometheus-adapter?

sergdpi commented 3 years ago

Metrics from AWS CloudWatch arrive in Prometheus with a time lag (check the series timestamps). The adapter does not create a resource in the APIResourceList if Prometheus returns nothing for the lookup window. The Helm chart's default for that window (the relist interval) is 1m.

I added the argument --metrics-max-age=15m. This parameter isn't documented anywhere and the Helm chart doesn't expose it. If --metrics-max-age is not set explicitly, it is set equal to --metrics-relist-interval; in that case it is as if --metrics-max-age were not set at all.
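If you deploy the adapter with the prometheus-community Helm chart, one way to pass a flag the chart has no dedicated value for is a generic extra-arguments list; a minimal values.yaml sketch, assuming your chart version exposes extraArguments:

# values.yaml fragment (assumption: the chart version in use supports extraArguments)
extraArguments:
  - --metrics-max-age=15m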

My prometheus-adapter deployment:

      containers:
      - args:
        - /adapter
        - --secure-port=6443
        - --cert-dir=/tmp/cert
        - --logtostderr=true
        - --prometheus-url=http://prometheus-operated:9090
        - --metrics-relist-interval=1m
        - --v=4
        - --config=/etc/adapter/config.yaml
        - --metrics-max-age=90m

My ConfigMap prometheus-adapter-custom:

    rules:
    - seriesQuery: 'aws_sqs_approximate_number_of_messages_visible_average{container="prometheus-cloudwatch-exporter",exported_job="aws_sqs",queue_name="mysqs"}'
      seriesFilters: []
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        as: "aws_sqs_approximate_number_of_messages_visible_average_mysqs"
        matches: ""
      metricsQuery: max_over_time(aws_sqs_approximate_number_of_messages_visible_average{<<.LabelMatchers>>}[15m])

Result:

kubectl -n monitoring get --raw="/apis/custom.metrics.k8s.io/v1beta1" | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "namespaces/aws_sqs_approximate_number_of_messages_visible_average_mysqs",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/aws_sqs_approximate_number_of_messages_visible_average_mysqs",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
❯ kubectl -n monitoring get --raw="/apis/custom.metrics.k8s.io/v1beta1/namespaces/monitoring/pods/*/aws_sqs_approximate_number_of_messages_visible_average_mysqs" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/monitoring/pods/%2A/aws_sqs_approximate_number_of_messages_visible_average_mysqs"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "monitoring",
        "name": "cloudwatch-exporter-prometheus-cloudwatch-exporter-78f5d7cjst2z",
        "apiVersion": "/v1"
      },
      "metricName": "aws_sqs_approximate_number_of_messages_visible_average_shalb",
      "timestamp": "2021-05-20T08:26:32Z",
      "value": "39",
      "selector": null
    }
  ]
}

It's working.
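For anyone wanting to drive an HPA from a metric registered this way: because the series is attached to the CloudWatch exporter pod, one option is an Object-type metric that points at that pod. A rough autoscaling/v2 sketch (the workload names and target value are hypothetical, the HPA must live in the same namespace as the described pod, and since exporter pod names change on restarts the External-metrics approach shown later in this thread is usually simpler):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sqs-worker              # hypothetical consumer workload
  namespace: monitoring         # must match the namespace of the described pod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sqs-worker            # hypothetical
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: v1
        kind: Pod
        name: cloudwatch-exporter-prometheus-cloudwatch-exporter-78f5d7cjst2z   # changes on pod restarts
      metric:
        name: aws_sqs_approximate_number_of_messages_visible_average_mysqs
      target:
        type: Value
        value: "30"             # illustrative threshold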

guidoilbaldo commented 3 years ago

Hello @sergdpi, I'm trying to run almost the same query that you did with the AWS SQS number of visible messages. Mine actually looks like this:

external:
  - seriesQuery: 'aws_sqs_approximate_number_of_messages_visible_average{queue_name="mysqs"}'
    seriesFilters: []
    resources:
      overrides:
        kubernetes_namespace:
          resource: namespace
    name:
      as: "approximate_number_of_messages_visible_mysqs"
      matches: ""
    metricsQuery: max_over_time(aws_sqs_approximate_number_of_messages_visible_average{<<.LabelMatchers>>}[15m])

Sadly, I get the following in my adapter logs

I0811 10:22:09.562542       1 api.go:74] GET http://prometheus:80/api/v1/series?match%5B%5D=aws_sqs_approximate_number_of_messages_visible_average%7Bqueue_name%3D%22mysqs%22%7D&start=1628677269.56 200 OK
I0811 10:22:09.562618       1 api.go:93] Response Body: {"status":"success","data":[]}

The GET to Prometheus returns an empty result even though the metrics are present in Prometheus itself:

aws_sqs_approximate_number_of_messages_visible_average{app="prometheus-cloudwatch-exporter", app_kubernetes_io_managed_by="Helm", chart="prometheus-cloudwatch-exporter-0.16.0", exported_job="aws_sqs", heritage="Helm", instance="10.x.x.x:9106", job="kubernetes-service-endpoints", kubernetes_name="prometheus-cloudwatch-exporter", kubernetes_namespace="monitoring", kubernetes_node="kube-worker-1", queue_name="mysqs", release="prometheus-cloudwatch-exporter"}
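One way to check whether the CloudWatch lag described earlier in this thread explains the empty response is to repeat the adapter's series lookup including the start= parameter it appends (visible in the log above), which limits the result to series with samples newer than that timestamp. A sketch, reusing the URL and timestamp from the log:

# Same series query the adapter issued, including its start= cutoff.
# An empty "data" array here, while the query without start= returns the series,
# means the latest samples are older than the cutoff (i.e. the CloudWatch delay).
curl -s 'http://prometheus:80/api/v1/series?match[]=aws_sqs_approximate_number_of_messages_visible_average{queue_name="mysqs"}&start=1628677269.56' | jq .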

So I'm not able to get those metrics into the Kubernetes external metrics API (even though I have other external metrics taken from Prometheus, e.g. the http_server_requests metrics that I use for another type of HPA):

k get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "http_server_requests_rate_10m",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "http_server_requests_seconds_count",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "http_server_requests_rate_5m",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

Does anyone have a clue why I'm not able to get those metrics to work? I'm using the latest prometheus-adapter version (0.8.4) and I've also set the --metrics-max-age=15m argument in the deployment. Thanks for your help!

guidoilbaldo commented 3 years ago

I've found the solution to my problem myself and I'm posting it here in case someone else runs into the same issues. I'll start with the Helm chart values.yaml parameters that I've added:

image:
  repository: gcr.io/k8s-staging-prometheus-adapter/prometheus-adapter
  tag: master
  pullPolicy: IfNotPresent
metricsRelistInterval: 5m
 ...
- seriesQuery: 'aws_sqs_approximate_number_of_messages_visible_average{kubernetes_namespace!="",queue_name!=""}'
  resources:
    namespaced: false
    overrides:
      kubernetes_namespace:
        resource: namespace
  metricsQuery: max_over_time(<<.Series>>{<<.LabelMatchers>>}[10m])

So, the adapter setup is fairly straightforward: I've been using the staging image, as suggested in another issue, because that way I've been able to set the metric with the namespaced: false parameter (I need it because I have a generic "monitoring" namespace that gathers all the metrics for the applications running in their own namespaces). Then I've added --metrics-relist-interval=5m because, as @sergdpi said, CloudWatch takes some minutes to converge metrics before they're available for scraping. I didn't set the other parameter mentioned above (--metrics-max-age) because it automatically defaults to the relist interval. Finally, in the HPA configuration, I did the following:

metrics:
  - type: External
    external:
      metricName: aws_sqs_approximate_number_of_messages_visible_average
      metricSelector:
        matchLabels:
          queue_name: mysqs
          kubernetes_namespace: monitoring
      targetValue: 10
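The block above uses the older autoscaling/v2beta1 external-metric fields (metricName, metricSelector, targetValue); on clusters with autoscaling/v2 the equivalent block would look roughly like this (a sketch using the same metric, labels, and threshold as above):

metrics:
  - type: External
    external:
      metric:
        name: aws_sqs_approximate_number_of_messages_visible_average
        selector:
          matchLabels:
            queue_name: mysqs
            kubernetes_namespace: monitoring
      target:
        type: Value
        value: "10"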

Now my HPA is finally able to collect metrics (remember that they live in a different namespace) from the Kubernetes external metrics API. My new metric is shown below:

{
  "name": "aws_sqs_approximate_number_of_messages_visible_average",
  "singularName": "",
  "namespaced": true,
  "kind": "ExternalMetricValueList",
  "verbs": [
    "get"
  ]
},

Thanks everyone for the help and the maintainers for their work on this tool!

fpetkovski commented 3 years ago

Great stuff, thank you for posting the solution 🎉

Can this issue be closed now?

oussama-mechlaoui commented 2 years ago

> Metrics from AWS CloudWatch arrive in Prometheus with a time lag. [...] I added the argument --metrics-max-age=15m. [...] It's working.

It's working too!

Thanks for pointing this out; it is now documented in the prometheus-adapter docs:

--metrics-max-age=: This is the max age of the metrics to be loaded from Prometheus. For example, when set to 10m, the adapter will query Prometheus for metrics since 10m ago, and only those that have data points within that period will appear in the adapter. Therefore, metrics-max-age should be equal to or larger than your Prometheus scrape interval, or your metrics will occasionally disappear from the adapter. By default, this is set to be the same as metrics-relist-interval to avoid some confusing behavior (see this PR).

Note: We recommend setting this only if you understand what is happening. For example, this setting can be useful when the scrape involves a network call, e.g. pulling metrics from AWS CloudWatch or Google Monitoring; Google Monitoring in particular sometimes delays when data shows up in its system after being sampled. This means that even if you scrape frequently, the data might not show up soon. If you configure the relist interval to a short period but without configuring this, you might not be able to see your metrics in the adapter in certain scenarios.
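Putting the two flags together for a laggy source such as CloudWatch, the adapter arguments end up looking roughly like the sketch below (the durations are illustrative and should cover the exporter's scrape interval plus the observed CloudWatch delay):

        - --metrics-relist-interval=1m   # how often the adapter refreshes its list of available series
        - --metrics-max-age=15m          # how far back it looks for samples; must span the scrape interval plus the CloudWatch lag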

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/prometheus-adapter/issues/392#issuecomment-1077580591):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.