nerdswords / yet-another-cloudwatch-exporter

Prometheus exporter for AWS CloudWatch - Discovers services through AWS tags, gets CloudWatch metrics data and provides them as Prometheus metrics with AWS tags as labels
Apache License 2.0

[BUG] Extra samples added when Cloudwatch metric stops #1486

Closed · qcattez closed this 2 days ago

qcattez commented 1 month ago

Is there an existing issue for this?

YACE version

v0.61.2

Config file

apiVersion: v1alpha1
sts-region: us-west-2
discovery:
  exportedTagsOnMetrics:
    AWS/Firehose:
      - Environment
      - Platform
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      addCloudwatchTimestamp: true
      delay: 600
      period: 300
      length: 1200
      nilToZero: true
      statistics:
        - Sum
      metrics:
        - name: DeliveryToRedshift.Records
        - name: DeliveryToRedshift.Bytes
        - name: DeliveryToRedshift.Success
        - name: DeliveryToS3.Records
        - name: DeliveryToS3.Bytes
        - name: DeliveryToS3.Success

Current Behavior

When configured with a period of 5m and a matching scraping-interval, the last sample of a metric is duplicated by YACE.

Here is the CloudWatch graph:

[Screenshot: CloudWatch graph, 2024-07-29 14:12]

And here is the Grafana graph:

[Screenshot: Grafana graph, 2024-07-29 14:13]

Querying the metric with the AWS CLI shows that no extra samples are present (the duplicated value is 143170.0 on the Grafana graph):

{
    "MetricDataResults": [
        {
            "Id": "m1",
            "Label": "DeliveryToRedshift.Records",
            "Timestamps": [
                "2024-07-29T11:15:00+00:00",
                "2024-07-29T11:10:00+00:00",
                "2024-07-29T11:05:00+00:00",
                "2024-07-29T11:00:00+00:00",
                "2024-07-29T10:55:00+00:00",
                "2024-07-29T10:50:00+00:00",
                "2024-07-29T10:45:00+00:00",
                "2024-07-29T10:40:00+00:00",
                "2024-07-29T10:35:00+00:00",
                "2024-07-29T10:30:00+00:00",
                "2024-07-29T10:25:00+00:00",
                "2024-07-29T10:20:00+00:00",
                "2024-07-29T10:15:00+00:00"
            ],
            "Values": [
                143170.0,
                4615248.0,
                2366956.0,
                3730344.0,
                3370700.0,
                3448014.0,
                3160632.0,
                2367084.0,
                3018033.0,
                2441417.0,
                2579474.0,
                1866146.0,
                2005916.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
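
For reference, a GetMetricData call roughly like the following produces output in the shape shown above (the delivery stream dimension value is a placeholder; adjust the time window as needed):

aws cloudwatch get-metric-data \
  --region us-west-2 \
  --start-time 2024-07-29T10:15:00Z \
  --end-time 2024-07-29T11:20:00Z \
  --metric-data-queries '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Firehose",
          "MetricName": "DeliveryToRedshift.Records",
          "Dimensions": [
            {"Name": "DeliveryStreamName", "Value": "<delivery-stream-name>"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      }
    }
  ]'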

Note: 1 extra sample is added with a period of 5m; when the period is 1m, 4 extra samples are generated.

Expected Behavior

No extra samples are generated.

Steps To Reproduce

We deploy YACE with Kustomize:

kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: cloudwatch-exporter

helmCharts:
  - name: yet-another-cloudwatch-exporter
    repo: https://nerdswords.github.io/helm-charts
    version: 0.38.0
    releaseName: cloudwatch-exporter
    namespace: cloudwatch-exporter
    valuesFile: values.yaml
    apiVersions:
      - monitoring.coreos.com/v1

values.yaml:

fullnameOverride: cloudwatch-exporter

serviceAccount:
  create: false
  name: cloudwatch-exporter

securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

testConnection: false

resources:
  requests:
    cpu: 10m
    memory: 128Mi
  limits:
    memory: 128Mi

extraArgs:
  scraping-interval: 300

serviceMonitor:
  enabled: true
  interval: 300s

config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        addCloudwatchTimestamp: true
        delay: 600
        period: 300
        length: 1200
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success

To generate the manifests: kustomize build . --enable-helm

Don't mind serviceAccount.create: false; we create the service account elsewhere with the IRSA annotation.

Anything else?

Hope someone can help 🙏

kgeckhart commented 1 month ago

👋 @qcattez, the behavior you are seeing is due to nilToZero: true in your configuration. Removing it, or explicitly setting it to false (the default), should fix your issue.

What's happening is that CloudWatch continues to return resources from ListMetrics for a variable length of time even when they no longer have data points. When the exporter calls GetMetricData and no data point is found in the time window, nilToZero dictates whether a zero or a nil is produced. Producing a zero gives you continuity in your graph panels, while nil leaves gaps much like CloudWatch does.
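
A minimal sketch of that change in the job config (same job as above, only the nilToZero key differs):

discovery:
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      period: 300
      length: 1200
      nilToZero: false  # or omit the key; leave gaps instead of emitting zeros when no datapoint is found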

qcattez commented 1 month ago

Hi @kgeckhart 👋

Thanks for helping out ;)

I tried setting nilToZero: false but it doesn't change the behavior I mentioned: there is still an extra sample with the same value (not zero). Furthermore, as we can see in the screenshots and graphs, even with nilToZero: true I didn't get continuity in my graph :/

Do you have another idea ?

qcattez commented 2 weeks ago

Just commenting to get some help on this 🙏

kgeckhart commented 2 weeks ago

You say you want continuity in your graphs; for that to happen you need consistent sampling, or Prometheus will mark the series as stale. Nils can be connected through panel settings if you are using Grafana. Continuity cannot be achieved if the sample is dropped, so which is more important?

qcattez commented 1 week ago

I think there's a misunderstanding: my problem is about a wrong extra sample, not about continuity. I annotated the screenshots to be more explicit.

Here are the samples from CloudWatch:

[Annotated screenshot: CloudWatch samples]

And here are the samples retrieved by YACE and shown in Grafana:

[Annotated screenshot: samples retrieved by YACE, shown in Grafana]

We can see that the data is wrong: there is a sample that shouldn't exist. nilToZero has no impact on this behavior.

Can someone reproduce it? Is it expected given my configuration?

qcattez commented 2 days ago

For anyone stumbling upon the same issue, I finally got my CloudWatch metrics without errors.

Updated fields in my values.yaml:

extraArgs:
  scraping-interval: 60

serviceMonitor:
  enabled: true
  interval: 300s

config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        delay: 600
        period: 300
        length: 300
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success

A scraping-interval of 60 instead of 300 removed the wrong extra samples at the end. A length of 300 was enough. addCloudwatchTimestamp: true was introducing wrong samples in the middle of the time series. nilToZero: true now performs as expected.
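
As a sanity check (the metric and label names below assume YACE's usual aws_<namespace>_<metric>_<statistic> / dimension_<Name> naming and are not taken from this setup), a Prometheus range query should now track the CloudWatch values with no trailing duplicate:

# adjust the metric/label names to whatever your YACE instance actually exports
aws_firehose_delivery_to_redshift_records_sum{dimension_DeliveryStreamName="<delivery-stream-name>"}[30m]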