prometheus-community / yet-another-cloudwatch-exporter

Prometheus exporter for AWS CloudWatch - Discovers services through AWS tags, gets CloudWatch metrics data and provides them as Prometheus metrics with AWS tags as labels
Apache License 2.0

[BUG] DNSQueries metric returns different values than Cloudwatch #618

Open r0bj opened 2 years ago

r0bj commented 2 years ago

Current Behavior

The DNSQueries metric returns different values than CloudWatch. Let's test the DNSQueries metric for a particular "HostedZoneId".

config.yml:

apiVersion: v1alpha1
discovery:
  jobs:
  - type: route53
    regions:
    - us-east-1
    metrics:
    - name: DNSQueries
      statistics:
      - Sum
      period: 60

The values reported by the exporter are between 1k and 2.75k:

[Screenshot of the exporter values, taken 2022-07-09 at 18:12:23]

Let's check CloudWatch directly. metric.json:

[
    {
        "Id": "my_dns_queries_id",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Route53",
                "MetricName": "DNSQueries",
                "Dimensions": [
                    {
                        "Name": "HostedZoneId",
                        "Value": "Z080388425YHH2ESICLAB"
                    }
                ]
            },
            "Period": 60,
            "Stat": "Sum"
        },
        "ReturnData": true
    }
]

$ aws cloudwatch get-metric-data --metric-data-queries file://./metric.json --start-time 2022-07-10T00:55:00Z --end-time 2022-07-10T01:10:00Z
{
    "MetricDataResults": [
        {
            "Id": "my_dns_queries_id",
            "Label": "DNSQueries",
            "Timestamps": [
                "2022-07-10T01:09:00+00:00",
                "2022-07-10T01:08:00+00:00",
                "2022-07-10T01:07:00+00:00",
                "2022-07-10T01:06:00+00:00",
                "2022-07-10T01:05:00+00:00",
                "2022-07-10T01:04:00+00:00",
                "2022-07-10T01:03:00+00:00",
                "2022-07-10T01:02:00+00:00",
                "2022-07-10T01:01:00+00:00",
                "2022-07-10T01:00:00+00:00",
                "2022-07-10T00:59:00+00:00",
                "2022-07-10T00:58:00+00:00",
                "2022-07-10T00:57:00+00:00",
                "2022-07-10T00:56:00+00:00",
                "2022-07-10T00:55:00+00:00"
            ],
            "Values": [
                33.0,
                3882.0,
                4296.0,
                4439.0,
                4273.0,
                4437.0,
                4675.0,
                5112.0,
                4443.0,
                4202.0,
                3842.0,
                4289.0,
                4143.0,
                5414.0,
                4376.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}

The values in CloudWatch are roughly 40% higher.

Expected Behavior

The DNSQueries metric returns the same values as CloudWatch.

cristiangreco commented 2 years ago

Hi @r0bj, thanks for filing this issue! Would you please be able to run yace with the -debug flag so that it logs AWS requests and responses? It'd be good to compare those with the aws-cli output.

r0bj commented 2 years ago

@cristiangreco I've used the --debug parameter, and I think I have a hypothesis for why the values reported by yet-another-cloudwatch-exporter are usually lower than what CloudWatch returns.

Below are a few examples of measurement iterations with --debug enabled.

Measurement 1:

      <member>
        <Timestamps>
          <member>2022-07-11T22:00:00Z</member>
          <member>2022-07-11T21:59:00Z</member>
          <member>2022-07-11T21:58:00Z</member>
          <member>2022-07-11T21:57:00Z</member>
          <member>2022-07-11T21:56:00Z</member>
        </Timestamps>
        <Values>
          <member>2198.0</member>
          <member>4118.0</member>
          <member>4591.0</member>
          <member>4487.0</member>
          <member>4443.0</member>
        </Values>
        <Label>Z080388425YHH2ESICLAB</Label>
        <Id>id_8757155226067830732</Id>
        <StatusCode>Complete</StatusCode>
      </member>

Measurement 2:

      <member>
        <Timestamps>
          <member>2022-07-11T22:02:00Z</member>
          <member>2022-07-11T22:01:00Z</member>
          <member>2022-07-11T22:00:00Z</member>
          <member>2022-07-11T21:59:00Z</member>
          <member>2022-07-11T21:58:00Z</member>
        </Timestamps>
        <Values>
          <member>2450.0</member>
          <member>4300.0</member>
          <member>4821.0</member>
          <member>4118.0</member>
          <member>4591.0</member>
        </Values>
        <Id>id_942516886148531175</Id>
        <Label>Z080388425YHH2ESICLAB</Label>
        <StatusCode>Complete</StatusCode>
      </member>

Measurement 3:

      <member>
        <Timestamps>
          <member>2022-07-11T22:04:00Z</member>
          <member>2022-07-11T22:03:00Z</member>
          <member>2022-07-11T22:02:00Z</member>
          <member>2022-07-11T22:01:00Z</member>
          <member>2022-07-11T22:00:00Z</member>
        </Timestamps>
        <Values>
          <member>2717.0</member>
          <member>4513.0</member>
          <member>4603.0</member>
          <member>4325.0</member>
          <member>4821.0</member>
        </Values>
        <Label>Z080388425YHH2ESICLAB</Label>
        <Id>id_267650114760936840</Id>
        <StatusCode>Complete</StatusCode>
      </member>

It seems that yet-another-cloudwatch-exporter just takes the value from the last available time period of every measurement, so:

Measurement 1: 2198.0
Measurement 2: 2450.0
Measurement 3: 2717.0

Unfortunately, this is not accurate, because the value for the last available time period is always lower (at least for Sum). The time period boundaries of CloudWatch and the exporter are not in sync, so the newest period appears to be still incomplete when it is queried: the 22:00 datapoint is 2198.0 in Measurement 1 but 4821.0 in Measurements 2 and 3. In practice, this means that the value reported by the exporter is almost always smaller than what CloudWatch reports.
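
The effect can be checked against the debug output by looking for timestamps whose Sum differs between measurements. The following is a minimal Python sketch, not exporter code: the datapoints are copied from the three measurements above, while the list-of-dicts layout and the function name `revised_datapoints` are hypothetical and exist only for this illustration.

```python
# Datapoints copied from the three debug responses above, keyed by timestamp.
measurements = [
    {"2022-07-11T22:00:00Z": 2198.0, "2022-07-11T21:59:00Z": 4118.0,
     "2022-07-11T21:58:00Z": 4591.0, "2022-07-11T21:57:00Z": 4487.0,
     "2022-07-11T21:56:00Z": 4443.0},
    {"2022-07-11T22:02:00Z": 2450.0, "2022-07-11T22:01:00Z": 4300.0,
     "2022-07-11T22:00:00Z": 4821.0, "2022-07-11T21:59:00Z": 4118.0,
     "2022-07-11T21:58:00Z": 4591.0},
    {"2022-07-11T22:04:00Z": 2717.0, "2022-07-11T22:03:00Z": 4513.0,
     "2022-07-11T22:02:00Z": 4603.0, "2022-07-11T22:01:00Z": 4325.0,
     "2022-07-11T22:00:00Z": 4821.0},
]


def revised_datapoints(measurements):
    """Return {timestamp: [distinct values in order seen]} for every
    timestamp whose value changed between measurements."""
    seen = {}
    for measurement in measurements:
        for timestamp, value in measurement.items():
            values = seen.setdefault(timestamp, [])
            if value not in values:
                values.append(value)
    return {ts: vals for ts, vals in seen.items() if len(vals) > 1}


changed = revised_datapoints(measurements)
for timestamp in sorted(changed):
    print(timestamp, changed[timestamp])
```

With this data, 22:00 and 22:02 roughly double once a later measurement covers them, and 22:01 also moves slightly (4300.0 to 4325.0), while 21:58 and 21:59 stay the same.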

In this case, yet-another-cloudwatch-exporter should probably use values from the previous (last - 1) available time period.
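
As a rough illustration of that suggestion, here is a minimal Python sketch that skips the newest datapoint and returns the one before it. It is not how the exporter is implemented; the function name `pick_previous_period` is hypothetical, and it assumes all timestamps are UTC strings in the same format as the debug output, so that sorting them as strings orders them by time.

```python
def pick_previous_period(timestamps, values):
    """Return the (timestamp, value) pair of the second-newest datapoint,
    or None when fewer than two datapoints are available."""
    # Assumption: identically formatted UTC timestamps sort chronologically
    # when compared as plain strings.
    points = sorted(zip(timestamps, values), reverse=True)
    if len(points) < 2:
        return None
    return points[1]


# Timestamps and values copied from Measurement 1 above.
timestamps = [
    "2022-07-11T22:00:00Z",
    "2022-07-11T21:59:00Z",
    "2022-07-11T21:58:00Z",
    "2022-07-11T21:57:00Z",
    "2022-07-11T21:56:00Z",
]
values = [2198.0, 4118.0, 4591.0, 4487.0, 4443.0]

selected = pick_previous_period(timestamps, values)
print(selected)
```

For Measurement 1 this selects the 21:59 datapoint (4118.0) instead of the 22:00 one (2198.0).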