nerdswords / yet-another-cloudwatch-exporter

Prometheus exporter for AWS CloudWatch - Discovers services through AWS tags, gets CloudWatch metrics data and provides them as Prometheus metrics with AWS tags as labels
Apache License 2.0

[FEATURE] Support to expose historical data points #985

Open luismy opened 1 year ago

luismy commented 1 year ago


Feature description

Context

In the context of jobs, YACE allows scraping CloudWatch metrics, and the frequency and amount of data points are controlled by the period and length settings.

All the examples in the examples folder show how to scrape different types of metrics, and the period and length always happen to be the same. I'm currently working with a client who adopted YACE to scrape metrics, and we configured it to get data points with a granularity of 1 minute and 5 minutes' worth of data.

Take the following yace-config.yaml as a concrete example:

apiVersion: v1alpha1
sts-region: eu-west-2
discovery:
  jobs:
  - type: alb
    regions:
    - eu-west-2
    period: 60
    length: 300
    addCloudwatchTimestamp: true
    dimensionNameRequirements:
    - LoadBalancer
    metrics:
    - name: TargetResponseTime
      statistics: [Average]

Expected Behaviour

We ran YACE, via docker-compose, using the configuration above, and we were expecting to see the following: one data point per minute covering the last 5 minutes, i.e. five timestamped samples per metric on each scrape.
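Roughly speaking, with period 60 and length 300 we expected the /metrics page to expose something like the sample below on every scrape (the metric and label names here are only indicative and the label set is trimmed; values and timestamps are made up):

aws_alb_target_response_time_average{dimension_LoadBalancer="app/my-alb/0123456789abcdef",region="eu-west-2"} 0.041 1700000000000
aws_alb_target_response_time_average{dimension_LoadBalancer="app/my-alb/0123456789abcdef",region="eu-west-2"} 0.038 1700000060000
aws_alb_target_response_time_average{dimension_LoadBalancer="app/my-alb/0123456789abcdef",region="eu-west-2"} 0.057 1700000120000
aws_alb_target_response_time_average{dimension_LoadBalancer="app/my-alb/0123456789abcdef",region="eu-west-2"} 0.044 1700000180000
aws_alb_target_response_time_average{dimension_LoadBalancer="app/my-alb/0123456789abcdef",region="eu-west-2"} 0.049 1700000240000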

I spent a fair amount of time trying to understand whether we had misconfigured YACE, but after enabling the debug logs, digging through the source code, and running YACE locally, I realised that's how it's currently implemented. See https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/c7807a770bb427f8ddb2c7becac51185fb3e8230/pkg/clients/cloudwatch/v1/client.go#L120-L128

I believe this is NOT a BUG but a design decision to only include the latest data point and not any historic data, as Prometheus will reject any samples that are too old.

Having said that, I believe it could be really useful to be able to include historic data points when the period is small and the length does not go too far back (making sure Prometheus still happily ingests all the samples).
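To make the difference concrete, here is a rough sketch in Go of the current behaviour versus keeping all datapoints. This is not the actual YACE code; the type and function names are made up purely for illustration:

package main

import (
	"fmt"
	"time"
)

// Datapoint is a simplified stand-in for a CloudWatch datapoint
// (the real exporter uses AWS SDK types); illustration only.
type Datapoint struct {
	Timestamp time.Time
	Value     float64
}

// latestOnly mirrors the behaviour described above: of all the
// datapoints fetched for the requested length, only the most
// recent one is kept and exposed to Prometheus.
func latestOnly(points []Datapoint) []Datapoint {
	if len(points) == 0 {
		return nil
	}
	newest := points[0]
	for _, p := range points[1:] {
		if p.Timestamp.After(newest.Timestamp) {
			newest = p
		}
	}
	return []Datapoint{newest}
}

// allDatapoints is what including historic data would amount to:
// every fetched datapoint is exposed with its original timestamp.
func allDatapoints(points []Datapoint) []Datapoint {
	return points
}

func main() {
	now := time.Now()
	// Five datapoints, as returned for period=60 / length=300.
	var points []Datapoint
	for i := 5; i >= 1; i-- {
		points = append(points, Datapoint{Timestamp: now.Add(-time.Duration(i) * time.Minute), Value: float64(i)})
	}
	fmt.Println("current behaviour:", latestOnly(points))
	fmt.Println("with historic data:", allDatapoints(points))
}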

New Feature

Implementing the approach above could be a bit controversial, as it would introduce breaking changes (or at least extra behaviour when the length is bigger than the period). Therefore, I'm proposing to introduce a new feature flag, e.g. -enable-feature=allow-multiple-datapoints or -enable-feature=allow-historical-datapoints, whereby all the data points are included. So, in the example above, rather than getting only the latest data point, the metrics page would include the 5 fetched from the CloudWatch API.

What might the configuration look like?

No extra configuration would be required, as I'm suggesting controlling this new behaviour globally via a new feature flag. However, I'm open to suggestions, as it could be useful to have fine-grained control by including an extra config setting at the job level.
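As a sketch only (the setting name below is just a placeholder), a job-level toggle could look like this:

discovery:
  jobs:
  - type: alb
    period: 60
    length: 300
    addCloudwatchTimestamp: true
    addHistoricalMetrics: true   # placeholder name for the proposed job-level setting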

Anything else?

I'll include more details (images and logs) later on. I'm happy to give it a go, although it's been a while since I've done any serious development and I don't have huge experience with golang 😅

luismy commented 1 year ago

I've forked this repo and I've been playing with the code to see if I can include historical metrics. Hopefully, I'll have something to push to a branch, and maybe a draft PR. I think it will make more sense to have the addHistoricalMetrics flag at the job level rather than as an extra feature flag, as you might only want to enable this for a specific job. I've been looking at how the addCloudwatchTimestamp setting has been implemented.

cristiangreco commented 1 year ago

@luismy I'm not sure I understand your use case. YACE scrapes CloudWatch in the background, with a frequency determined by the flag -scraping-interval (300 by default): if you lower that to 1 minute, it would match the Prometheus scrape interval and you would get the latest datapoint at every scrape.

luismy commented 1 year ago

@cristiangreco we initially did that to get the same level of metrics as CloudWatch, but we want more flexible control at the expense of metrics being "late". So, ideally, I want to keep the scraping interval high but be able to go further into the past, so that the metrics scraped by our Prometheus also include the "historical" data points. By always having only the "latest" datapoint we won't be able to achieve that.

I totally understand that this is a specific edge case when you're trying to "mirror" CloudWatch metrics with their original timestamps, but I think it would also give you flexibility if you want to expose "high resolution" metrics from CloudWatch (i.e. anything under 1 minute). With the current approach, if I have metrics produced every 15 seconds, for example, I would have to configure the scraping interval to be at most 15 seconds.

Also, from our testing, we have seen situations where, if the metric is produced every minute, setting the scraping interval to 1 minute might not be enough (usually the interval should be set to half the period of the original source, in this case 30 seconds). We have also seen situations where the CloudWatch metric is not ready at that specific time, so the next scrape takes the latest datapoint and the previous data point is missed entirely.

I know the PR I currently have requires more work, which I'm happy to complete. I just wanted to double-check with you whether you see this as a good feature to include in the main release.

gavin-jeong commented 1 year ago

I think this is worth having. In Kubernetes, setting a different scraping interval per job is quite dirty and may require running multiple YACE instances. If we can get multiple time series from a single Prometheus scrape, we can use a single YACE instance with different metric resolutions for each job in it.

luismy commented 1 year ago

If you think this feature has "legs", I'll be more than happy to progress it further this month. I have a list of things to improve (implementation-wise).

I've tested the fork/feature branch for our use case, and I can confirm that the visualizations in Grafana look exactly the same as the ones you get in AWS CloudWatch Metrics.

NickAdolf commented 6 months ago

This fits our desired usage pattern as well. We would rather save on the CloudWatch expense and scrape every 5 minutes, accepting the lag. That said, it would be ideal to get 60-second granularity from that 5-minute scrape, which realistically also means accepting out-of-order metrics on scrape.
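For what it's worth, I believe recent Prometheus versions (2.39+) can ingest out-of-order samples within a configurable window, so on the Prometheus side this might be as simple as something like the following in prometheus.yml (the window value is just an example):

storage:
  tsdb:
    out_of_order_time_window: 30m   # accept samples up to 30 minutes behind the newest one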