jdix531 opened 1 year ago
I've had the very same questions myself. After some investigation, I created some slides to use internally in my organization. I'm happy to share them here, and if appreciated, I can contribute PRs to the documentation to include these explanations.
I'll wait for some feedback from the maintainers. Let me know if any of the info above is incorrect and I'll fix it. Once the info is confirmed correct, I can create the PRs to include this in the docs if desirable 😄
I'm not a maintainer but have been working with YACE/CloudWatch for a bit. @PerGon that's a great breakdown of the parameters and those tips and tricks are spot on!
There's an extra piece that YACE supports, `roundingPeriod`, which is good to consider. The GetMetricData API docs have a recommendation for performance:

> For better performance, specify StartTime and EndTime values that align with the value of the metric's Period and sync up with the beginning and end of an hour. For example, if the Period of a metric is 5 minutes, specifying 12:05 or 12:30 as EndTime can get a faster response from CloudWatch than setting 12:07 or 12:29 as the EndTime.
Since it's incredibly unlikely that your scrapes will align with this recommendation, you can set `roundingPeriod = period`. This ensures that if you scrape metrics at 7:07 PM with a `length` and `roundingPeriod` of 5m, the requested interval will be 7:00-7:05 PM. The other benefit of this approach is that data exported by YACE should be reproducible in CloudWatch. CloudWatch only graphs data according to these best practices, so if you graph a metric with a period of 5m in CloudWatch, you'll see data points on the 5m marks.
@PerGon, I love your slides, they're really useful. I opened this issue/improvement today (#985), which is somewhat related to this one. My understanding, at least when scraping normal metrics (including average), is that YACE will always pick the latest one, see: https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/c7807a770bb427f8ddb2c7becac51185fb3e8230/pkg/clients/cloudwatch/v1/client.go#L124-L125
@PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?
(Sorry for the delay, I was on vacation.)

> @PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?

Sure, that link points to this Datadog documentation
The "multiple data points per scrape" behavior seems wrong to me. It appears to make period
pointless, and also causes data loss for sum
metrics.
Quoting the screenshots above, emphasis mine:

> **How YACE deals with multiple data points per AWS CloudWatch scrape**
>
> When YACE does 1 CloudWatch scrape, it can, possibly, get various points for the same metric. But YACE will only expose one single value. So this is what it does to reach the value to expose (source code):
>
> - If "statistic" is of type "Average", YACE will calculate the average of all those data points and use that average
> - If any other "statistic" (Maximum, Sum, etc.), **use the value of the most recent data point**
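In code terms, the behavior described in those slides would look something like this sketch (my paraphrase of the described logic, not YACE's actual source):

```go
package main

import "fmt"

// collapse sketches the described behavior: given several datapoints from
// one CloudWatch scrape, expose a single value. Assumes datapoints are
// ordered newest-first, as when requesting TimestampDescending order.
func collapse(statistic string, datapoints []float64) float64 {
	if len(datapoints) == 0 {
		return 0
	}
	if statistic == "Average" {
		// Average: average all datapoints from the scrape.
		sum := 0.0
		for _, d := range datapoints {
			sum += d
		}
		return sum / float64(len(datapoints))
	}
	// Any other statistic (Maximum, Sum, ...): keep only the most
	// recent datapoint, discarding the rest.
	return datapoints[0]
}

func main() {
	fmt.Println(collapse("Average", []float64{3, 6, 9})) // 6
	fmt.Println(collapse("Sum", []float64{3, 6, 9}))     // 3 (older 6 and 9 discarded)
}
```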
Firstly, what is the point of `period` if YACE is going to merge all those datapoints down to one? Why request a lower `period` granularity than `scraping-interval` if YACE can only submit a single datapoint per `scraping-interval`?
Secondly, if YACE uses the most recent value, and discards all others, doesn't that result in data loss? If a "sum" metric counts how many of a given thing occur in a given time interval, then we must sum all datapoints to aggregate smaller time intervals into a larger one.
For example, suppose I have this config:
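(The config block itself didn't survive in this quote; a YACE job along these lines would match the description below, with the namespace and metric name purely illustrative and `-scraping-interval` set to 30:)

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS            # illustrative namespace
      regions: [us-east-1]
      period: 10               # 10s resolution per datapoint
      length: 30               # look back 30s each scrape
      metrics:
        - name: NumberOfMessagesSent   # illustrative metric
          statistics: [Sum]
```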
YACE will, every 30s, retrieve the past 30s of datapoints with a 10s resolution (3 unique datapoints)
I expect/want YACE to do this: Export all 3 datapoints, allowing my downstream dashboard tools to sum them into a total.
But based on what I'm reading, YACE actually does this: Submit the last datapoint, discard the first 2.
This means that anything occurring in seconds 0 through 20 will be ignored. It will only export the sum of things which happened from seconds 20 to 30 in the interval: the last datapoint.
Found this while searching for some info on why my metric values are unexpected.
I don't think it simply takes the last data point... with Sum, I think it sums all the datapoints.
Also, if only the last datapoint were considered, then Max/Min would always be the same, no? Or am I missing something? EDIT: maybe the statistic is already applied within each datapoint's period, so the "last data point" is itself a Sum/Max over that period. That would make sense, and then `length` serves only to specify how far in the past to look for data points.
> I don't think it simply takes the last data point.... with sum, i think it sums all the datapoints.
I wish you were correct, but I'm worried because the slides shared above directly contradict that.
And in this code, I'm seeing that it drops all datapoints except the most recent one. https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/clients/cloudwatch/v2/client.go#L145-L156
I certainly hope I am misunderstanding the code, but so far it seems like it's dropping all but the last data point.
Another concerning tidbit from the code: there is code to aggregate datapoints for `GetMetricStatistics` calls, but the `GetMetricData` codepaths do not do that!

`GetMetricStatistics` codepath:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L155-L186
`GetMetricData` codepath, single datapoint; I think this relates to the code linked in my previous comment:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L149-L151
Thanks for your reply.
I've never used Go, but if it is anything like C, I think it is taking the address of the first element in the array, no?

```go
mappedResult.Datapoint = &metricDataResult.Values[0]
```

So it depends on what the caller is doing with `mappedResult`. I'll try to get some time over the weekend to take a better look.
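To illustrate the pointer semantics in isolation (this is just a standalone Go sketch, not YACE's code): taking `&values[0]` yields a pointer to a single `float64`, not to the slice, so a caller holding only that pointer sees just the first element.

```go
package main

import "fmt"

func main() {
	// Stand-in for metricDataResult.Values (newest-first ordering assumed).
	values := []float64{42.0, 17.0, 5.0}

	// Analogous to `mappedResult.Datapoint = &metricDataResult.Values[0]`:
	// this is a *float64 pointing at one element, not a reference to the
	// whole slice. A caller that only receives this pointer cannot reach
	// the remaining datapoints.
	datapoint := &values[0]

	fmt.Println(*datapoint) // 42
}
```

So in Go, just like in C, whether the other datapoints are "lost" depends entirely on whether the caller also keeps the slice, which is why checking the caller matters here.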
Hi, I'm trying to make sense of how best to use these options to monitor my infrastructure as well as possible.
`scraping-interval` seems the most obvious: it's the frequency at which you'll refresh all metric data, i.e. hit the AWS CloudWatch API for data.
`period` and `length` confuse me a bit more, and they seem to make sense only for metrics with averages.
If you're not looking at average metrics, however, it seems that you'll be limited to the scraping interval for overall accuracy.
Can someone explain this a bit more? If I can be made to understand it, I'll put up a PR for the docs!
The relation between the 'job-default' and 'metric-default' will likely make sense once this is explained; I imagine the more atomic, user-defined value is the one that's accepted.
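To make that last question concrete, here's the kind of config I mean; I'm assuming (possibly wrongly) that the metric-level setting overrides the job-level one, and the values are made up:

```yaml
discovery:
  jobs:
    - type: AWS/EC2
      regions: [eu-west-1]
      period: 300            # job-level default
      length: 300
      metrics:
        - name: CPUUtilization
          statistics: [Average]
          period: 60         # does this metric-level value win over the job default?
          length: 60
```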
Please and thank you