influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

inputs.cloudwatch: query metrics from [$delay:$now] interval #15415

Open redbaron opened 4 months ago

redbaron commented 4 months ago

Use Case

CloudWatch can be very unreliable at delivering metrics on time: delays range from minutes to half an hour in some cases. To avoid gaps in collected metrics, the delay param has to be set far enough back in time to cover possible delays. Telegraf then calculates the delayed timestamp as $delay = $now - delay.

Telegraf then queries metrics in the range $delay:$delay+period every interval. This makes metrics always delayed, even if CloudWatch has fresh data. In other words, delay is set to cover the worst case, but doing so penalizes the best case.

Expected behavior

It would be good if the cloudwatch plugin could be configured to fetch metrics in the $delay:$now interval. This would allow $delay to be set far enough back in time to cover occasional late metric delivery, while still getting the freshest possible data if CloudWatch has it.

Obviously, the same CloudWatch data point will be fetched multiple times, which can incur costs, but it is a tradeoff Telegraf users might be willing to make.

Actual behavior

Telegraf queries a single data point in the $delay:$delay+period range, thus missing fresher data even if it exists.

Additional info

No response

powersj commented 4 months ago

Hi @redbaron,

Today we have ~three~ two time-based config options, period and delay:

  ## Requested CloudWatch aggregation Period (required)
  ## Must be a multiple of 60s.
  period = "5m"

  ## Collection Delay (required)
  ## Must account for metrics availability via CloudWatch API
  delay = "5m"

Correct me if I am wrong, but your proposal would change how additional windows are calculated. Rather than basing them off a previous interval's last end, you want to always go from the start-period like we do for the first interval?

I'm looking at the code for updateWindow.

redbaron commented 4 months ago

My proposal is to always have windowEnd = $now. I don't fully understand how windowStart is calculated there, but it should be roughly $now - delay at every gather.

powersj commented 4 months ago

Ah ok thanks for clarifying. I assume we could have a window_end_mode option, where we have your new option "now" and a default option with "delay"?

Something you would be willing to put up a PR for?

redbaron commented 4 months ago

> Something you would be willing to put up a PR for?

Hi, unfortunately I won't be able to dedicate time to it at the moment.