Possible bug: With win_perf_counters some timestamps are late, causes graphs to show missing data

AlexHeylin commented 4 years ago

Issue: Possible bug / unexpected behaviour

Telegraf version: 1.14.2 (also 1.13.something) Inputs: win_perf_counters, also exec on some servers. Output: InfluxDB (production), file (testing)

Scenario

Telegraf is set to poll a number of win_perf_counters and send to InfluxDB every 10 seconds. Grafana is drawing graphs based on this data, some of those graphs use GROUP BY($Interval) and $interval = 10s

Symptom

First seen as every 70 seconds a data point was missing from all graphs for a host at some time periods. This doesn't always happen, and can occur on machines where this had been working fine before. Sometimes this is triggered by Telegraf service was restarted - then there is a chance this can happen. It has also suddenly started on machines without Telegraf being restarted. This affects all data from win_perf_counters.

Telegraf-InfluxDB-Grafana-01 Telegraf-InfluxDB-Grafana-02

Behaviour

Some data shown as missing on graphs, because the timestamp is later than it should be. This causes the "missing" data to fall over the boundary of the GROUP so you get a "missing" result, and the next result is made up (averaged) from two data points.

Expected behaviour

Graphs do not show data as missing, because data is written with timestamps which are 10 seconds apart (not 10 - 11 seconds apart).

Investigations done:

This doesn't just affect machines using exec input, so exclude that as cause. This can occur apparently without Telegraf service being restarted, however restarting service may cause this to occur on machines which were not previously showing this issue (was not evident on graphs)

Check timestamps on data written to Influx and file, and observe that sometimes data is written with a timestamp one second later than it should be. This seems particularly prevalent if the data is due on the 09 / 19 / 29 / 39 / 49 / 59th seconds of the minute. It may get written as x0 or occasionally x1 instead.

win_perf_counters should be reading the timestamps from the perf counters (default setting) so it's not clear how these can be out by a second. Possibly due to rounding to 1s accuracy.

InfluxDB queries which show this issue

Grafana

SELECT mean("Available_Bytes") as "Available Bytes", mean("Pool_Nonpaged_Bytes") as "Pool Nonpaged Bytes", mean("Pool_Paged_Bytes") as "Pool Paged Bytes" FROM "win_mem" WHERE host =~ /$hostname$/ AND $timeFilter GROUP BY time($interval), host ORDER BY asc

Simplest form

SELECT mean("Available_Bytes") as "Available Bytes" FROM "win_mem" WHERE host = 'MyHostName' AND time > now() - 1h and time < now() -10s GROUP BY time(10s)

Relevant settings:

Here's what I think is relevant. Can post full files if needed, but don't want to clutter the post.

[agent]
  interval = "300s"
  round_interval = false
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""

[[inputs.win_perf_counters]]
  interval = "10s"
  UseWildcardsExpansion=true

danielnelson commented 4 years ago

Timestamps are set by the input plugin, and different plugins use different methods for setting timestamp. The agent precision setting, by default 1s, is then used to round the timestamp. So, to really understand the timestamps you need to look at the particular plugin.

The win_perf_counters plugin has two modes, depending on if UsePerfCounterTime=true is set:

When set, the time is provided by the performance counters
Otherwise it is set to the current time immediately before making the query.

The exec plugin uses a time depending on which data_format you have set, some of the formats read the time from the data, while others use the current time. When using the current time the time it set when the data is read from the exec script, so it will depend on how long the script takes to write the data.

Since real time is used in both cases, and then the time is rounded, it is definitely possible for the time to round up or down by a second, just due to normal difference in how long operations take to run.

You may want to try enabling this setting UsePerfCounterTime in the plugin, or use the round_interval setting in the agent. Another method for avoiding this is to group over a time period longer than the collection interval to ensure you have some points.

I hope this helps explain the behavior you are seeing, please let me know if you have any follow up questions.

AlexHeylin commented 4 years ago

Thanks @danielnelson - I didn't have UsePerfCounterTime set, so it's defaulted to true. Based on your description, particularly that if UsePerfCounterTime=false then timestamp is set based on time before starting the query, I've set UsePerfCounterTime=false and will see if this resolves it. That should remove any variance due to time taken to refresh counters, expand wildcards, and get the data.

AlexHeylin commented 4 years ago

UsePerfCounterTime=false does not resolve this issue. That suggests that some time is being "lost" between the beginning of the interval, and the timestamp collection actually occurring, even when not reading the timestamp from the query.

Investigating the code in win_perf_counter.go, there seems to be something slightly strange going on. Please bear in mind I don't know Go, so I'm having to interpolate a bit when reading - apologies for what I get wrong!

Relevant setup:

UsePerfCounterTime=false
interval="10s"
UseWildcardsExpansion=true
CountersRefreshInterval="1m"

What I see happening in the code

Every interval Gather is called, which starts on line 344. If counters were last refreshed more than CountersRefreshInterval time ago, then refresh the counters (this uses up time since Gather was called) and then sleep for 1 second. Then either collect time.now for timestamps, or query them while collecting data. However if a counter refresh happened, we're now running one second + time taken to refresh counters late and so is the timestamp we collect (either method)

So that would explain why every CountersRefreshInterval the data is late. However what happens is every CountersRefreshInterval + interval the data is late. It seems this is because m.lastRefreshed is set to the time after the refresh has happened. Due to time taken to run refresh, this causes the refresh to actually occur not every CountersRefreshInterval, but every CountersRefreshInterval + interval. i.e. this can cause the actual occurrence of counter refresh to "process" forward by one interval each time it's refreshed. Thus for CountersRefreshInterval="1m" what effectively happens is CountersRefreshInterval="70s"

This is compounded by the use of m.lastRefreshed.Add(m.CountersRefreshInterval.Duration).Before(time.Now()) - note that Before will do < not <= which will stop a refresh occurring exactly CountersRefreshInterval after the last one completed.

So, I see three potential sources of jitter / drift here;

Jitter of data timestamps

The use of time.Sleep(time.Second) on line 366 to force a one second delay each CountersRefreshInterval. This is almost certainly the biggest culprit. Unfortunately there's no comment to say why it's there.
Processing the counter refresh before collecting the data, which delays data collection. Perhaps after the first run the counters could be refreshed in the idle time after data collection, so they're already done ready for the next collection?

Drift of counter refreshing

As written, the counter refresh will drift ahead by at least the time taken to run the refresh each time a refresh occurs.

I'm happy to have a go at rewriting this and submit a PR, but I'd value someone verifying my analysis before I do - to make sure I've understood the problem. Thanks

AlexHeylin commented 4 years ago

Results of initial testing. Removing the one second sleep seems to make no difference.

The delayed data interval varies approximately inline with CountersRefreshInterval + interval. Verified by changing CountersRefreshInterval to 25s, 60s, 88s.

UseWildcardsExpansion=false seems to make no significant difference.

Having the counter refresh occur after the data poll does seem to resolve the issue. Will continue testing.

sspaink commented 2 years ago

@AlexHeylin were you able to find a solution for this that worked for you? Are you still having this problem?

AlexHeylin commented 2 years ago

@sspaink - I'm going to have to investigate this a bit because my last message above suggests I built a custom build of telegraf.exe to test / resolve this, but the matching commit in my internal fork of this seems to only contain config updates, and the commit before that (6 days earlier) was me upgrading the telegraf.exe binary to 1.14.2. I'm confused at this point, and I'll need to try to work out what I did and when / how, and if it resolved the issue.

Looking at the data in our production system, it looks like I did resolve this - but I'm not sure how. I'll try to come back with a confirmed answer.

DStrand1 commented 2 days ago

@AlexHeylin is this still an issue? Do you have any relevant log files if so?

AlexHeylin commented 19 hours ago

Unfortunately I can't add anything further as I no longer work with this code, or even in the IT sector. Given how well it appears I investigated this, I'm genuinely sorry I can't help further because apart from anything else I'd like to know the cause and if I was correct or completely wrong.

influxdata / telegraf