influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.69k stars 5.59k forks

Timestamp of captured data on SNMP plugin #3598

Closed lfdominguez closed 6 years ago

lfdominguez commented 6 years ago

I have a question about the data obtained from the router. When the SNMP plugin gets data from the routers and sends it to, for example, InfluxDB, what timestamp is put on this data? If the timestamp comes from the host where Telegraf is running and the router is slow to respond to SNMP queries, then the metric is wrong, because the time when the response arrives at the Telegraf host is different from the time when the router generated the SNMP response. I think it would be better to get the timestamp from the system time OID. What do you think?

danielnelson commented 6 years ago

It adds the time according to Telegraf after the GET/WALK completes. The problem you are describing does happen, but it seems we would still have the same issue if we used the remote SNMP agent's system time.

lfdominguez commented 6 years ago

Well, in my case, before I used Telegraf for this work I wrote a Ruby program to capture the data and the system date, and it worked very well, because the router sends that information together with the data. The problem is that I'm monitoring my ISP link: with my Ruby code and manual calculations I sometimes get the 2 Mbps that I have contracted, but with Telegraf I see that the link doesn't pass 1.8 Mbps. But I cannot make a complaint, because when they come to check (with PRTG), the figures line up correctly.

So the SNMP plugin could add an option to calculate the timestamp from the router, for example using sysUpTime:

  1. 1st request: fetch only sysUpTime
  2. 2nd request onwards: fetch the data and calculate the difference from the previous sysUpTime to get the real timestamp

All of these problems are, of course, caused by a slow response from the router or a delay of the SNMP response packet on its way to the Telegraf host. But if we use sysUpTime, this error is mitigated, because the time elapsed between the SNMP response leaving the router and arriving at the Telegraf host no longer matters: the response itself carries the correct timestamp.

What do you think about that??
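The two-step proposal above can be sketched as follows. This is only an illustration of the idea, not anything Telegraf implements: the function name `timestamp_from_uptime` and its parameters are hypothetical, and it assumes the router reports sysUpTime as a 32-bit TimeTicks value in hundredths of a second.

```python
TICKS_PER_SECOND = 100      # sysUpTime counts hundredths of a second
COUNTER_MODULUS = 2 ** 32   # TimeTicks is a 32-bit unsigned value, so it wraps


def timestamp_from_uptime(base_time, base_ticks, ticks):
    """Estimate a sample's timestamp from the router's own uptime clock.

    base_time  -- local wall-clock time (seconds) paired with the first request
    base_ticks -- sysUpTime returned by that first request
    ticks      -- sysUpTime returned together with a later sample
    """
    # The modulo keeps the elapsed value correct across one counter wrap.
    elapsed_ticks = (ticks - base_ticks) % COUNTER_MODULUS
    return base_time + elapsed_ticks / TICKS_PER_SECOND


# A sample whose sysUpTime is 1000 ticks after the baseline is stamped
# 10 seconds after the baseline, however long the response took to arrive.
print(timestamp_from_uptime(1000.0, 500, 1500))
```

Note that the baseline itself is still paired with a local clock reading, so any lag on the first request shifts every later timestamp by the same amount.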

phemmer commented 6 years ago

"The problem is that I'm monitoring my ISP link: with my Ruby code and manual calculations I sometimes get the 2 Mbps that I have contracted, but with Telegraf I see that the link doesn't pass 1.8 Mbps"

I think something else is going on here. If this were simple timing inaccuracies, then if one interval were reported lower than it actually is, the next interval would be higher than it actually is. Since you don't describe this happening, I suspect timing issues are not the cause.

lfdominguez commented 6 years ago

An example of the timing issue: suppose the router takes its samples and sends them at 5s, 12s, 17s and 24s, but Telegraf receives them at 5s, 14s, 18s and 24s. Then the data transferred between, for example, 5s and 12s is recorded by Telegraf as if it were transferred between 5s and 14s. That is the same data over more time, which gives a lower data rate: 200 kbytes in 7s is 200/7 = 28k/s, whereas 200 kbytes in 9s is 22k/s, a difference of 6k/s.
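The arithmetic in this example can be checked with a few lines. The helper `rate` is a hypothetical name used only for illustration; the figures are the ones given in the comment above.

```python
def rate(kbytes, start_s, end_s):
    """Average transfer rate in kbytes per second over one interval."""
    return kbytes / (end_s - start_s)


# Same 200 kbytes, measured over the router's interval and over Telegraf's.
router_rate = rate(200, 5, 12)     # interval as seen by the router (7 s)
telegraf_rate = rate(200, 5, 14)   # interval as recorded by Telegraf (9 s)

print(round(router_rate, 2), round(telegraf_rate, 2),
      round(router_rate - telegraf_rate, 2))
```

The unrounded values are slightly above the 28 and 22 quoted in the comment, so the gap is a little over 6 kbytes/s.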

lfdominguez commented 6 years ago

In my router the SNMP response always has 1 or 2 seconds of lag, so by using the time that the router reports in the SNMP response, the network lag no longer matters.

phemmer commented 6 years ago

Are you saying your router updates the counters on a specified interval, and caches the values on subsequent requests, thus telegraf is not fetching an instantaneous value?

In regards to your example, I'm not completely understanding the 5s/12s/14s thing, but what I was saying in my previous comment is that even if one interval is low, the next interval would be high. The data doesn't just go missing.

Example:

> select v from test
name: test
time        v
----        -
0           0
12000000000 200000
20000000000 400000
30000000000 600000

This simulates receiving 200kb every 10 seconds, but that second point was recorded as 12s instead of 10s.

> select derivative(v, 1s) from test
name: test
time        derivative
----        ----------
12000000000 16666.666666666668
20000000000 25000
30000000000 20000

Notice how the first point is low (less than 20kb/s), but the second point is high (over 20kb/s). If the timestamps were simply being recorded at the wrong time, this is what you would see. But you are saying that you never see > 2mb/s, thus timestamps cannot be your issue.
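The same calculation can be reproduced outside InfluxDB. The sketch below is a plain re-implementation of a per-second rate between consecutive points, written for illustration only; the function name `derivative` and the variable names are assumptions, and the data is the series from the example above.

```python
NANOS_PER_SECOND = 1_000_000_000


def derivative(points, unit_ns=NANOS_PER_SECOND):
    """Rate of change between consecutive (time_ns, value) points, per unit."""
    result = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        elapsed_units = (t1 - t0) / unit_ns
        result.append((t1, (v1 - v0) / elapsed_units))
    return result


points = [
    (0, 0),
    (12_000_000_000, 200_000),  # recorded 2 s late
    (20_000_000_000, 400_000),
    (30_000_000_000, 600_000),
]

per_interval = derivative(points)

# Rate over the whole window: the late timestamp no longer matters.
(first_t, first_v), (last_t, last_v) = points[0], points[-1]
overall = (last_v - first_v) / ((last_t - first_t) / NANOS_PER_SECOND)

for t, d in per_interval:
    print(t, d)
print("overall", overall)
```

The low first interval and the high second interval cancel out, so the rate over the full 30 seconds is the true 20 kb/s.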

lfdominguez commented 6 years ago

Thanks for the interest! Let me explain in a simple way. Suppose Telegraf is configured to collect with an interval of 10s.

  1. Start at 13:00:00
  2. Telegraf sends a request
     2.1 The router says in the SNMP response that its system time is 13:00:10
  3. Telegraf receives the response, showing a consumption of 200k for example
     3.1 But the router sends that response with 2 seconds of lag, for example (that is the average for my router, it's a very old SHDSL router)
     3.2 And Telegraf receives the SNMP packet at 13:00:12
  4. Telegraf saves to InfluxDB that 200k was consumed at 13:00:12, when in reality this 200k was consumed at 13:00:10

So if the router has an average lag of 2 seconds, Telegraf doesn't see it, because it saves the data with the timestamp at which the SNMP response arrives at the server.

Sorry if this is hard to understand, it's my bad English.

lfdominguez commented 6 years ago

The router always responds with a delay because it is slow, so the time is always offset.

brandond commented 6 years ago

It sounds like the ask here is to take RTT into consideration when setting the timestamp - for example, if the request is sent at 13:00:00, and the response is received at 13:00:04, then set the timestamp to 13:00:02. This would naively expect that the upstream and downstream latencies are identical, and that the delay is purely due to the network (and not due to something like an underpowered router)... but it would probably be more accurate than timestamping it at the moment the response is received.
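A minimal sketch of this midpoint idea, under the same naive assumption that the request and response legs take equal time. The function name `midpoint_timestamp` is hypothetical and the calendar date is arbitrary; only the times of day come from the comment above.

```python
from datetime import datetime


def midpoint_timestamp(sent_at, received_at):
    """Timestamp halfway between sending a request and receiving the reply."""
    round_trip = received_at - sent_at
    return sent_at + round_trip / 2


sent = datetime(2018, 1, 1, 13, 0, 0)       # request leaves Telegraf
received = datetime(2018, 1, 1, 13, 0, 4)   # response arrives

print(midpoint_timestamp(sent, received).time())
```

If most of the delay is spent inside a slow router rather than on the network, the true sample time could fall anywhere in the round trip, so the midpoint is only an estimate.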

lfdominguez commented 6 years ago

But put simply, the router gives you the time at which it got the data, so the router is the one that knows when the data was captured.

danielnelson commented 6 years ago

This simple method would give the router's time when we start querying. This wouldn't be the time that the other fields are sampled at, though, so I don't see it being more accurate than using Telegraf's time after the field value is received. It will be just as far off, but in the opposite direction, and we have to do the extra network call.

Since the time is consistently behind, each interval should contain the right delta (over time). You might need to aggregate over a larger period to reduce noise.
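Both points can be illustrated with made-up numbers. The sample values below are assumptions chosen for the illustration (200 units every 10 seconds, a 2-second lag, about a second of jitter), and the helper names are hypothetical.

```python
def interval_rates(samples):
    """Rate between each pair of consecutive (time_s, counter) samples."""
    return [(v1 - v0) / (t1 - t0)
            for (t0, v0), (t1, v1) in zip(samples, samples[1:])]


def window_rate(samples):
    """Rate over the whole window, from the first sample to the last."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


true_samples = [(0, 0), (10, 200), (20, 400), (30, 600)]

# Every timestamp is 2 s late: the deltas between samples are unchanged.
lagged = [(t + 2, v) for t, v in true_samples]

# The lag also varies by about a second: single intervals become noisy.
jittered = [(2, 0), (13, 200), (21, 400), (32, 600)]

print(interval_rates(true_samples))
print(interval_rates(lagged))
print(interval_rates(jittered))
print(window_rate(jittered))
```

With a constant lag the per-interval rates match the true ones exactly. With jitter the individual intervals swing above and below, while the rate over the larger window is unaffected as long as the first and last timestamps carry the same lag.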

lfdominguez commented 6 years ago

Well, whether the time is set into the response before or after the data is collected is known only by Huawei (that's my router's vendor). But do you really think that the router's time is less accurate than the time when Telegraf receives the response?

danielnelson commented 6 years ago

It seems to me that the router would report too early of a time while Telegraf reports too late of a time, but they should both be equally inaccurate. Unless we did something like @brandond suggested, it wouldn't be an improvement in accuracy, plus it would be slightly more expensive, unreliable, and complicated.

danielnelson commented 6 years ago

I'm going to close this issue, I don't think we should attempt this change.