sflow / host-sflow

host-sflow agent
http://sflow.net

sys uptime should be more accurate #3

Closed elianka closed 8 years ago

elianka commented 8 years ago

Hi, the member "bootTime" in struct SFLAgent is defined as time_t, and it is multiplied by 1000 when filling in the sys uptime field of the datagram. In the sFlow version 5 spec, sys uptime is expressed in milliseconds, so I think the one-second resolution will affect the accuracy of counter rates computed at the collector. I suggest changing it to a timeval.
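To illustrate the concern, here is a minimal Python sketch of the arithmetic, not the hsflowd code itself. It assumes uptime is derived as (now - bootTime) * 1000 from whole-second values, and compares that with a timeval-style (seconds, microseconds) pair. Both function names are hypothetical.

```python
def uptime_ms_from_seconds(now_sec, boot_sec):
    """Uptime in ms from whole-second timestamps (time_t-style).

    The result is always a multiple of 1000, so sub-second detail is lost.
    """
    return (now_sec - boot_sec) * 1000


def uptime_ms_from_timeval(now, boot):
    """Uptime in ms from (seconds, microseconds) pairs (timeval-style)."""
    total_usec = (now[0] - boot[0]) * 1_000_000 + (now[1] - boot[1])
    return total_usec // 1000


# Hypothetical values: two datagrams sent 500 ms apart within the same second.
boot = (100, 900_000)
first = (105, 100_000)
second = (105, 600_000)

coarse_delta = uptime_ms_from_seconds(second[0], boot[0]) - uptime_ms_from_seconds(first[0], boot[0])
fine_delta = uptime_ms_from_timeval(second, boot) - uptime_ms_from_timeval(first, boot)
print(coarse_delta, fine_delta)  # 0 500
```

With whole-second timestamps the two datagrams appear to carry the same uptime, so a collector dividing by the uptime delta has nothing to divide by.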

sflow commented 8 years ago

Thanks for the input.

I agree that it should probably be a timeval, but I'm not sure it helps to use this field for computing counter rates at the collector. The sFlow spec allows the datagram to linger for up to a second before it is sent out, so that it can fill up with multiple packet/counter samples. This timestamp is only added at the point where the datagram is actually sent. A "time-dither" of about 1 second is therefore present no matter what you do after that. This tends to be much larger than the transport and stack delays so it's really just as good to timestamp on receipt at the collector. Better the clock you know...

Bear in mind that the counters may not be updated instantaneously by the underlying hardware/OS anyway.

Neil

elianka commented 8 years ago

I fully agree that a second does not matter when the polling interval is somewhat larger. But when I want to use the counters closer to real time, with an interval of a second or less, it may be a problem.

I guess the collector calculates the traffic rate by dividing the delta of the counters (C1-C2) by the delta of sys uptime (t1-t2). So when (t1-t2) is small enough, the accuracy of t1 and t2 will influence the whole result.
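A small Python sketch of that calculation shows how a timestamp error propagates into the rate. It only models the arithmetic described above; the function name and the sample numbers are hypothetical, and a real collector may compute rates differently.

```python
def rate_per_sec(c1, c2, t1_ms, t2_ms):
    """Counter rate per second from two readings and their uptime stamps in ms."""
    dt_ms = t2_ms - t1_ms
    if dt_ms <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return (c2 - c1) * 1000.0 / dt_ms


# Hypothetical readings: 625,000 octets counted over a true interval of 500 ms.
true_rate = rate_per_sec(0, 625_000, 4_200, 4_700)

# The same counters, but with uptime stamps that only resolve whole seconds,
# so the interval is reported as 1000 ms instead of 500 ms.
coarse_rate = rate_per_sec(0, 625_000, 4_000, 5_000)

print(true_rate, coarse_rate)  # 1250000.0 625000.0
```

In this example the coarse timestamps halve the computed rate. The same one-second error over a 30-second interval would change the result by only a few percent.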

I am using sFlow to develop a PoC (proof of concept), and the monitored nodes are real devices (switches). I need to monitor the real-time traffic rate on each port/VLAN at an interval of less than a second, so I think the sys uptime will influence the resulting traffic rate.

sflow commented 8 years ago

Some switches only pull counters from ASIC hardware every few seconds anyway, so using the counters for sub-second analysis may be asking too much of them. The good news is that you don't have to go far to find an answer that does work: process the sFlow packet-samples using sFlow-RT.

It's a slightly surprising inversion, but at shorter time-frames the packet samples can provide a more responsive signal than the counters. Especially if you are more interested in places where the traffic level is high. The graph in this sFlow-RT app illustrates the effect:

http://blog.sflow.com/2015/11/sflow-test.html

Full disclosure: sFlow-RT is a product from my company. However, it's free for research, community supported, and all the apps are open-source. http://sflow-rt.com.

elianka commented 8 years ago

Yeah, I have used sFlow-RT as the collector, and it's great software. However, the sFlow-RT Java library code is not open source.

I don't think packet samples can provide a more responsive signal than the counters when we want to know the real traffic rate per port/VLAN. The packet samples are based on sampling, and you can find the expected sampling error at this link: http://www.sflow.org/packetSamplingBasics/.

The counters are the accurate data on how many packets and how many octets have been transmitted. Yes, it's a slight inversion, but I think it's another situation that should be considered. And if we want to cover this situation, I think we should modify the implementation of host sflow: the agent's tick is currently based on 1 second, so it is not capable of sampling over a shorter time frame. :)

sflow commented 8 years ago

I think you'll find that sFlow-RT does rather better than those equations would imply, and it's not all that surprising when you consider that receiving 10 packet-samples/sec from a port is really giving you 10 polls/sec on the "sample_pool" packet-counter for that port. The packet-samples fill the datagrams faster too so they get flushed out with minimal delay, taking the latest counter-samples with them (which makes those counter-deltas more accurate too). But not knowing your PoC requirements I can't really comment further. For example, if you need sub-second response even at low traffic levels then you may have to try another approach.
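As a rough illustration of treating each packet sample as a poll of the "sample_pool" counter, here is a minimal Python sketch. It assumes the collector timestamps each sample on receipt and that sample_pool is a 32-bit unsigned counter that can wrap. The function name and the numbers are hypothetical, and this is not how sFlow-RT is implemented.

```python
def packet_rate_from_sample_pool(samples):
    """Estimate packets/sec from (receive_time_sec, sample_pool) pairs.

    samples must be in arrival order and span a non-zero time range.
    """
    if len(samples) < 2:
        raise ValueError("need at least two packet samples")
    t_first, pool_first = samples[0]
    t_last, pool_last = samples[-1]
    dt = t_last - t_first
    if dt <= 0:
        raise ValueError("samples must span a positive time range")
    # Modulo arithmetic handles a single wrap of the 32-bit counter.
    delta = (pool_last - pool_first) % 2**32
    return delta / dt


# Hypothetical samples from one port, half a second apart, across a wrap.
samples = [(0.0, 4_294_967_000), (0.5, 204)]
print(packet_rate_from_sample_pool(samples))  # 1000.0
```

The more samples a port generates per second, the more often this counter is effectively read, which is why the signal gets more responsive as traffic increases.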

The sFlow standard was written to avoid making unrealistic or unnecessary demands on hardware. Most use-cases do not require accurate readings at sub-second granularity during low traffic levels, so that's why "read-time" timestamp fields are not in the base structures.

I don't know what else to suggest for the switch, but with hsflowd there is nothing stopping you from making changes. You might attach an extra structure to each counter sample -- tagged with your enterprise number -- that contains only a timestamp to indicate when the counters were read from /proc. That way the feed would still comply with the sFlow standard. It would just have a small private extension that other collectors would ignore.
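One possible shape for such an extension, sketched in Python rather than in hsflowd's C: the record tag combines a 20-bit enterprise number with a 12-bit format number, followed by an XDR length and the body. The enterprise number 99999, the format number 1, and the (seconds, microseconds) body layout are all placeholders chosen for illustration, not anything defined by the sFlow standard or by hsflowd.

```python
import struct


def data_format(enterprise, fmt):
    """Build an sFlow record tag: enterprise in the top 20 bits, format in the low 12."""
    if not 0 <= enterprise < 2**20:
        raise ValueError("enterprise number must fit in 20 bits")
    if not 0 <= fmt < 2**12:
        raise ValueError("format number must fit in 12 bits")
    return (enterprise << 12) | fmt


def encode_read_time_record(enterprise, fmt, sec, usec):
    """Encode a hypothetical 'counters read at' record: tag, length, then body."""
    body = struct.pack(">II", sec, usec)  # big-endian, as XDR requires
    return struct.pack(">II", data_format(enterprise, fmt), len(body)) + body


# Placeholder enterprise number and timestamp, for illustration only.
record = encode_read_time_record(99999, 1, 1_700_000_000, 250_000)
print(len(record))  # 16
```

Because the record carries its own length, a collector that does not recognise the tag can skip over it and continue with the next record.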

elianka commented 8 years ago

It's great to know the reason why "read-time" timestamp fields are not in the base structures. That's what I had been considering, and I agree with it. The extra data will not be parsed by collectors like sFlow-RT. I think I should modify some code to handle my rarely occurring case. Thanks all the same.