path-network / logstash-codec-sflow

Logstash codec plugin to decrypt sflow

sFlow metrics are not accurate #25

Open maintain3r opened 4 years ago

maintain3r commented 4 years ago

Although the sFlow protocol is sample-based and its accuracy depends heavily on the sample-rate parameter (the lower the value, the better the precision), some commercial tools show results that are ~99.9% accurate. I have ElastiFlow + logstash-sflow-plugin and telegraf-snmp + Prometheus + Grafana installed, and I use these two setups to compare results. To make it an apples-to-apples comparison, I take values from both setups for a specific interface that carries the traffic of just a single public IP. The in/out counters gathered through SNMP and those derived from sFlow (logstash sflow plugin) are very different. Even if you don't have telegraf-snmp and the rest of that stack, it's easy to check with the following script:

#!/bin/bash
#
# Print the average in/out rate (Mbps) of one interface, computed from the
# 64-bit IF-MIB octet counters read over SNMPv3.
delay=60

# Current ifHCOutOctets value: the 4th field of "OID = Counter64: <value>".
function getbytes_out()
{
    snmpwalk -v3 -l authPriv -u <myuser> -a SHA -A <super@uth> -x AES -X <secretp@ssword> <device name or ip> IF-MIB::ifHCOutOctets.<snmp ifIndex number> | awk '{print $4}'
}

# Current ifHCInOctets value.
function getbytes_in()
{
    snmpwalk -v3 -l authPriv -u <myuser> -a SHA -A <super@uth> -x AES -X <secretp@ssword> <device name or ip> IF-MIB::ifHCInOctets.<snmp ifIndex number> | awk '{print $4}'
}

while true
do
    bytes_start_out=$(getbytes_out)
    bytes_start_in=$(getbytes_in)
    sleep "${delay}"
    bytes_end_out=$(getbytes_out)
    bytes_end_in=$(getbytes_in)
    # Octets transferred during the interval.
    result_out=$((bytes_end_out - bytes_start_out))
    result_in=$((bytes_end_in - bytes_start_in))
    # Multiply before dividing so that bc's scale=2 only truncates the final result.
    speed_out=$(echo "scale=2; ${result_out} * 8 / (${delay} * 1000000)" | bc)
    speed_in=$(echo "scale=2; ${result_in} * 8 / (${delay} * 1000000)" | bc)
    echo "AVG Speed (Mbps) for last ${delay}s: ${speed_in} In  ${speed_out} Out"
done

The values I get from Prometheus and from the script above are very, very close. But again, the values from Prometheus (telegraf-snmp) and the script differ greatly from those reported by Logstash. To give you an idea, for a 1G uplink, ElastiFlow + Logstash shows something like 1.6G of utilization in the out direction (I have filters in place on flow.src_addr or flow.dst_addr to distinguish in/out traffic). The in direction shows the same problem. And there is no pattern: I tried to find a correlation or a ratio between the results but didn't find one. :(

My Cisco Nexus devices don't allow me to set a sampling-rate lower than 4096, and this is what negatively impacts the precision. Even so, some companies overcome this by leveraging interface counters. Here's an interesting explanation of sFlow accuracy on plixer.com (their commercial product is called Scrutinizer): https://www.plixer.com/blog/why-doesnt-sflow-look-accurate

From what I understand, they say that to make an sFlow collector more accurate, they use the interface counters (which are part of sFlow, i.e. the COUNTERSSAMPLE records coming from the sFlow device) and fit all FLOWSAMPLE counters into the number of octets that passed through that particular interface. COUNTERSSAMPLE is a special part of the sFlow protocol that lets you carry not just bandwidth utilization but other metrics too, e.g. CPU, memory, etc. It would be nice to fix this issue.
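To make the counter-based correction described above more concrete, here is a minimal sketch of the arithmetic as I understand it; it is not taken from the plugin or from Scrutinizer. It assumes the usual sFlow scale-up (sampled bytes multiplied by the sampling rate) and then a single correction factor that makes the flow estimates add up to the interface counter delta. The function names (`estimate_bytes`, `correction_factor`, `to_mbps`) and all the numbers are hypothetical.

```shell
#!/bin/bash
# Sketch only: function names and sample values are hypothetical.

# Scale the sampled bytes up to an estimate of the real byte count.
# $1 = sum of frame lengths seen in FLOWSAMPLE records
# $2 = sampling rate N (one packet sampled out of every N)
estimate_bytes() {
    awk -v sampled="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", sampled * rate }'
}

# Ratio between the interface counter delta and the sFlow estimate.
# Multiplying each per-flow estimate by this factor makes the flows add up
# to the octets the interface reported for the same interval.
# $1 = counter delta in octets, $2 = estimated bytes
# Fails (exit status 1, no output) if the estimate is not positive.
correction_factor() {
    awk -v counter="$1" -v est="$2" 'BEGIN {
        if (est <= 0) exit 1
        printf "%.4f\n", counter / est
    }'
}

# Convert a byte count over an interval (seconds) into Mbps.
to_mbps() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", bytes * 8 / (secs * 1000000) }'
}

# Hypothetical 60 s interval on one interface.
interval=60
sampling_rate=4096
sampled_bytes=1500000        # bytes seen in flow samples
counter_delta=5529600000     # ifHCOutOctets delta over the same interval

estimated=$(estimate_bytes "$sampled_bytes" "$sampling_rate")
factor=$(correction_factor "$counter_delta" "$estimated")

echo "sFlow estimate:    $(to_mbps "$estimated" "$interval") Mbps"      # 819.20 Mbps
echo "Interface counter: $(to_mbps "$counter_delta" "$interval") Mbps"  # 737.28 Mbps
echo "Correction factor: $factor"                                       # 0.9000
```

With these made-up numbers the raw sFlow estimate is about 11% too high, and scaling every flow by 0.9 would bring the total back in line with the counter. The sketch assumes the flow samples and the counter delta cover the same interface, direction, and time window; if they don't, the factor is meaningless.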

maintain3r commented 4 years ago

Any updates on this, or is the project considered dead (I hope it's not)?