robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

Problem with bandwidth measurement #538

Closed: MikeFil closed this issue 3 years ago

MikeFil commented 4 years ago

Hello. Could you help with a problem? On some hosts the bandwidth reported differs greatly from the SNMP data. Also, the more interfaces on such a host are monitored with NetFlow v9, the greater the deviation in the bandwidth readings. Examples: 1, 2

robcowart commented 4 years ago

You may be dropping UDP packets, and thus missing flow records. Please take a look at the tuning recommendations in this post...

https://github.com/robcowart/elastiflow/issues/131#issuecomment-408211563

MikeFil commented 4 years ago

These recommended settings have already been applied. UDP packets are not being dropped.

robcowart commented 4 years ago

Is your device using sampling? If so you should confirm that it actually sends the sample rate in the flow records. If the sample rate is not in the flows, you will need to add the sampling interval in https://github.com/robcowart/elastiflow/blob/master/logstash/elastiflow/dictionaries/sampling_interval.yml.
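To make the sampling point concrete: with 1-in-N sampling, each flow record only accounts for roughly 1/N of the real traffic, so the reported byte count has to be multiplied by the sampling interval. The following is a minimal arithmetic sketch, not ElastiFlow's actual Logstash logic; the function name and the example numbers are hypothetical.

```python
def scale_sampled_bytes(reported_bytes: int, sampling_interval: int) -> int:
    """Estimate the real byte count behind a 1-in-N sampled flow record.

    Hypothetical helper for illustration only.
    """
    if sampling_interval < 1:
        raise ValueError("sampling_interval must be >= 1")
    return reported_bytes * sampling_interval


# A record reporting 1,500 bytes from a 1-in-1000 sampler stands for
# roughly 1,500,000 bytes of real traffic.
estimate = scale_sampled_bytes(1500, 1000)
print(estimate)
```

If the scaling is skipped, flow-based bandwidth will read far lower than SNMP; if it is applied twice, far higher.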

MikeFil commented 4 years ago

Sampling is not used. Also, the deviations are above the SNMP values. For example: SNMP shows 80MBi/s, while ElastiFlow shows 150-160MBi/s.

robcowart commented 4 years ago

Flow data is never as exact for a specific point in time as SNMP, especially if flow timeouts are long. When it differs significantly, it is usually for one of the above reasons. However, determining the exact cause would require more information about the network, plus a PCAP of the traffic and flow records.
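One way to see why long flow timeouts blur point-in-time accuracy: a flow record carries a single byte total for its whole lifetime, whereas an SNMP counter is read at every polling interval. The sketch below spreads a flow's bytes evenly across the time buckets it spans. It is an illustration under assumed numbers (a 300 s flow, 60 s buckets), and the function name is hypothetical; it does not describe how ElastiFlow itself assigns bytes to time.

```python
def spread_flow_bytes(start_s, end_s, total_bytes, bucket_s=60):
    """Distribute a flow's bytes evenly over the time buckets it spans.

    Returns a dict mapping bucket start time (seconds) to bytes.
    """
    if end_s <= start_s:
        # Zero-length flow: everything lands in one bucket.
        return {(start_s // bucket_s) * bucket_s: float(total_bytes)}
    buckets = {}
    duration = end_s - start_s
    t = start_s
    while t < end_s:
        bucket_start = (t // bucket_s) * bucket_s
        segment_end = min(end_s, bucket_start + bucket_s)
        share = total_bytes * (segment_end - t) / duration
        buckets[bucket_start] = buckets.get(bucket_start, 0.0) + share
        t = segment_end
    return buckets


# A 300 s flow carrying 3000 bytes is 600 bytes per 60 s bucket when spread,
# but a single 3000-byte data point if attributed only to its export time.
per_bucket = spread_flow_bytes(0, 300, 3000)
print(per_bucket)
```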

robcowart commented 4 years ago

What is suspicious to me is how flat the data is in the charts. This usually indicates dropped packets. Can you run netstat -su twice about 60s apart on the machine where Logstash is running and share the results?

MikeFil commented 4 years ago

logstash:~$ netstat -su
IcmpMsg:
    InType0: 9
    InType3: 42
    InType8: 13281
    OutType0: 13281
    OutType3: 13692
    OutType8: 24
Udp:
    105820002 packets received
    473464 packets to unknown port received
    8043502 packet receive errors
    5379 packets sent
    8043502 receive buffer errors
    0 send buffer errors
    IgnoredMulti: 6
UdpLite:
IpExt:
    InMcastPkts: 4
    InBcastPkts: 6
    InOctets: 183173278156
    OutOctets: 3051951872410
    InMcastOctets: 144
    InBcastOctets: 468
    InNoECTPkts: 196505153

logstash:~$ netstat -su
IcmpMsg:
    InType0: 9
    InType3: 42
    InType8: 13281
    OutType0: 13281
    OutType3: 13692
    OutType8: 24
Udp:
    105824291 packets received
    473464 packets to unknown port received
    8043502 packet receive errors
    5379 packets sent
    8043502 receive buffer errors
    0 send buffer errors
    IgnoredMulti: 6
UdpLite:
IpExt:
    InMcastPkts: 4
    InBcastPkts: 6
    InOctets: 183179751003
    OutOctets: 3052089155976
    InMcastOctets: 144
    InBcastOctets: 468
    InNoECTPkts: 196512135
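What matters in these two runs is the change between them, not the absolute values. The sketch below parses the Udp section of each run and subtracts the counters. The parser is a hypothetical helper written for this output layout, and the interval between the two runs is not stated in the thread. The non-zero "receive buffer errors" total shows that drops happened at some point since the counters were last reset, but the count did not grow between these two particular runs.

```python
import re


def udp_counters(netstat_output: str) -> dict:
    """Extract 'N description' counters from the Udp: section of `netstat -su`."""
    counters = {}
    in_udp = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        # Section headers look like "Udp:" (a single token ending in a colon).
        if stripped.endswith(":") and " " not in stripped:
            in_udp = stripped == "Udp:"
            continue
        match = re.match(r"(\d+) (.+)", stripped)
        if in_udp and match:
            counters[match.group(2)] = int(match.group(1))
    return counters


# Udp sections copied from the two runs posted in this thread.
first_run = """Udp:
    105820002 packets received
    473464 packets to unknown port received
    8043502 packet receive errors
    5379 packets sent
    8043502 receive buffer errors
    0 send buffer errors
UdpLite:
"""
second_run = first_run.replace("105820002", "105824291")

before = udp_counters(first_run)
after = udp_counters(second_run)
delta = {name: after[name] - before[name] for name in before}
print(delta)
```

Here the delta is 4289 packets received and 0 new receive buffer errors.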

robcowart commented 4 years ago

If you can provide a PCAP of your flow records, I can look for anything unexpected. But if you have no logs or other indication of dropped or missed data, the issue might be the device under-reporting.

MikeFil commented 4 years ago

Unfortunately, the security service prohibits the sending of PCAP. Thank you for your help.

MikeFil commented 4 years ago

Hello. There is new information on this issue. The overstated bandwidth measurements in NetFlow are due to a lack of flow deduplication. The incorrect display in the examples occurs on a Cisco ASR; we are currently looking for the cause of that problem with the help of a Cisco engineer.
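To illustrate how missing deduplication can inflate totals: if the same flow is reported once per monitored interface it crosses, a plain sum counts its bytes more than once. The sketch below is a deliberately naive deduplication that keeps one observation per 5-tuple. The records, field names, and function are hypothetical, and real deduplication also has to consider time windows and which exporter sent the record.

```python
def total_bytes(records, deduplicate):
    """Sum flow bytes, optionally counting each 5-tuple only once."""
    if not deduplicate:
        return sum(r["bytes"] for r in records)
    largest = {}
    for r in records:
        key = (r["src"], r["dst"], r["sport"], r["dport"], r["proto"])
        # Keep the largest byte count seen for this flow key.
        largest[key] = max(largest.get(key, 0), r["bytes"])
    return sum(largest.values())


# Hypothetical records: the first flow is reported on two interfaces.
flow_a = {"src": "10.0.0.1", "dst": "10.0.1.1", "sport": 40000,
          "dport": 443, "proto": 6, "bytes": 1000}
flow_b = {"src": "10.0.0.2", "dst": "10.0.1.1", "sport": 40001,
          "dport": 443, "proto": 6, "bytes": 500}
records = [
    {**flow_a, "iface": 1},
    {**flow_a, "iface": 2},
    {**flow_b, "iface": 1},
]

raw = total_bytes(records, deduplicate=False)
deduped = total_bytes(records, deduplicate=True)
print(raw, deduped)
```

In this toy data the raw sum is 2500 bytes against 1500 after deduplication, and the gap grows with every additional interface that reports the same flow.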

maintain3r commented 4 years ago

Hello, is there any progress? Graphs built from the data of my sFlow v5 agents (Cisco Nexus 3k) are not accurate. The accuracy is poor, and there is no pattern or ratio when I compare them with my graphs gathered through SNMP. sflow-sampling-interval is 4096 (this is the minimum, I can't set it lower). It looks like logstash-sflow-codec doesn't take counter samples into account. Any thoughts on how to improve that?
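Part of the inconsistency with packet sampling is statistical: an estimate built from few samples is noisy, so the ratio to SNMP can vary from one interval to the next. A commonly cited rule of thumb from the sFlow packet sampling documentation bounds the relative error at 95% confidence by 196 * sqrt(1/c) percent, where c is the number of samples in the class being measured. The sketch below just evaluates that bound; the function name and sample counts are illustrative assumptions, and it says nothing about whether the codec handles counter samples.

```python
import math


def max_error_pct(sample_count: int) -> float:
    """Approximate 95%-confidence error bound (in percent) for a sampled estimate."""
    if sample_count < 1:
        raise ValueError("sample_count must be >= 1")
    return 196.0 * math.sqrt(1.0 / sample_count)


# Fewer samples in a time bucket means a much wider error band.
for c in (100, 1000, 10000):
    print(c, round(max_error_pct(c), 2))
```

With a 1-in-4096 sampling interval, a chart bucket needs on the order of 4096 * 10000 packets before the bound drops to about 2%.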

robcowart commented 3 years ago

@MikeFil as long as only one device/exporter is being viewed, flow deduplication shouldn't make a difference. Unfortunately, I am unable to troubleshoot this further without a sample of the data.