robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

How to handle Traffic Graphs #250

Closed: jcdaniel14 closed this issue 5 years ago

jcdaniel14 commented 5 years ago

First of all, I really appreciate this tool; it has helped a lot. I work for an ISP and I'm receiving around 15k to 20k flows/sec. Currently I'm facing a problem with traffic graphs: a link that we know should be carrying around 80 Gbps (verified in Cacti) shows ~14 Mbps in the default Kibana graphs (no filters).

I know that NetFlow works by sampling, and I think it's currently configured to sample about 1 out of every 1000 packets. I've also seen that the graphs calculate traffic by aggregating the byte counts from the received flows. So I want to know how I should handle this: should I just apply a multiplier to compensate for the sampling rate, or calculate a moving average? I've tried both, but the resulting graphs seem unreliable. Maybe there's something else I could apply?
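To illustrate what I mean by a multiplier, I applied it in a Logstash filter along these lines (a rough sketch only; the field name [flow][bytes] and the 1000x factor are assumptions based on our sampling configuration, and the actual field name may differ by ElastiFlow version):

```
filter {
  # Rough sketch: compensate for 1:1000 packet sampling by scaling the
  # reported byte count. The field name [flow][bytes] is an assumption;
  # adjust it to whatever field your ElastiFlow version populates.
  ruby {
    code => "
      bytes = event.get('[flow][bytes]')
      event.set('[flow][bytes]', bytes.to_i * 1000) unless bytes.nil?
    "
  }
}
```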

As an example, here is the last day of traffic as presented via NetFlow: [image]

vs. the same traffic collected via SNMP (Cacti): [image]

robcowart commented 5 years ago

What hardware resources have you given to Logstash?

jcdaniel14 commented 5 years ago

The whole stack (Kibana, Elasticsearch and Logstash) runs in a single VM with 24 GB of RAM and 12 vCPU cores (Xeon 2.7 GHz). I've assigned 6 GB of RAM to the Logstash JVM and 10 GB to Elasticsearch (since Elasticsearch was stalling when I set the time range above 4 hours). I'm also considering disabling reverse DNS resolution, since we've had to wait around 15 minutes for flows to be correctly indexed and start appearing in visualizations.

As for storage, it's a 1 TB SSD.

robcowart commented 5 years ago

You are undoubtedly dropping packets because Logstash can't keep up. There are a few things you can do that will help (a rough sketch of all three changes follows the list).

  1. Run sudo sysctl -w net.core.rmem_max=33554432 to provide a larger receive buffer for incoming packets. To make this persistent across reboots, switch to the ElastiFlow v3.x-dev branch, grab sysctl.d/87-elastiflow.conf and put it in /etc/sysctl.d.

  2. Edit /etc/systemd/system/logstash.service, find the line Nice=19 and change it to Nice=0. By default Logstash is started as a low-priority process, which KILLS throughput. A nice value of 0 will give it the same priority as normal processes.

  3. Increase the environment variable ELASTIFLOW_NETFLOW_UDP_WORKERS (or ELASTIFLOW_IPFIX_UDP_WORKERS if applicable) from the default of 4 to 6 or even 8.
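
Roughly, the three changes look like this on a systemd-based install (a sketch only; the drop-in path for the environment variable is an assumption about how your Logstash is started, and the real 87-elastiflow.conf from the v3.x-dev branch may contain additional settings):

```
# 1. Larger UDP receive buffer now, and persisted across reboots
#    (the one-liner below is a stand-in for the 87-elastiflow.conf from the repo).
sudo sysctl -w net.core.rmem_max=33554432
echo 'net.core.rmem_max=33554432' | sudo tee /etc/sysctl.d/87-elastiflow.conf

# 2. After changing Nice=19 to Nice=0 in /etc/systemd/system/logstash.service,
#    reload systemd and restart Logstash so the new priority takes effect.
sudo systemctl daemon-reload
sudo systemctl restart logstash

# 3. More UDP input workers. One option is a systemd drop-in such as
#    /etc/systemd/system/logstash.service.d/elastiflow.conf (path is an
#    assumption) containing:
#      [Service]
#      Environment="ELASTIFLOW_NETFLOW_UDP_WORKERS=8"
```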

This should greatly increase your throughput (disabling name lookups is also a good idea for now).
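For the name lookups, the relevant setting should be ELASTIFLOW_RESOLVE_IP2HOST (verify the variable name against the README for the version you are running); for example, in the same drop-in sketched above:

```
# Assumed systemd drop-in (same file as above); verify the variable name
# against the ElastiFlow README for your version.
[Service]
Environment="ELASTIFLOW_RESOLVE_IP2HOST=false"
```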

Having said all of this, I still think you will struggle to decode 20K flows/sec. See https://github.com/robcowart/elastiflow/issues/244.