robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

Alternative to Logstash as netflow ingest processor #363

Closed: sgreszcz closed this issue 3 years ago

sgreszcz commented 5 years ago

Hi Rob,

I was wondering if you are considering a more performant netflow ingestion processor to replace Logstash.

We find that we need 4 CPUs to consume 1200 NetFlow v9 packets per second using Logstash, even without your ElastiFlow enhancements.

Some of the things we are considering are vflow (Verizon) and goflow (Cloudflare), both written in Go, and even writing a NetFlow/IPFIX input for the CNCF fluentbit, which is written in C (it has more efficient filtering than Logstash and can do SQL-based stream processing, like Kafka).

I also saw that Elastic Filebeat 7.2 supports NetFlow/IPFIX as well as JavaScript scripting.

I guess there is a lot of metadata work that needs to be done to enrich the netflow data so that it shows up properly in your dashboards...

Do you have anything that you are working on as an alternative to Logstash?

I saw that you closed these two issues, but I'm curious if your position has changed: https://github.com/robcowart/elastiflow/issues/246 https://github.com/robcowart/elastiflow/issues/247

robcowart commented 5 years ago

Hmmm... I wrote a long detailed reply, which for some reason didn't post. I'll summarize...

I was actually up last night until about 1am working on a new collector. However, I don't want to just create yet another barely MVP quality collector, so it will take a while.

My issues with the other options available in the open source world remain the same. Most of them will get you to a decoded flow record, at least for the most common fields. However, they all fall short when it comes to support for vendor-specific information elements and the transformation/enrichment of the data that is necessary for more advanced, yet user-friendly, analytics. Perhaps I am being too ambitious, but I want the new collector to be truly great.

robcowart commented 5 years ago

Are you running Logstash on the same box as Elasticsearch and Kibana? If so, you can get a boost by modifying the nice level at which Logstash is started. In the systemd service file for Logstash (it should be something like /etc/systemd/system/logstash.service) there will be a line Nice=19. Change this to Nice=0 and restart Logstash with systemctl daemon-reload && systemctl restart logstash.
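Roughly, the whole change is just this (a quick sketch; adjust the path if your unit file lives elsewhere):

```sh
# Sketch of the change described above. Assumes the unit file is at
# /etc/systemd/system/logstash.service and contains a "Nice=19" line.
sudo sed -i 's/^Nice=19$/Nice=0/' /etc/systemd/system/logstash.service

# Reload systemd and restart Logstash so the new nice level takes effect.
sudo systemctl daemon-reload && sudo systemctl restart logstash

# Verify the running Logstash process now has nice value 0.
ps -o pid,ni,cmd -p "$(systemctl show -p MainPID --value logstash)"
```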

DanSheps commented 5 years ago

Honestly, I think it would be best for now to keep this as an ELK stack system.

The collector would be good as a side project, and once it is fully fleshed out you could include it. However, I would still keep the Logstash configs for people who want to use Logstash for other collection in addition to netflow.

Also, 1200 flows per second is a decent amount. Have you thought of using two Logstash boxes and splitting the flows from your devices between them? Also look at tweaking the pipeline settings for ElastiFlow.

sgreszcz commented 5 years ago

> Honestly, I think it would be best for now to keep this as an ELK stack system.
>
> The collector would be good as a side project, and once it is fully fleshed out you could include it. However, I would still keep the Logstash configs for people who want to use Logstash for other collection in addition to netflow.
>
> Also, 1200 flows per second is a decent amount. Have you thought of using two Logstash boxes and splitting the flows from your devices between them? Also look at tweaking the pipeline settings for ElastiFlow.

I agree with what you say here. I'm currently running a separate server to scale/performance test Logstash 7.2 (the new netflow "module") vs. Filebeat netflow.

If we ever do our own development of yet another netflow parser, we will align the naming with what is expected by Elastic ECS/ElastiFlow.

sgreszcz commented 5 years ago

> Are you running Logstash on the same box as Elasticsearch and Kibana? If so, you can get a boost by modifying the nice level at which Logstash is started. In the systemd service file for Logstash (it should be something like /etc/systemd/system/logstash.service) there will be a line Nice=19. Change this to Nice=0 and restart Logstash with systemctl daemon-reload && systemctl restart logstash.

I've got a separate box now running logstash 7.2 with the new netflow module to do scale/performance testing. I'm not yet using any Elastic netflow dashboards nor Elastiflow, just indexing netflow v9. I'll let you know how it goes.

philippkahr commented 5 years ago

@sgreszcz if you are still piloting, just upgrade to 7.3 directly. It dropped on 31 July.

To be honest, I have NetFlow from Cisco devices with the App Enhancement and IPFIX enabled on them, without sampling, and my Logstash 7.3 can easily handle ~3,000 flows/s with little to no indexing lag, while also doing GeoIP and reverse DNS.

4 CPU cores at 2.5 GHz each, 8 GB of RAM, with Logstash configured to use 4 GB.

Auditbeat, Metricbeat, and Filebeat are also running on the same host. Furthermore, it takes two additional syslog inputs and has to do some magic with them.

My load average sits somewhere between 2.7 and 3.

sgreszcz commented 5 years ago

I did a detailed performance comparison of Filebeat 7.2 and Logstash 7.2. Full disclaimer: I work in Cisco IT, and we are looking at consuming all of Cisco IT's netflow and filtering it down to collaboration flows based on NBAR2 classification for the tool that we are building.

Anyway, we are consuming our current EMEA netflow, which is coming in at about 13,000 UDP packets per second. It seems as though the UDP is being batched by the routers, so I'm not really sure how many flows per second I'm getting. Our UDP director (load balanced over 4 servers) that is feeding my netflow parser is sending around 7,000 per second (according to its metrics), so I'm not sure why tcpdump and the Linux script are showing about 14k UDP packets, unless there is fragmentation?

For example, Wireshark shows this:

No.    Time      Source       Destination     Protocol  Length  Info
19708  1.438789  10.115.8.70  173.38.202.202  CFLOW     1430    total: 22 (v9) records Obs-Domain-ID= 256 [Data:256]
19713  1.439103  10.115.8.70  173.38.202.202  CFLOW     1430    total: 22 (v9) records Obs-Domain-ID= 256 [Data:256]
19714  1.439113  10.115.8.70  173.38.202.202  CFLOW     1430    total: 22 (v9) records Obs-Domain-ID= 256 [Data:256]
19715  1.439116  10.115.8.70  173.38.202.202  CFLOW     130     total: 1 (v9) record Obs-Domain-ID= 256 [Data:256]

Here is the script showing inbound on my network interface (only netflow inbound):

~# ./pps.sh ens160
TX ens160: 1730 pkts/s
RX ens160: 12928 pkts/s
TX ens160: 1689 pkts/s
RX ens160: 12906 pkts/s
TX ens160: 1433 pkts/s
RX ens160: 13080 pkts/s

We are running only Logstash or Filebeat in Docker (one at a time) on a 16-CPU, 16 GB Ubuntu 18.04 LTS server.

I did a 10-second tcpdump capture while Logstash was running and another while Filebeat was running, to look at how the parsing worked using Wireshark.

Logstash seems to flush to the network with an HTTP POST containing 125 index/netflow JSON pairs. Filebeat, by default, flushes after around 50 netflow records are received and then emits its index/netflow JSON pairs.

Logstash uses only about 25% of the CPU (roughly 4 cores) but around 9 to 11 GB of RAM (Java heap). When I look at the server monitoring in Elastic, I can see that as the netflow increases over the business day there is no real increase in CPU/memory for that server, which likely means I still have some headroom to play with.

Surprisingly, with Filebeat the CPU usage for the same flow input was 50% (about 8 cores) but only 300 MB of RAM, significantly less than Logstash. However, even with Logstash doing GeoIP enrichment, etc., the netflow was indexed in Elasticsearch much closer to real time with Logstash than with Filebeat. Maybe there is some memory or parallelisation configuration I need to do with Filebeat that I'm not doing now.
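If I revisit Filebeat, the knobs I would start with are the memory queue and the Elasticsearch output batching, something like the following (a hypothetical sketch with placeholder values, not settings I have verified):

```yaml
# filebeat.yml - hypothetical tuning sketch, not a verified configuration
queue.mem:
  events: 65536            # enlarge the in-memory queue
  flush.min_events: 2048   # flush larger batches instead of tiny ones
  flush.timeout: 1s

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]   # placeholder host
  worker: 4                # parallel bulk workers per host
  bulk_max_size: 2048      # larger bulk requests (the default of 50 matches the ~50-event flushes I observed)
```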

Here are the server longer-term stats when running logstash. I haven't done this with filebeat.

[Screenshot 2019-08-12 at 22:40:59: longer-term server stats while running Logstash]

So the netflow I'm consuming is from one of our smaller global regions; I would still need to scale out to cover five more, larger geographical regions. I'm also going to performance test vflow and goflow for comparison (as well as maybe nfdump, but I think that has to write to disk and I'd prefer something that works entirely in memory).

I guess I still have a lot of questions about the new Logstash netflow module.

sgreszcz commented 5 years ago

I guess I'm still very new to netflow parsing and to understanding the functionality available with the new Logstash netflow module. Does anyone know if you can still use a local MaxMind DB for GeoIP lookups of private IP address ranges? Has anyone done some sort of netflow filtering? We are looking at exporting one of the NBAR2 traffic type classifications and then dropping everything that is not voice/video (collaboration) traffic. Our daily Elasticsearch index is about 145 GB with only the one region.

I still have to try enabling the ElastiFlow (and default Elasticsearch) dashboards for netflow. This would help us visualise flows, but we are really using the netflow to do path tracing of traffic flows.

philippkahr commented 5 years ago

In the Kibana health tab you can easily see how many lines are being added to an index per second.

Importing the dashboards is easy. Just go to the Kibana settings and click on "Saved Objects". In the installation.MD there is a section regarding this.

Logstash vs filebeat.

OK, you mentioned a few things. I myself am not quite sure yet, so this is more of a guess in the wild: Logstash handles all the enrichment locally, while Filebeat sets up an ingest pipeline processor, mainly having the Elasticsearch node do all the work.

Regarding MaxMind GeoIP: as far as I recall, ElastiFlow is using those databases already. So you might only have to adapt the database files, drop them into the right place on the filesystem, and of course restart Logstash. I have never dealt with modifying MaxMind DBs, so I cannot help you there.
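As a rough illustration, a plain Logstash geoip filter pointed at a local database file looks something like this (field names and paths are only examples, not the exact ElastiFlow config):

```
filter {
  geoip {
    source   => "[flow][src_addr]"      # example source field
    target   => "[flow][geoip_src]"     # example target field
    # local/custom MaxMind database file, e.g. one with your private ranges
    database => "/etc/logstash/geoipdbs/GeoLite2-City.mmdb"
  }
}
```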

If you want to add an additional filter, e.g. flow.src_ip IS NOT 192.168.0.0, you will have to modify the files lying in conf.d or, better, create a new filter after some of the first steps.

Regarding NBAR: there is already a dictionary available that maps those NBAR numbers to applications, so the principle would be the same as above. Create a filter that runs after everything else is applied, and then, if nbar.type EQUALS voice, do your magic.
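Very roughly, such a drop filter could look like this (the field name and the regex are placeholders; check what your pipeline actually produces first):

```
filter {
  # "[flow][application_name]" is a placeholder - check which field the NBAR
  # dictionary lookup writes in your pipeline before using something like this.
  if [flow][application_name] and [flow][application_name] !~ /voice|video|telepresence|webex/ {
    drop { }   # keep only collaboration traffic, discard everything else
  }
}
```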

dj-damix commented 5 years ago

Hi guys, how do you do load balancing in a netflow/ElastiFlow context? I'm thinking of using nginx UDP load balancing: sending packets from my exporter to nginx, which will in turn send the packets to multiple Logstash instances processing the netflow. My only hesitation and question is this: does ElastiFlow need context, i.e. the adjacent flows, to be effective, or, as long as the data reaches the same Elasticsearch cluster, does it not matter that Logstash processor 1 handled one flow and Logstash processor 2 handled another?
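For reference, the nginx side of what I have in mind would be roughly this (an untested sketch with placeholder addresses); hashing on the exporter's source IP should keep a given exporter's template and data packets on the same Logstash instance:

```nginx
# untested sketch - UDP load balancing of netflow to two Logstash collectors
stream {
  upstream logstash_netflow {
    hash $remote_addr consistent;   # pin each exporter to one backend
    server 10.0.0.11:2055;          # placeholder Logstash instance 1
    server 10.0.0.12:2055;          # placeholder Logstash instance 2
  }

  server {
    listen 2055 udp;
    proxy_pass logstash_netflow;
    proxy_responses 0;              # netflow is one-way; don't wait for replies
  }
}
```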

alfredosola commented 5 years ago

Today, Elastic released Logstash 7.4.0 and officially deprecated the netflow module. They suggest the Filebeat NetFlow module as a replacement. @robcowart, is there any way to replace Logstash with Filebeat in the current version of ElastiFlow?

robcowart commented 5 years ago

To be clear... they have deprecated the Logstash Module, not the Logstash Netflow codec.

The Logstash codec works within a Logstash UDP (or TCP) input to decode the raw Netflow and IPFIX payload. The Logstash Module was actually based on ElastiFlow 1.0.0 (but never really further maintained by Elastic), and like ElastiFlow it leverages the codec within its input logic.
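In other words, the codec is used more or less like this inside an input (simplified; the actual ElastiFlow input configs set quite a few more options):

```
input {
  udp {
    port  => 2055      # example collector port
    codec => netflow   # the Logstash Netflow/IPFIX codec does the decoding
  }
}
```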

Regarding the use of Filebeat, there are unfortunately a few issues.

  1. The biggest issue is that the Netflow input for Filebeat is part of X-Pack basic and is covered under the Elastic License, not Apache 2.0. An example of why this is a problem is containers. While I can provide the configuration required for Filebeat, and even some tooling to allow you to build your own pre-configured container... I cannot pre-build the container and make it available to users on Docker Hub, as I could be in violation of the restrictions on distribution of the Elastic Licensed code.
  2. It doesn't support decoding of all of the fields/IEs that can be decoded with the custom definitions included in ElastiFlow.
  3. It does very little enrichment of the raw data. Logstash would still be required to do much of the "post-processing" of the raw data.

To be fair, Filebeat can decode the raw flows with much better performance than Logstash. So using it has some merit.
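For anyone who wants to experiment, the Filebeat side is only a few lines (a minimal sketch; the downstream transformation/enrichment discussed above would still be needed):

```yaml
# filebeat.yml - minimal netflow input sketch
filebeat.inputs:
  - type: netflow
    host: "0.0.0.0:2055"          # UDP address:port to listen on
    protocols: [v5, v9, ipfix]    # flow versions to decode
    max_message_size: 10KiB

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]   # placeholder host
```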

Instead of using Filebeat, my plan is to transition to a collector I am writing myself. I actually have a ton of work done already. However, since I still have to pay bills, my paying work has had priority the last few months.

I definitely want to move away from Logstash (and not just for Flows... for Logs, SNMP polling, SNMP traps, telemetry, and more). It has too many limitations for larger, heterogeneous network environments. It is just about finding/making the time to do so.

sgreszcz commented 5 years ago

> Today, Elastic released Logstash 7.4.0 and officially deprecated the netflow module. They suggest the Filebeat NetFlow module as a replacement. @robcowart, is there any way to replace Logstash with Filebeat in the current version of ElastiFlow?

We tried Filebeat (see above) and, besides a completely different resource pattern (more CPU, less memory than Logstash), we found that we were missing netflow packets getting to ES. I didn't see any event drops, etc., in the logs, but I didn't spend a lot of time on it and have since switched back to Logstash.

Also, everything that Rob said above remains a challenge with Filebeat.

We have looked at goflow/vflow, but they have a dependency on Kafka, which we are trying to avoid for simplicity's sake. We have also considered writing a netflow input for fluentbit (in C) or the Vector data router (in Rust), but are trying to avoid that too if possible.

One of our requirements would be to filter and index only certain traffic types based on NBAR, and to geo-enrich based on a private IP database during ingest/parsing at the edge.

Maybe I need to do more tuning on the filebeat parameters and try again.

robcowart commented 5 years ago

@sgreszcz your results with Filebeat are curious indeed.

BTW, I would encourage you to reconsider avoiding Kafka. While I understand the complexity argument, it does bring a lot of benefits as well. I consider my Kafka cluster to be even more important than my Elastic cluster. It is the ordered, persistent record of truth.

Regardless, this thread got me motivated. It was a holiday in Germany today, so I spent a good part of the day working on the new collector.

sgreszcz commented 5 years ago

> @sgreszcz your results with Filebeat are curious indeed.
>
> BTW, I would encourage you to reconsider avoiding Kafka. While I understand the complexity argument, it does bring a lot of benefits as well. I consider my Kafka cluster to be even more important than my Elastic cluster. It is the ordered, persistent record of truth.
>
> Regardless, this thread got me motivated. It was a holiday in Germany today, so I spent a good part of the day working on the new collector.

Rob, we are planning on running Kafka in our IT production environment, as it is a great way to process/filter streams, etc. I'm actually quite keen on the technology and have studied it quite a bit.

The issue we have is that we are also using raw netflow for some product work we are doing, and Kafka is a lot trickier to package up into a Docker deployment than Filebeat/Logstash/ES due to its security model, dependency on ZooKeeper, and distributed nature.

Maybe in this case it would be easier to write an Elasticsearch output for an existing netflow parser that currently only sends to Kafka than to write a netflow input for a new data collector.

robcowart commented 3 years ago

With the availability of the beta of the new ElastiFlow Unified Flow Collector, Logstash will be deprecated as a collector. The new collector brings a lot of new features and, most importantly, fixes a lot of issues with Logstash. The performance is also more than 10x that of Logstash and 3x that of Filebeat. To get more information about the beta, as well as a link to the ElastiFlow Community Slack, go HERE