sadikovi / spark-netflow

NetFlow data source for Spark SQL and DataFrames
Apache License 2.0

Consider adding aggregation similar to flow-tools #55

Open sadikovi opened 7 years ago

sadikovi commented 7 years ago

Aggregation should be flexible, e.g. specifying groupBy columns and aggregation functions on numeric columns. We also need to investigate why flow-tools drops records in some cases when generating reports.
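
A rough sketch of what this could look like with plain Spark SQL functions (the format name follows the README; the column names "srcip", "dstip", "octets", "packets" are assumptions about the v5 schema, not confirmed names):

```scala
import org.apache.spark.sql.functions._

// Load NetFlow v5 files through this package, then aggregate the way a
// flow-tools report would: group by endpoints and sum numeric columns.
// Column names are assumed here, not verified against the actual schema.
val flows = sqlContext.read
  .format("com.github.sadikovi.spark.netflow")
  .option("version", "5")
  .load("path/to/netflow/files")

val report = flows
  .groupBy("srcip", "dstip")
  .agg(
    sum("octets").as("total_octets"),
    sum("packets").as("total_packets"))
```

Comparing output like this against a flow-tools report on the same files could also help narrow down where flow-tools drops records.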

r4ravi2008 commented 6 years ago

Any update on this? Also, are you considering implementing the ability to calculate flows directly from pcap files?

sadikovi commented 6 years ago

Hi,

I have not done much work on this. I normally use it with Spark, so aggregation can be done there (it is still slow, in my opinion, but I will address that later). I do not have a pcap file sample to implement this functionality against, so this issue is sort of stuck.

If you could help with pcap files, that would be great.

P.S. Could you give me a link to the pcap format and explain a little bit what it is structurally? I normally work with NetFlow files only, so I do not have much experience with other formats.

r4ravi2008 commented 6 years ago

This should give you an idea about the pcap file format: https://wiki.wireshark.org/Development/LibpcapFileFormat. It's pretty straightforward.

You will also find a lot of pcap samples on the same website.
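
Per that page, a capture starts with a 24-byte global header followed by per-packet records. A minimal sketch of reading the global header (the file name is a placeholder):

```scala
import java.io.{DataInputStream, FileInputStream}
import java.nio.{ByteBuffer, ByteOrder}

// Read the 24-byte libpcap global header described on the wiki page.
val in = new DataInputStream(new FileInputStream("capture.pcap"))
val header = new Array[Byte](24)
in.readFully(header)

val buf = ByteBuffer.wrap(header) // big-endian by default
// The magic number tells us the byte order: 0xa1b2c3d4 matches the
// current order, 0xd4c3b2a1 means the file uses the opposite order.
if (buf.getInt(0) == 0xd4c3b2a1) buf.order(ByteOrder.LITTLE_ENDIAN)

val versionMajor = buf.getShort(4)  // format version, currently 2.4
val versionMinor = buf.getShort(6)
val snaplen      = buf.getInt(16)   // max bytes captured per packet
val network      = buf.getInt(20)   // link-layer type, e.g. 1 = Ethernet
in.close()
```

Each packet that follows has its own 16-byte record header (ts_sec, ts_usec, incl_len, orig_len) and then incl_len bytes of packet data.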

I have a few questions so I can better understand if the feature I requested makes sense to implement in this library.

What software produces NetFlow files for you? What is the main use case of this library, and how is it supposed to be used?

Is there a gitter channel available so we can take this discussion further?

sadikovi commented 6 years ago

Thanks for the link.

pcap files look similar to NetFlow files, though the header is simpler, which is a good thing. I can generate sample NetFlow files using flow-gen, which comes with flow-tools; I believe one can still install it with apt-get install flow-tools. You can also use nfdump to read those files.

The specification is here (the streaming variant; I use files, which are slightly different): http://netflow.caligare.com/netflow_v5.htm
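
For reference, a sketch of the flow record layout from that spec (48 bytes per record; offsets in comments, padding fields mostly omitted):

```scala
// NetFlow v5 flow record fields from the linked spec (streaming variant;
// the on-disk files mentioned above differ slightly).
case class NetFlowV5Record(
  srcaddr: Int,    //  0: source IPv4 address
  dstaddr: Int,    //  4: destination IPv4 address
  nexthop: Int,    //  8: IPv4 address of next-hop router
  input: Short,    // 12: SNMP index of input interface
  output: Short,   // 14: SNMP index of output interface
  dPkts: Int,      // 16: packets in the flow
  dOctets: Int,    // 20: total bytes in the flow
  first: Int,      // 24: SysUptime at start of flow
  last: Int,       // 28: SysUptime when the last packet was received
  srcport: Short,  // 32: TCP/UDP source port
  dstport: Short,  // 34: TCP/UDP destination port
  tcpFlags: Byte,  // 37: cumulative OR of TCP flags (offset 36 is padding)
  prot: Byte,      // 38: IP protocol, e.g. 6 = TCP, 17 = UDP
  tos: Byte,       // 39: IP type of service
  srcAs: Short,    // 40: AS of the source
  dstAs: Short,    // 42: AS of the destination
  srcMask: Byte,   // 44: source prefix mask bits
  dstMask: Byte    // 45: destination prefix mask bits (46-47 are padding)
)
```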

Normally, we get files delivered in this format already (I assume collected and compressed by some Cisco software and hardware); files can be somewhat large (hundreds of megabytes of compressed binaries).

This library is written mainly to use Apache Spark (http://spark.apache.org/) to read files and utilize a cluster for easy ETL, since the library converts NetFlow data into a DataFrame, but it can also be used as Java code to read files. There is a section in the README on how to run a very simple test. Some sample files are also included in the repository as test resources.
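
The simple test is roughly this (format name per the README; the path is a placeholder for one of the bundled test files):

```scala
// Read a NetFlow v5 file as a DataFrame using this data source.
val df = sqlContext.read
  .format("com.github.sadikovi.spark.netflow")
  .option("version", "5")
  .load("path/to/netflow/file")

df.printSchema()
df.show()
```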

Do you use Spark to read pcap files?

Unfortunately there is no gitter channel.

r4ravi2008 commented 6 years ago

@sadikovi Thanks for the explanation. Is the process of dumping NetFlow files automated out of the box - meaning, is Cisco hardware capable of doing that, or is there some additional code that extracts the NetFlow files from the hardware and dumps them in the place your Spark job is looking for?

Yes, we read pcap files using Spark, and technically speaking we should be able to calculate flows directly from pcap records. I guess I am stuck at researching that bit :)
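
Roughly, once packets are parsed out of pcap files, flows fall out of grouping on the 5-tuple. A sketch with placeholder column names (real flow export also splits long conversations on timeouts, which this ignores):

```scala
import org.apache.spark.sql.functions._

// packets: DataFrame with columns (srcIp, dstIp, srcPort, dstPort, proto,
// tsSec, len) parsed from pcap records - all placeholder names.
val flows = packets
  .groupBy("srcIp", "dstIp", "srcPort", "dstPort", "proto")
  .agg(
    min("tsSec").as("first"),
    max("tsSec").as("last"),
    count("*").as("packets"),
    sum("len").as("octets"))
```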

sadikovi commented 6 years ago

@r4ravi2008 Something like that; I am not exactly sure how collection happens - my main work is making sure that Spark can read whatever files were delivered :)

I will have a look at pcap files this weekend to see how difficult it is to implement/use an existing reader, and will try to make it not rely on any external commands.

How do you read pcap files? Do you use PipedRDDs and call a shell command to read the files?

sadikovi commented 6 years ago

In addition to the wiki, I will also be using this repo as a reference (it looks like it has quite a few examples): https://github.com/markofu/pcaps/tree/master/PracticalPacketAnalysis/ppa-capture-files.

r4ravi2008 commented 6 years ago

@sadikovi To read pcap files I used PortableDataStream and parsed the binary data. You can do a similar thing with the new Hadoop API if you want an RDD, or by specifying a DefaultSource if you want a DataFrame directly.
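
Roughly like this; `parsePcap` is a placeholder for the actual parsing logic and is assumed to read the whole stream eagerly and return a Seq of records:

```scala
import org.apache.spark.input.PortableDataStream

// sc.binaryFiles yields one (path, PortableDataStream) pair per file;
// open the stream and parse the raw bytes into records.
val records = sc.binaryFiles("/data/pcap/*")
  .flatMap { case (path, stream) =>
    val in = stream.open()  // DataInputStream over the file contents
    try parsePcap(in) finally in.close()
  }
```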

For parsing I used references from multiple sources, namely: this and this.

If you are aiming for this library to be something like ntop/nprobe but with scalability, I think it makes sense to add the feature I mentioned, and I will be happy to help with that aspect :)

sadikovi commented 6 years ago

@r4ravi2008 would appreciate your help with this, thanks!