tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Support for pcap offline network packet format for tensorflow-io #264

Open yongtang opened 5 years ago

yongtang commented 5 years ago

This issue is to track the effort to add offline network packet format (pcap) data for tensorflow-io.

Due to the lack of pcap format support, there are many cases where users are forced to convert pcap files to CSV, XML, or JSON formats and back:

See https://github.com/tensorflow/io/issues/50#issuecomment-496729552 for additional details.

/cc @ivelin

ivelin commented 5 years ago

Thank you, @yongtang . I will keep an eye on this and try to help as I build my confidence with the code base.

H21lab commented 5 years ago

Hi @ivelin, @yongtang

I would suggest clarifying whether the goal is:

- only to decode pcap and be able to get the raw bytes from the packets, or
- to decode the different protocol layers as well.

As seen in the first example, wireshark/tshark dissects the different protocol layers, and only some decoded protocol fields are used to train the neural network. This is useful for training the network on relevant information only, but it then requires a protocol decoder (such as wireshark/tshark).

So it is not only about converting pcap into CSV, XML, or JSON: pcap is a "raw" format, whereas XML and JSON outputs are already protocol-decoded.

Thanks Martin

yongtang commented 5 years ago

@H21lab I think the original issue raised by @ivelin is more about specific cases for UDP, where the timestamp information (which could be important for ML models) can only be extracted from the UDP packet itself. The initial goal of the pcap dataset is to extract and provide the timestamp information together with the content itself.

With respect to the issue you raised about Wireshark/protocol decoding, I think this is also an area where tensorflow-io may help. If tensorflow-io allows taking input from stdin/stdout, then it is possible to pipe Wireshark output through tensorflow-io so that the processing works as a continuous pipeline.

ivelin commented 5 years ago

@H21lab good point. As @yongtang mentioned, the main goal here is to avoid conversion from a compact binary format to a 10x bigger text format before TF can ingest the data.

My suggestion is that with a library like dpkt, which can peek into pcap files layer by layer on the fly, people can programmatically pick and choose which attributes to feed into TF training and inference, without intermediate text-file conversion.

In real-world scenarios it's not uncommon for pcap files to grow by gigabytes per minute and terabytes per day for just a handful of monitored network interfaces. Memory, disk, and compute costs can become prohibitively expensive for TF processing.

Hope that makes sense.
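The layer-by-layer idea above can be sketched without any third-party dependency. The helper below is hypothetical (not part of tensorflow-io or this thread's PRs): it walks the classic little-endian pcap record layout — the same on-disk format dpkt's `pcap.Reader` parses — and yields the (timestamp, raw bytes) pairs discussed earlier, so callers can decode only the layers they care about:

```python
import struct

# Classic pcap layout (little-endian variant, magic 0xA1B2C3D4):
GLOBAL_HDR = struct.Struct("<IHHiIII")  # magic, ver_major, ver_minor, thiszone, sigfigs, snaplen, linktype
RECORD_HDR = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len

def read_pcap(data):
    """Yield (timestamp_seconds, raw_packet_bytes) from classic pcap bytes."""
    magic = GLOBAL_HDR.unpack_from(data, 0)[0]
    if magic != 0xA1B2C3D4:
        raise ValueError("unsupported byte order or pcapng file")
    offset = GLOBAL_HDR.size
    while offset + RECORD_HDR.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = RECORD_HDR.unpack_from(data, offset)
        offset += RECORD_HDR.size
        yield ts_sec + ts_usec / 1e6, data[offset:offset + incl_len]
        offset += incl_len

# Tiny in-memory demo: one 42-byte dummy packet captured at t = 1.5 s.
pkt = b"\x00" * 42
blob = (GLOBAL_HDR.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
        + RECORD_HDR.pack(1, 500000, len(pkt), len(pkt)) + pkt)
for ts, raw in read_pcap(blob):
    print(ts, len(raw))  # → 1.5 42
```

In a real pipeline the bytes would come from a file (or a stream), and each `raw` buffer could be handed to a decoder such as dpkt's `dpkt.ethernet.Ethernet` for selective field extraction.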

H21lab commented 5 years ago

@ivelin @yongtang This makes sense. Maybe also consider other Python libraries such as scapy or pyshark (pyshark perhaps with the use_json parameter for better performance). dpkt is probably the fastest, though it cannot decode as many protocols; for this use case it can be a good choice.

ivelin commented 5 years ago

@H21lab thanks for the references. I'll take a look at these alternative python pcap libs.

ivelin commented 5 years ago

@H21lab Just fyi. Pull request #303 submitted. Feel free to take a look and chime in. @yongtang is helping me get it through the process.

yongtang commented 5 years ago

@H21lab With respect to your anomaly detection, my understanding is that you may only be interested in certain fields from the network capture (pcap).

Processing pcap files directly might be fine. However, wireshark/tshark is also heavily involved in packet reassembly and higher-level (application) protocol decoding. That part is tied to the GPL license, so we could not easily link against or include it.

What we could do, though, is create a tf.data.Dataset pipeline by reading tshark's standard output (stdout) through a standard-input (stdin) pipe. The tf.data.Dataset pipeline could then be provided directly to tf.keras for training or inference.

This will allow streaming processing (no need to read and convert the whole pcap file, just streaming the data).

Created a PR #320 to add stdin support for tensorflow_io.text.TextDataset.

Here is an example (available in tests/test_text/stdin_test.py):

import tensorflow as tf
if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()
import tensorflow_io.text as text_io # pylint: disable=wrong-import-position

# Note: run the following:
#  tshark -T fields -e frame.number -e ip.dst -e ip.proto -r \
#         attack-trace.pcap | python stdin_test.py

# Note: decode_csv to obtain the fields. Further preprocessing
# is needed to extract feature (or feature + label).
def f(v):
  frame_number, ip_dst, ip_proto = tf.decode_csv(
      v, [[0], [''], [0]], field_delim='\t')
  return frame_number, ip_dst, ip_proto

text_dataset = text_io.TextDataset("file://-").map(f)

# The following iterates through the dataset,
# though the dataset could also be passed to tf.keras
# with model.fit() or model.evaluate() directly.
# Preprocessing has to be done in the f(v) function.
for (frame_number_value, ip_dst_value, ip_proto_value) in text_dataset:
  print(ip_dst_value.numpy())

H21lab commented 5 years ago

Hi @yongtang,

thank you for the PR. Yes, I think this is the right approach.

Regards Martin

yongtang commented 5 years ago

Thanks @H21lab. It looks like tshark's -T ek output is really a newline-delimited JSON format (ndjson). The format is also widely used in other datasets such as quickdraw. The tensorflow-io package does not support this format yet, though once the JSON parser (PR #310) is added, ndjson support could be added with minimal changes.
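To make the ndjson point concrete, here is a minimal sketch (not part of any PR in this thread) of consuming `tshark -T ek`-style output, where each line is one JSON object and index lines alternate with packet lines; the `"layers"`-key check and the sample lines are assumptions about the ek layout, so verify against real tshark output:

```python
import json

def parse_ek_lines(lines):
    """Parse newline-delimited JSON (ndjson) as emitted by `tshark -T ek`.

    ek output interleaves index lines ({"index": ...}) with packet
    lines; keep only the packet records, identified here (assumption)
    by the presence of a "layers" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if "layers" in obj:
            yield obj["layers"]

# Hypothetical two-line ek sample: one index line, one packet line.
sample = [
    '{"index": {"_index": "packets-2019-06-18"}}',
    '{"timestamp": "1", "layers": {"frame": {"frame_frame_number": "1"}}}',
]
for layers in parse_ek_lines(sample):
    print(layers["frame"]["frame_frame_number"])  # → 1
```

A generator like this could back a tf.data pipeline the same way the stdin-based TextDataset example above does, once the JSON parser lands.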