kbandla / dpkt

fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols

Running dpkt in Spark #495

Closed · BrooksDonuts closed this 3 years ago

BrooksDonuts commented 3 years ago

We store our pcap data in S3. We have a Python process that downloads the S3 data to a file and uses dpkt to process those files, which works, but doesn't scale.

I'm writing a Spark version of this, which will scale much better. However, dpkt seems to expect a file. The pcap data in S3 is just binary data, which can't be passed to dpkt directly. I've tried using a Python temp file like this:

import tempfile

import dpkt

def parsePCAP(binaryPCAP):
    # Write the raw bytes to a temporary file so dpkt can read from it
    tf = tempfile.TemporaryFile(mode="w+b")
    tf.write(binaryPCAP)
    tf.seek(0)
    pcap = dpkt.pcap.Reader(tf)

The argument (binaryPCAP) is the raw binary content of the pcap file.

But this produces the following error:

File "/usr/local/lib/python3.7/site-packages/dpkt/pcap.py", line 287, in __init__
    raise ValueError('invalid tcpdump header')
ValueError: invalid tcpdump header

Does anyone have a solution to this, either a better approach or a fix for this one?
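For reference, one quick check (a sketch, assuming binaryPCAP is the exact bytes being passed in) is whether the data actually begins with a pcap global header; if only part of the object was read from S3, the magic number won't match and pcap.Reader raises exactly this error:

import struct

def looks_like_pcap(binaryPCAP):
    # Classic pcap global headers start with one of these magic numbers
    # (microsecond/nanosecond resolution, either byte order); pcapng files
    # start with 0x0a0d0d0a instead and need a different reader.
    magics = {0xa1b2c3d4, 0xd4c3b2a1, 0xa1b23c4d, 0x4d3cb2a1}
    (magic,) = struct.unpack("<I", binaryPCAP[:4])
    return magic in magics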

rliebscher commented 3 years ago

If you already have your file data as a bytes object, it shouldn't be necessary to write a temporary file. You can probably wrap it in an io.BytesIO object (see also https://docs.python.org/3/library/io.html) and use that with pcap.Reader.

kbandla commented 3 years ago

As @rliebscher suggested, dpkt.pcap.Reader just needs a file-like object, and BytesIO is perfect for your case:

from io import BytesIO

import dpkt

...
tf = BytesIO(binaryPCAP)  # wrap the raw bytes in an in-memory file-like object
pcap = dpkt.pcap.Reader(tf)
for ts, data in pcap:
    print(ts, data)
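For the Spark side, a minimal sketch of how this could be wired up (the bucket path and the parse_pcap helper are illustrative, not part of dpkt): sc.binaryFiles returns (path, bytes) pairs, so each file's bytes can be wrapped in BytesIO the same way:

from io import BytesIO

import dpkt

def parse_pcap(binary_pcap):
    # Wrap the raw pcap bytes in an in-memory file-like object
    pcap = dpkt.pcap.Reader(BytesIO(binary_pcap))
    # Return (timestamp, frame length) pairs; replace with whatever
    # per-packet processing is needed
    return [(ts, len(buf)) for ts, buf in pcap]

# rdd = sc.binaryFiles("s3://your-bucket/pcaps/")   # -> (path, bytes) pairs
# packets = rdd.flatMap(lambda kv: parse_pcap(kv[1]))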