kbandla / dpkt

fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols

Fix processing of PCAP files with trimmed packets #657

Open Phikimon opened 1 year ago

Phikimon commented 1 year ago

Network traffic datasets often omit packet payloads to reduce dataset volume and, presumably, to protect privacy. This is done either with tcpdump's --snapshot-length option or with scripts that preserve only the headers up to the transport layer. One example is the MAWI lab datasets, where packets are trimmed to 34-96 bytes depending on the packet type.

$ capinfos -s mawi.pcap | grep "size limit"
Packet size limit:   inferred: 34 bytes - 96 bytes (range)
$ editcap -F pcap -r mawi.pcap singlepacket.pcap 1
$ tcpdump -Atnnr ./singlepacket.pcap 2>/dev/null
IP 203.189.86.188.443 > 207.141.234.127.20627: Flags [.], seq 2465589976:2465591396, ack 326528497, win 31088, length 1420
E....X@.;..d..V.......P......vm.P.yp|...

tcpdump reports a TCP payload length of 1420, but the contents printed in ASCII do not exceed 100 bytes, which indicates that the packet is trimmed. Running a very simple dpkt script (see below) that copies all packets from singlepacket.pcap to copied.pcap produces a file for which tcpdump reports an error:

$ python3 dpktcopy.py singlepacket.pcap copied.pcap
$ tcpdump -Atnnr ./copied.pcap 2>/dev/null
IP truncated-ip - 1420 bytes missing! 203.189.86.188.443 > 207.141.234.127.20627: Flags [.], seq 2465589976:2465591396, ack 326528497, win 31088, length 1420
E....X@.;..d..V.......P......vm.P.yp|...

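Wireshark exhibits similar problems, such as failing to associate packets that belong to the same flow. This happens because each packet record in the pcap format carries two length fields in its header: 'caplen', the number of bytes actually stored in the file, and 'len', the original length of the packet on the wire. Comparing the two tells tcpdump whether the packet was trimmed. Currently dpkt ignores the 'len' field and uses only 'caplen', so the copied record claims that the trimmed bytes are the whole packet.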
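
To make the two length fields concrete, here is a minimal sketch that walks the record headers of a classic pcap file using only the standard library. It is not dpkt code: `iter_record_lengths` is a hypothetical helper, it assumes a little-endian file with microsecond timestamps, and the 54/1474 byte values in the demo are made up to resemble the trimmed packet above.

```python
import struct

# Classic pcap layout; this sketch assumes little-endian, microsecond timestamps.
# Global header: magic, version major, version minor, thiszone, sigfigs, snaplen, linktype
GLOBAL_HDR = struct.Struct("<IHHiIII")
# Per-packet record header: ts_sec, ts_usec, caplen (bytes stored), len (bytes on the wire)
RECORD_HDR = struct.Struct("<IIII")


def iter_record_lengths(data):
    """Yield (caplen, orig_len) for each packet record in a pcap byte string."""
    magic = GLOBAL_HDR.unpack_from(data, 0)[0]
    if magic != 0xA1B2C3D4:
        raise ValueError("this sketch only handles little-endian microsecond pcap")
    offset = GLOBAL_HDR.size
    while offset + RECORD_HDR.size <= len(data):
        _sec, _usec, caplen, orig_len = RECORD_HDR.unpack_from(data, offset)
        yield caplen, orig_len
        # The packet bytes that follow the header are caplen long, not orig_len.
        offset += RECORD_HDR.size + caplen


# Hypothetical in-memory capture: one Ethernet packet trimmed to 54 bytes
# whose original on-wire length was 1474 bytes.
demo = GLOBAL_HDR.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
demo += RECORD_HDR.pack(0, 0, 54, 1474) + bytes(54)

lengths = list(iter_record_lengths(demo))
trimmed = [(caplen, orig_len) for caplen, orig_len in lengths if caplen < orig_len]
print(lengths, trimmed)
```

A record is trimmed exactly when its stored length is smaller than its on-wire length, which is the comparison a reader can no longer make once 'len' has been overwritten with 'caplen'.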

To fix this, I provide two commits: one for the Writer side and one for the Reader side. The former allows a 'len' value to be supplied and written to the pcap packet header; the latter exposes the value read from the pcap file to the user.
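
The effect of the Writer-side change can be illustrated at the byte level. The sketch below is not the code from the commits and does not use dpkt; `pack_record` is a hypothetical helper that builds one little-endian pcap record, and its optional `pktlen` argument only mimics the idea of passing the original length through.

```python
import struct

# Per-packet record header: ts_sec, ts_usec, caplen, len (little-endian assumed).
RECORD_HDR = struct.Struct("<IIII")


def pack_record(ts, buf, pktlen=None):
    """Build one pcap record; pktlen is the original on-wire length, if known."""
    caplen = len(buf)
    # Without pktlen, fall back to writing len == caplen, which loses the
    # information that the packet was trimmed.
    orig_len = caplen if pktlen is None else pktlen
    if orig_len < caplen:
        raise ValueError("original length cannot be smaller than captured length")
    sec = int(ts)
    usec = int(round((ts - sec) * 1_000_000))
    if usec == 1_000_000:  # guard against rounding up to a full second
        sec, usec = sec + 1, 0
    return RECORD_HDR.pack(sec, usec, caplen, orig_len) + bytes(buf)


payload = bytes(54)  # hypothetical packet trimmed to 54 bytes
naive = pack_record(1.5, payload)
preserved = pack_record(1.5, payload, pktlen=1474)

print(RECORD_HDR.unpack_from(naive))      # len equals caplen
print(RECORD_HDR.unpack_from(preserved))  # len keeps the on-wire size
```

Both records store the same 54 bytes; only the fourth header field differs, and that field is what lets tools distinguish a trimmed packet from a complete one.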

With the proposed API, the code changes needed to preserve the 'len' field are minimal:

 import dpkt
 import sys

 in_file = open(sys.argv[1], 'rb')
 out_file = open(sys.argv[2], 'wb')

-pcap = dpkt.pcap.Reader(in_file)
+pcap = dpkt.pcap.PktlenReader(in_file)
 writer = dpkt.pcap.Writer(out_file)

-def callback(ts, buf):
+def callback(ts, pktlen, buf):
     eth = dpkt.ethernet.Ethernet(buf)
-    writer.writepkt(eth, ts)
+    writer.writepkt(eth, ts, pktlen)

 pcap.dispatch(0, callback)
With the modified script, tcpdump no longer reports the copied packet as truncated:

$ python3 newdpktcopy.py singlepacket.pcap newcopied.pcap
$ tcpdump -Atnnr ./newcopied.pcap  2>/dev/null
IP 203.189.86.188.443 > 207.141.234.127.20627: Flags [.], seq 2465589976:2465591396, ack 326528497, win 31088, length 1420
E....X@.;..d..V.......P......vm.P.yp|...