hgn / captcp

A open source program for TCP analysis of PCAP files
http://research.protocollabs.com/captcp/
GNU General Public License v3.0
113 stars 40 forks source link

Connection hash collisions #32

Closed kristahi closed 8 years ago

kristahi commented 8 years ago

Hi,

I discovered that the way captcp hashes/identifies flows in order to associate them with connections is flawed.

Connections and subflows are identified by a kind of unique identifier/hash, uid computed in TcpConn.__init__ beginning at line 2273:

l = [ord(a) ^ ord(b) for a,b in zip(self.sipnum, self.dipnum)]

self.uid = "%s:%s:%s" % (
               str(self.ipversion),
               str(l),
               str(long(self.sport) + long(self.dport)))

This identifier is used in various places as an index/key into hashtables (well, dicts :) ) containing connections, and is notably used for associating subflows (uni-directional) with the connection (bi-directional) they are part of. For this purpose, the hash needs to be the same for any given port-pair between the same two addresses, regardless of source/destination ordering. I suppose this is why they are added together, since this is a commutative operation.

The problem is that uid is used as a direct key into tables, without checking for hash collisions. The above function is obviously not a perfect hash function, and will produce collisions if you are unlucky enough to have two connections with port pairs that add up to the same thing, which corrupts the data for those connections.

I am proposing a patch to rectify this by concatenating the (hex string representations of the) ordered port pair together instead of summation. This preserves the commutativity, but ensures we will not get collisions for the port pair part of the identifier. Note that the hash function overall still isn't perfect, you could get collisions between the IP address pairs, but that is perhaps slightly less likely.

Best wishes, Kristian

PS. If you wonder how I came across this, I managed to trigger this bug by starting a bunch of connections to sequentially increasing port numbers almost simultaneously under FreeBSD. When FreeBSD has to generate a lot of port numbers in a very short time, it temporarily switches ephemeral port selection strategy to simple sequential incrementation, see: https://www.cymru.com/jtk/misc/ephemeralports.html

Sometimes a pair of connect() calls would get inversed and I ended up with port pairs such as