Closed: milan-zededa closed this issue 1 year ago
There are two distinct cases here, although both have similar outcomes, caused during a copy(dst, src []T) call.
The first 2 happen here. If you dig deeply enough, you get down to where it tries to read from the buffer and copy bytes. The previous line gives the following log:

zedbox:go-pcap.(*Handle).processMmapPackets binary parsing packet header of size 48

That size comes from buf.Len(), so the buffer size is 48, and you can see in the next line that it is copying into a &hdr of type syscall.Tpacket3Hdr. The size of that is alignedTpacketHdrSize, which is set here to tpacketAlign(syscall.SizeofTpacket3Hdr), which is defined here as 0x30, i.e. 48.
If there were any issue, it should already have SIGSEGVed a few lines earlier, when we read precisely that many bytes from the byte slice to create the buffer here: buf := bytes.NewBuffer(b[:alignedTpacketHdrSize]). So this simply shouldn't be happening. Yet it is.
The only things I can think of offhand are:

- binary.Read() calculates the size of the &hdr, gets confused, and ends up with a larger size. This is extremely unlikely.
- b points to an mmap-ed data set from the kernel, which might have cleared it before we got a chance to read it. This could be.

If you manage to reproduce, we can add some logging and, even better, step through it.
The third dump you posted comes a little bit later, here. It successfully passed the previous line's read, referenced the slice (actually just after the regular data), and then called copy(). Once again, it checks the size before doing anything here, verifying that it is at least packetRALLSize, which is defined here and is much bigger than 19!
It is possible that this is related to the ring buffer as well, but it seems strange that it would SIGSEGV like that.
With both of these, if you can replicate, we can start to walk through them.
I was able to reproduce the issue and find the race between Close and processMmapPackets using this program: https://gist.github.com/milan-zededa/de150fea33a82ea2c438a10ff26a85f1

Run the program with a network interface name as the only argument. It repeatedly re-creates the pcap handle at a high frequency. It is important to run the program against an interface with a continuous stream of traffic. Eventually, Close interleaves with processMmapPackets in such a way that the ring memory is unmapped while processMmapPackets still continues processing the remaining packets in the (now invalid) buffer. It helps to have logging enabled, because processMmapPackets is then slowed considerably and is therefore more likely to have unfinished work after the handle is closed.
I have prepared a patch: https://github.com/packetcap/go-pcap/pull/42
But maybe there is an even simpler solution.
Ooohhh. Solid debugging. I will comment there.
Closing, fixed by https://github.com/packetcap/go-pcap/pull/42
In the EVE project, we recently started observing segmentation violations triggered from inside processMmapPackets. go-pcap is used by EVE in multiple places, and it is not clear which usage leads to the SIGSEGV. However, we suspect that the call to OpenLive made from here eventually leads to the panic. The version of go-pcap we are using is commit 2b2e9401028218c9c7381fbf76783c16ab0b725a.
I'm attaching 3 instances of this crash. For each, there is a text file with go-pcap logs and the stack trace at the end. Note that we may run multiple instances of OpenLive, so multiple packet captures running in parallel could be interleaved in the logs. It seems that the panic can manifest on at least two different code lines within processMmapPackets.

gopcap-crash1.txt gopcap-crash2.txt gopcap-crash3.txt
Right now I'm not able to find a simple sequence of steps to reproduce this on demand.