nmap / npcap

Nmap Project's Windows packet capture and transmission library
https://npcap.com

Avoiding MmProbeAndLockPages overhead in ReadFile #622

Open Ext3h opened 1 year ago

Ext3h commented 1 year ago

Describe the bug
Technically a performance bug in the combination of npcap with libpcap, encountered when trying to tune npcap for 10 Gbit+ operation.

This assumes 8 MB user-space and kernel-side buffers for npcap, in order to get the rate of syscalls down to a manageable level in the first place.
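For reference, a minimal sketch of how such a configuration can be set up from the libpcap side; the 8 MB figures match the assumption above, the helper function and device name are placeholders, and error handling is abbreviated:

```c
/* Sketch: 8 MB kernel and user-space buffers via libpcap.
 * pcap_setuserbuffer() is the Npcap/WinPcap extension that sizes the
 * buffer later handed to PacketReceivePacket. */
#include <pcap.h>

pcap_t *open_tuned(const char *device, char *errbuf)
{
    pcap_t *p = pcap_create(device, errbuf);
    if (p == NULL)
        return NULL;

    pcap_set_snaplen(p, 65535);
    pcap_set_timeout(p, 100);                      /* read timeout, ms */
    pcap_set_buffer_size(p, 8 * 1024 * 1024);      /* 8 MB kernel ring */

    if (pcap_activate(p) != 0) {
        pcap_close(p);
        return NULL;
    }

    /* 8 MB user-space staging buffer; these are the pages that get
     * probed and locked on every read, as described below. */
    pcap_setuserbuffer(p, 8 * 1024 * 1024);
    return p;
}
```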

For context, have a look at the opposing side: https://github.com/the-tcpdump-group/libpcap/blob/fbcc461fbc2bd3b98de401cc04e6a4a10614e99f/pcap-npf.c#L542 https://github.com/the-tcpdump-group/libpcap/blob/fbcc461fbc2bd3b98de401cc04e6a4a10614e99f/pcap-npf.c#L432

When using PacketReceivePacket with the above buffer sizes, a significant share of the time (70%+) is spent in MmProbeAndLockPages at the NtReadFile level before NPF_read runs, and correspondingly in MmUnlockPages at IoCompleteRequest.

When using libpcap, the memory for the user-space buffer is allocated privately within libpcap and is never exposed in raw form to the user of the API. Only PacketInitPacket or PacketReceivePacket on the npcap side is guaranteed to see the raw buffer.

Given the design of these two APIs, as a user I can't do anything from the outside to speed up MmProbeAndLockPages: I can't choose large pages, and I can't use VirtualLock.

What's worse is that the overhead scales linearly with the size of the user-space buffer, not with the actual amount of data transferred.
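To illustrate where the cost comes from (a rough sketch, not the actual Packet.dll source): each PacketReceivePacket call boils down to a ReadFile on the NPF device handle with the full user buffer, so the kernel locks every page of that buffer per call, regardless of how few bytes actually come back.

```c
/* Simplified illustration of the per-call cost. The whole 8 MB buffer
 * is passed down on every read, so every page of it is probed and
 * locked before NPF_read runs and unlocked again at IoCompleteRequest,
 * even if only a few kilobytes of packets are returned. */
#include <windows.h>

DWORD read_once(HANDLE npf_device, void *user_buffer, DWORD buffer_len)
{
    DWORD bytes_read = 0;

    /* Kernel side: the I/O manager builds an MDL for user_buffer and
     * calls MmProbeAndLockPages over all buffer_len bytes -- the cost
     * is proportional to buffer_len, not to bytes_read. */
    ReadFile(npf_device, user_buffer, buffer_len, &bytes_read, NULL);

    return bytes_read;   /* often a small fraction of buffer_len */
}
```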

Expected behavior

Specifying a large user-space buffer on the libpcap side doesn't result in excessive overhead on the npcap side.

Either npcap optimizes the buffer for repeated use (e.g. by explicitly applying VirtualLock in PacketInitPacket), or libpcap becomes smarter about how it allocates the memory.
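A hypothetical sketch of the first option, written as a wrapper around the public Packet.dll API; the helper name, the working-set sizes and the VirtualLock call are all part of the proposal, not existing npcap behaviour, and whether VirtualLock actually removes the MmProbeAndLockPages cost is questioned further down the thread.

```c
/* Hypothetical sketch: pin the user buffer once when it is registered,
 * so later reads don't pay the probe-and-lock cost repeatedly.
 * This is a proposal, not current npcap behaviour. */
#include <windows.h>
#include <Packet32.h>

VOID PacketInitPacketPinned(LPPACKET lpPacket, PVOID Buffer, UINT Length)
{
    /* VirtualLock fails if Length exceeds the working-set minimum, so
     * grow it first; the 32/64 MB figures are purely illustrative. */
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             32 * 1024 * 1024, 64 * 1024 * 1024);

    /* Pin the pages once for the lifetime of the capture buffer. */
    VirtualLock(Buffer, Length);

    /* Existing registration of the buffer with Packet.dll. */
    PacketInitPacket(lpPacket, Buffer, Length);
}
```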

Ext3h commented 1 year ago

The issue appears to have been exacerbated by not activating blocking mode, and thereby calling PacketReceivePacket with a large buffer excessively often. The overhead is significantly lower in blocking mode, where the majority of calls do actual work.
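For reference, a sketch of the blocking-mode setup described here, using the Windows-specific libpcap calls pcap_setnonblock and pcap_setmintocopy; the mintocopy value is illustrative.

```c
/* Sketch: keep the handle in blocking mode (the libpcap default) and
 * only return from the kernel once a meaningful amount of data is
 * buffered, instead of issuing near-empty PacketReceivePacket calls
 * that still pay the full lock/unlock cost. */
#include <pcap.h>

void capture_loop(pcap_t *p)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct pcap_pkthdr *hdr;
    const u_char *data;

    pcap_setnonblock(p, 0, errbuf);        /* explicit blocking mode */
    pcap_setmintocopy(p, 512 * 1024);      /* don't wake up for a trickle */

    for (;;) {
        int rc = pcap_next_ex(p, &hdr, &data);
        if (rc == 1) {
            /* process packet */
        } else if (rc < 0) {
            break;                         /* error or end of capture */
        }
        /* rc == 0: read timeout expired with no packets */
    }
}
```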

dmiller-nmap commented 1 year ago

This is an interesting idea, and I'm reopening it for further investigation. I would be interested in what your bottlenecks turn out to be once you've optimized transfers using the existing tools: pcap_setmintocopy() and pcap_getevent() or pcap_set_timeout(), as well as managing buffer sizes with pcap_setuserbuffer() and pcap_set_buffer_size().
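As an illustration of the event-driven variant these tools allow, here is a sketch that keeps the handle non-blocking but only reads after the kernel event returned by pcap_getevent is signalled, so nearly every read transfers data; the timeout and mintocopy values are illustrative, and the handle is assumed to be opened as in the earlier snippets.

```c
/* Sketch: event-driven reads on Windows. pcap_getevent() returns the
 * event NPF signals once "mintocopy" bytes are buffered; waiting on it
 * means almost every PacketReceivePacket call does real work. */
#include <pcap.h>
#include <windows.h>

void event_driven_loop(pcap_t *p)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    HANDLE evt = pcap_getevent(p);
    struct pcap_pkthdr *hdr;
    const u_char *data;
    int rc = 0;

    pcap_setnonblock(p, 1, errbuf);        /* reads never block...      */
    pcap_setmintocopy(p, 512 * 1024);      /* ...and the event only
                                              fires once plenty is
                                              buffered                  */
    for (;;) {
        WaitForSingleObject(evt, 100);     /* 100 ms illustrative cap   */
        while ((rc = pcap_next_ex(p, &hdr, &data)) == 1) {
            /* process packet */
        }
        if (rc < 0)
            break;                         /* error or end of capture   */
    }
}
```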

I do think we could avoid some of this overhead, but VirtualLock may not be the right tool. Here are some links I found so far on the topic of locking a buffer to be used for DMA between user mode and kernel mode: