the-tcpdump-group / libpcap

the LIBpcap interface to various kernel packet capture mechanism
https://www.tcpdump.org/
Other
2.72k stars 854 forks source link

Should the stdio buffer size be increased for the dump code? #792

Open guyharris opened 5 years ago

guyharris commented 5 years ago

A blog post from Napatech says:

The most obvious bottleneck was in the pcap_dump() function within libpcap and actually the problem should be fixed within libpcap if you ask me, but I fixed it in DPDK temporally. Basically the pcap_dump() function does a fwrite(pcap header of 16Bytes) followed by a fwrite(packet payload) and these two fwrite(s) cause two kernel traps, which are expensive and should be minimized. The solution is to buffer the output to file and write seldom but more data at every write, so in essence do: if (buffer has room) do memcpy header and payload to buffer else fwrite buffer

That's not strictly true; what fwrite() does is, in essence:

if (buffer has room) do
  memcpy header and payload to buffer
else
  write buffer

so each call to fwrite() does not result in a single system call ("write buffer" is assumed to turn into a system call - write() on UN*X, _write() -> WriteFile() -> NtWriteFile() on modern Windows).

However, he may have used a larger buffer in the buffering wrapped around fwrite() than the standard I/O library was using. On Linux with glibc, and on BSD-flavored UN*Xes including macOS, the buffer size for a regular file will be the st_blksize value supplied by the file system for an fstat() (unless the file system fails to supply it and it's zero, in which case it's 8KiB with glibc and 1KiB on BSD-flavored UN*Xes). Other UN*Xes may also use st_blksize. On Windows, it's 4KiB.

st_blksize, in 4.2BSD, was the FFS block size of the file system, typically 8KiB. It appears to be 4MiB on APFS and 4KiB on HFS+ on High Sierra.

So what we might want to do is:

bufsize = something large (bigger than 8KiB, at least);
#ifdef WE_HAVE_ST_BLKSIZE_IN_STRUCT_STAT
if (fstat(fileno(the file), &statb) succeeds and statb.st_blksize > bufsize)
    bufsize = statb.st_blksize;
#endif
buffer = malloc(bufsize);
if (buffer == NULL)
    either fail completely or give up on increasing the buffer;
setvbuf(fileno, buffer, _IOFBF, bufsize);

The Single UNIX Specification doesn't guarantee that passing a non-zero buffer size and a null buffer pointer will cause setvbuf() to allocate a buffer of that size internally; the C99 spec says only that

If buf is not a null pointer, the array it points to may be used instead of a buffer allocated by the setvbuf function) and the argument size specifies the size of the array; otherwise, size may determine the size of a buffer allocated by the setvbuf function.

emphasis on "may". Therefore, we allocate it manually.

This is a bit tricky as it would require us to free the buffer on close; the C99 spec says of fclose() that

Whether or not the call succeeds, the stream is disassociated from the file and any buffer set by the setbuf or setvbuf function is disassociated from the stream (and deallocated if it was automatically allocated).

which seems to indicate that such a buffer won't be freed automatically on close as it was manually allocated outside the standard I/O library.

guyharris commented 5 years ago

He did use a larger buffer - 16KiB. That was probably on Linux; the default st_blksize on Linux appears to be the page size, so it was probably 4KiB. If a 16KiB fwrite() bypasses the buffer and just writes directly to the file, as a result of being a multiple of the buffer size, then that'd be a 4x reduction in the number of write() calls and thus a 4x reduction in the number of system calls, which might make a significant difference.

A bigger buffer would result in an even greater reduction, although I don't know whether a buffer that's close to or bigger than the L1 data cache (32K on most of Intel's x86-64 processors) would be a problem.

guyharris commented 5 years ago

As for an automatically-allocated buffer:

However:

The System V Release 2 setvbuf() documentation says "If buf is not the NULL pointer, the array it points to will be used for buffering, instead of an automatically allocated buffer. Size specifies the size of the buffer to be used.", so that appears to be where the second group of UN*Xes - all SV-based, with a ton of BSD and vendor stuff added on top (Solaris being based on SVR4 which was a huge merge of SunOS 4.x into the SV code base, giving not only BSD stuff but Sun stuff) - got their man page text, so I guess some tomatoes should be thrown at AT&T for not making it clearer.

So, just for the lulz, let's see if we can find the SV source.

Here on archive.org there's a page that claims to have SV source from several releases; the tarball for SVR2 for NS32xxx (SysVr2.0-32000) has a version of setvbuf() with what appears to be the AT&T copyright notice from my days at Sun, and it...

...will, if the mode argument isn't _IONBF:

So I rather suspect that the `setvbuf() in any UN*X with an SV-derived stdio library - which probably means "all the traditional commercial UN*Xes that still survive, plus Tru64 and even IRIX" - will, if called early enough (i.e., before you do anything that needs a buffer), and passed a null buffer pointer and a non-zero buffer size, allocate a buffer of that size.

Any UN*X with a 4.4 BSD/4.4-Lite BSD-derived stdio library - which means the *BSDs, macOS/iOS/tvOS/watchOS, and, apparently, Android - will do so as well.

Unless the Linux documentation is lying, GNU libc's stdio library will do that.

The Windows libc will do that as well.

And, from a look at the MUSL source, it ignores the size argument if the buf argument is null.

So my inclination is to do the setvbuf() call; if it doesn't increase the buffer size, you lose, I guess, and should go yell at the supplier of your libc to make it work. At least the dumping APIs will still work, with the default buffer size, even if that means more system calls than you'd like.

(Given that a pcap_dumper_t * is just a FILE *, meaning we don't have some data structure that we control, we have no mechanism for getting at the buffer at close time to free it, so we have to have the automatically-allocated buffer.)

AndersBroman commented 5 years ago

Something like this? --- a/sf-pcap.c +++ b/sf-pcap.c @@ -771,7 +771,11 @@ pcap_setup_dump(pcap_t p, int linktype, FILE f, const char *fname) else setvbuf(f, NULL, _IONBF, 0);

endif

Lekensteyn commented 5 years ago

With GNU libc (glibc), setvbuf(fp, NULL, _IOFBF, size) will ignore size if buf is NULL, see libio/iosetvbuf.c:


int _IO_setvbuf(FILE* fp, char* buf, int mode, size_t size) {
  int result;
  CHECK_FILE(fp, EOF);
  _IO_acquire_lock(fp);
  switch (mode) {
    case _IOFBF:
      fp->_flags &= ~(_IO_LINE_BUF | _IO_UNBUFFERED);
      if (buf == NULL) {
        if (fp->_IO_buf_base == NULL) {
          /* ... */
          if (_IO_DOALLOCATE(fp) < 0) {
            result = EOF;
            goto unlock_return;
          }
          fp->_flags &= ~_IO_LINE_BUF;
        }
        result = 0;
        goto unlock_return;
      }
      break;
      // ...
  }
  result = _IO_SETBUF(fp, buf, size) == NULL ? EOF : 0;

unlock_return:
  _IO_release_lock(fp);
  return result;
}

If buf is NULL, then a buffer is allocated via _IO_DOALLOCATE. That is a macro that will probably end up calling _IO_file_doallocate from libio/filedoalloc.c which uses BUFSIZE or (if positive and smaller), st.st_blksize. So this is probably not what you want, and an explicit buffer has to be given.

guyharris commented 5 years ago

Yeah, the Linux man page isn't as explicit as I'd like, but "If the argument *buf* is NULL, only the mode is affected" does suggest that the size isn't affected.

If an explicit buffer is given, we don't have any good way to free it in pcap_dump_close(), given that:

  1. a pcap_dumper_t * is just another name for a FILE *, so we don't have any place to save a pointer to the buffer;
  2. there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE * if pcap_dump_flush() isn't available, so it's risky to change what a pcap_dumper_t is;
  3. there's no generally-available API to get a pointer to the buffer.

So the choices we have are:

  1. don't do this on Linux;
  2. develop new APIs for dumping, with a handle type different from pcap_dumper_t * that points to an opaque structure that includes the FILE * and a void * pointing to the buffer.

I'm inclined to go for the second option.

mcr commented 5 years ago

Guy Harris notifications@github.com wrote:

1 a pcap_dumper_t is just another name for a FILE , so we don't have any place to save a pointer to the buffer; 2 there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE

I guess this was a bad thing. Such code would fail to recompile, but of course it's an ABI change.

> * if pcap_dump_flush() isn't available, so it's risky to change what a
> pcap_dumper_t is; 3 there's no generally-available API to get a pointer
> to the buffer.

:-(

> So the choices we have are:

> 1 don't do this on Linux;
> 2 develop new APIs for dumping, with a handle
> type different from pcap_dumper_t * that points to an opaque structure
> that includes the FILE * and a void * pointing to the buffer.

> I'm inclined to go for the second option.

Me too. We are going to need it for pcapng.

-- ] Never tell me the odds! | ipv6 mesh networks [ ] Michael Richardson, Sandelman Software Works | IoT architect [ ] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

guyharris commented 5 years ago

We are going to need it for pcapng.

Yes. I wouldn't want to require pcapng to get the bigger stdio buffer, and I'm not sure whether we should have a single new API for writing both pcap and pcapng files, but either way we should have support for writing pcap files, not just pcapng files, with a bigger stdio buffer.

Lekensteyn commented 5 years ago

pcap_dumper_t is already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted from FILE * is an internal implementation detail. Are there any issues with converting this to an actual structure?

guyharris commented 5 years ago

pcap_dumper_t is already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted from FILE * is an internal implementation detail. Are there any issues with converting this to an actual structure?

To quote an earlier comment:

there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE * if pcap_dump_flush() isn't available, so it's risky to change what a pcap_dumper_t is;

mcr commented 5 years ago

I'm unclear if there is still work to do here.

Oppen commented 5 years ago

I think not, and the lack of activity during 5 months backs me up. Can it be closed?

mcr commented 5 years ago

We are going to do the setvbuf() in case it helps. Ticket remains open for now.