guyharris commented 5 years ago

The most obvious bottleneck was in the pcap_dump() function within libpcap and actually the problem should be fixed within libpcap if you ask me, but I fixed it in DPDK temporally. Basically the pcap_dump() function does a fwrite(pcap header of 16Bytes) followed by a fwrite(packet payload) and these two fwrite(s) cause two kernel traps, which are expensive and should be minimized. The solution is to buffer the output to file and write seldom but more data at every write, so in essence do: if (buffer has room) do memcpy header and payload to buffer else fwrite buffer

That's not strictly true; what fwrite() does is, in essence:

if (buffer has room) do
  memcpy header and payload to buffer
else
  write buffer

so each call to fwrite() does not result in a single system call ("write buffer" is assumed to turn into a system call - write() on UN*X, _write() -> WriteFile() -> NtWriteFile() on modern Windows).

However, he may have used a larger buffer in the buffering wrapped around fwrite() than the standard I/O library was using. On Linux with glibc, and on BSD-flavored UN*Xes including macOS, the buffer size for a regular file will be the st_blksize value supplied by the file system for an fstat() (unless the file system fails to supply it and it's zero, in which case it's 8KiB with glibc and 1KiB on BSD-flavored UN*Xes). Other UN*Xes may also use st_blksize. On Windows, it's 4KiB.

st_blksize, in 4.2BSD, was the FFS block size of the file system, typically 8KiB. It appears to be 4MiB on APFS and 4KiB on HFS+ on High Sierra.

So what we might want to do is:

bufsize = something large (bigger than 8KiB, at least);
#ifdef WE_HAVE_ST_BLKSIZE_IN_STRUCT_STAT
if (fstat(fileno(the file), &statb) succeeds and statb.st_blksize > bufsize)
    bufsize = statb.st_blksize;
#endif
buffer = malloc(bufsize);
if (buffer == NULL)
    either fail completely or give up on increasing the buffer;
setvbuf(fileno, buffer, _IOFBF, bufsize);

The Single UNIX Specification doesn't guarantee that passing a non-zero buffer size and a null buffer pointer will cause setvbuf() to allocate a buffer of that size internally; the C99 spec says only that

If buf is not a null pointer, the array it points to may be used instead of a buffer allocated by the setvbuf function) and the argument size specifies the size of the array; otherwise, size may determine the size of a buffer allocated by the setvbuf function.

emphasis on "may". Therefore, we allocate it manually.

This is a bit tricky as it would require us to free the buffer on close; the C99 spec says of fclose() that

Whether or not the call succeeds, the stream is disassociated from the file and any buffer set by the setbuf or setvbuf function is disassociated from the stream (and deallocated if it was automatically allocated).

which seems to indicate that such a buffer won't be freed automatically on close as it was manually allocated outside the standard I/O library.

guyharris commented 5 years ago

He did use a larger buffer - 16KiB. That was probably on Linux; the default st_blksize on Linux appears to be the page size, so it was probably 4KiB. If a 16KiB fwrite() bypasses the buffer and just writes directly to the file, as a result of being a multiple of the buffer size, then that'd be a 4x reduction in the number of write() calls and thus a 4x reduction in the number of system calls, which might make a significant difference.

A bigger buffer would result in an even greater reduction, although I don't know whether a buffer that's close to or bigger than the L1 data cache (32K on most of Intel's x86-64 processors) would be a problem.

guyharris commented 5 years ago

As for an automatically-allocated buffer:

The Linux setvbuf() documentation, which presumably is describing the glibc version, "If the argument buf is NULL, only the mode is affected; a new buffer will be allocated on the next read or write operation.". The GNU libc documentation for setvbuf(), however, just says "If you specify a null pointer as the buf argument, then setvbuf allocates a buffer itself using malloc. This buffer will be freed when you close the stream. ... Otherwise, buf should be a character array that can hold at least size characters.", which doesn't explicitly indicate that the buffer allocated by setvbuf() will have size size.
The MUSL libc documenation just says that the official documentation is the C standard and the POSIX standard and that "This portion of the manual is incomplete. Future editions will document musl's behavior where the standards specify that it is implementation-defined, non-standard extensions musl implements, and additional properties of musl's implementation.", so they don't say one way or the other whether setvbuf() has been extended to support setting the buffer size without explicitly setting the buffer.
The Bionic libc documentation says nothing, but the Bionic setvbuf() source indicates that it - and probably the rest of the standard I/O library - is BSD-derived, so it probably implements the same "NULL argument plus non-zero size plus not unbuffered means a buffer of the specified size will be automatically allocated" behavior, and that's what the code appears to do.
The FreeBSD 3.0 setvbuf() documentation says "If the size argument is not zero but buf is NULL, a buffer of the given size will be allocated immediately, and released on close.".
The NetBSD 2.0, OpenBSD 2.2, DragonFly BSD, and macOS setvbuf() documentation say the same thing (not surprising, given that they probably all started with the 4.4-Lite libc, and that behavior probably dates back a long time for all those OSes).
The Visual Studio 2015 setvbuf() documentation says "The array pointed to by buffer is used as the buffer, unless it is NULL, in which case setvbuf uses an automatically allocated buffer of length size/2 * 2 bytes.".

However:

The Solaris 11.4 setvbuf() documentation just says "If buf is not the null pointer, the array it points to will be used for buffering, instead of an automatically allocated buffer. The size argument specifies the size of the buffer to be used. If input/output is unbuffered, buf and size are ignored.", which could be read as suggesting that if buf is the null pointer, and the stream is buffered, it automatically allocates a buffer of size bytes, but they don't unequivocally state that.
The AIX 5.3 setvbuf() documentation says "If the Buffer parameter is not a null character pointer, the array it points to is used for buffering. The Size parameter specifies the size of the array to be used as a buffer, but all of the Size parameter’s bytes are not necessarily used for the buffer area.", which could again be read as suggesting that a null Buffer argument, non-zero Size argument, and Mode argument other than _IONBF means that a buffer of the specified size will automatically be allocated, but they don't say it explicitly;
The HP-UX 11i v3 setvbuf() documentation says "If buf is not the NULL pointer, the array it points to is used for buffering instead of an automatically allocated buffer (from malloc()). size specifies the size of the buffer to be used. size specifies the size of the buffer to be used.", which again suggests that you can set the size of an automatically-allocated buffer but doesn't explicitly state it;
The Tru64 UNIX 5.1 setvbuf() documentation says "If the buffer parameter is not a null pointer, the array that the parameter points to is used for buffering instead of a buffer that is automatically allocated. The size parameter specifies the size of the buffer to be used." - you know the drill;
The IRIX 6.5 setvbuf() documentation says "If buf is the NULL pointer, or size is less than eight, space will be allocated to accommodate a buffer. This buffer will be of length BUFSIZ.", which means that 1) the good news is that a NULL buf means it'll automatically allocate a buffer but 2) the bad news is that, no matter what the value of size is, the buffer will be BUFSIZ bytes.

The System V Release 2 setvbuf() documentation says "If buf is not the NULL pointer, the array it points to will be used for buffering, instead of an automatically allocated buffer. Size specifies the size of the buffer to be used.", so that appears to be where the second group of UN*Xes - all SV-based, with a ton of BSD and vendor stuff added on top (Solaris being based on SVR4 which was a huge merge of SunOS 4.x into the SV code base, giving not only BSD stuff but Sun stuff) - got their man page text, so I guess some tomatoes should be thrown at AT&T for not making it clearer.

So, just for the lulz, let's see if we can find the SV source.

Here on archive.org there's a page that claims to have SV source from several releases; the tarball for SVR2 for NS32xxx (SysVr2.0-32000) has a version of setvbuf() with what appears to be the AT&T copyright notice from my days at Sun, and it...

...will, if the mode argument isn't _IONBF:

set the size to the minimum of the size argument and BUFSIZ;
if that's bigger thanBUFSIZ, or a buffer hasn't already been allocated, allocate a buffer of the specified size plus 8 bytes.

So I rather suspect that the `setvbuf() in any UN*X with an SV-derived stdio library - which probably means "all the traditional commercial UN*Xes that still survive, plus Tru64 and even IRIX" - will, if called early enough (i.e., before you do anything that needs a buffer), and passed a null buffer pointer and a non-zero buffer size, allocate a buffer of that size.

Any UN*X with a 4.4 BSD/4.4-Lite BSD-derived stdio library - which means the *BSDs, macOS/iOS/tvOS/watchOS, and, apparently, Android - will do so as well.

Unless the Linux documentation is lying, GNU libc's stdio library will do that.

The Windows libc will do that as well.

And, from a look at the MUSL source, it ignores the size argument if the buf argument is null.

So my inclination is to do the setvbuf() call; if it doesn't increase the buffer size, you lose, I guess, and should go yell at the supplier of your libc to make it work. At least the dumping APIs will still work, with the default buffer size, even if that means more system calls than you'd like.

(Given that a pcap_dumper_t * is just a FILE *, meaning we don't have some data structure that we control, we have no mechanism for getting at the buffer at close time to free it, so we have to have the automatically-allocated buffer.)

AndersBroman commented 5 years ago

Something like this? --- a/sf-pcap.c +++ b/sf-pcap.c @@ -771,7 +771,11 @@ pcap_setup_dump(pcap_t p, int linktype, FILE f, const char *fname) else setvbuf(f, NULL, _IONBF, 0);

endif

if (sf_write_header(p, f, linktype, p->snapshot) == -1) { +#define IO_BUF_SIZE (16 * 1024)
if (f != stdout) {
setvbuf(f, NULL, _IOFBF, IO_BUF_SIZE);
}
if (sf_write_header(p, f, linktype, p->snapshot) == -1) { pcap_fmt_errmsg_for_errno(p->errbuf, PCAP_ERRBUF_SIZE, errno, "Can't write to %s", fname); if (f != stdout)

Lekensteyn commented 5 years ago

With GNU libc (glibc), setvbuf(fp, NULL, _IOFBF, size) will ignore size if buf is NULL, see libio/iosetvbuf.c:


int _IO_setvbuf(FILE* fp, char* buf, int mode, size_t size) {
  int result;
  CHECK_FILE(fp, EOF);
  _IO_acquire_lock(fp);
  switch (mode) {
    case _IOFBF:
      fp->_flags &= ~(_IO_LINE_BUF | _IO_UNBUFFERED);
      if (buf == NULL) {
        if (fp->_IO_buf_base == NULL) {
          /* ... */
          if (_IO_DOALLOCATE(fp) < 0) {
            result = EOF;
            goto unlock_return;
          }
          fp->_flags &= ~_IO_LINE_BUF;
        }
        result = 0;
        goto unlock_return;
      }
      break;
      // ...
  }
  result = _IO_SETBUF(fp, buf, size) == NULL ? EOF : 0;

unlock_return:
  _IO_release_lock(fp);
  return result;
}

If buf is NULL, then a buffer is allocated via _IO_DOALLOCATE. That is a macro that will probably end up calling _IO_file_doallocate from libio/filedoalloc.c which uses BUFSIZE or (if positive and smaller), st.st_blksize. So this is probably not what you want, and an explicit buffer has to be given.

guyharris commented 5 years ago

Yeah, the Linux man page isn't as explicit as I'd like, but "If the argument *buf* is NULL, only the mode is affected" does suggest that the size isn't affected.

If an explicit buffer is given, we don't have any good way to free it in pcap_dump_close(), given that:

a pcap_dumper_t * is just another name for a FILE *, so we don't have any place to save a pointer to the buffer;
there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE * if pcap_dump_flush() isn't available, so it's risky to change what a pcap_dumper_t is;
there's no generally-available API to get a pointer to the buffer.

So the choices we have are:

don't do this on Linux;
develop new APIs for dumping, with a handle type different from pcap_dumper_t * that points to an opaque structure that includes the FILE * and a void * pointing to the buffer.

I'm inclined to go for the second option.

mcr commented 5 years ago

Guy Harris notifications@github.com wrote:

1 a pcap_dumper_t is just another name for a FILE , so we don't have any place to save a pointer to the buffer; 2 there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE

I guess this was a bad thing. Such code would fail to recompile, but of course it's an ABI change.

> * if pcap_dump_flush() isn't available, so it's risky to change what a
> pcap_dumper_t is; 3 there's no generally-available API to get a pointer
> to the buffer.

:-(

> So the choices we have are:

> 1 don't do this on Linux;
> 2 develop new APIs for dumping, with a handle
> type different from pcap_dumper_t * that points to an opaque structure
> that includes the FILE * and a void * pointing to the buffer.

> I'm inclined to go for the second option.

Me too. We are going to need it for pcapng.

-- ] Never tell me the odds! | ipv6 mesh networks [ ] Michael Richardson, Sandelman Software Works | IoT architect [ ] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

guyharris commented 5 years ago

We are going to need it for pcapng.

Yes. I wouldn't want to require pcapng to get the bigger stdio buffer, and I'm not sure whether we should have a single new API for writing both pcap and pcapng files, but either way we should have support for writing pcap files, not just pcapng files, with a bigger stdio buffer.

Lekensteyn commented 5 years ago

pcap_dumper_t is already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted from FILE * is an internal implementation detail. Are there any issues with converting this to an actual structure?

guyharris commented 5 years ago

pcap_dumper_t is already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted from FILE * is an internal implementation detail. Are there any issues with converting this to an actual structure?

To quote an earlier comment:

there's code out there that needs to, for example, do an fflush() on the file being written, and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE * if pcap_dump_flush() isn't available, so it's risky to change what a pcap_dumper_t is;

mcr commented 5 years ago

I'm unclear if there is still work to do here.

Oppen commented 5 years ago

I think not, and the lack of activity during 5 months backs me up. Can it be closed?

mcr commented 5 years ago

We are going to do the setvbuf() in case it helps. Ticket remains open for now.

the-tcpdump-group / libpcap

Should the stdio buffer size be increased for the dump code? #792

endif