schveiguy / iopipe

D language library for modular io
Boost Software License 1.0

Buffering output (possible enhancement) #17

Open jondegenhardt opened 6 years ago

jondegenhardt commented 6 years ago

I've discovered that output to stdout can be materially sped up by buffering and writing in blocks (not sure about files). This is true even when using a LockingTextWriter. I don't have a full analysis yet or a complete characterization of when this occurs. However, at this point the technique has materially sped up the majority of my command line tools (tested on MacOS). The largest speedups occur when writing short lines, but also when writing short blocks within a line. On the surface, it appears that buffering can be done in a way consistent with system buffering of stdout, so that interactivity is not unduly lost.

So, there might be an opportunity for IOPipe to support buffered output that results in a meaningful performance win compared to current Phobos. What I'm looking into for my tools is creating a buffered output stream/range, one that would hold an underlying output sink to which buffered writes are flushed at appropriate intervals. Some of the speed gains are quite significant, up to 50% in some cases, though 10-30% is more common.

A bit more background: it's looking like std.algorithm.joiner is unnecessarily slow. What started me looking at this was noticing that std.array.join with a write to stdout is about 4x faster than joiner, even though join allocates GC memory (e.g. writeln(data.[join|joiner](delimiter))). One reason is that join sends a single array to writeln; writeln in turn may have unnecessary overhead when operating on ranges. But joiner also looks materially slower than a hand-written loop anyway.

I'm managing to address both of these issues with buffering, plus getting additional gains from multi-line buffering. This is why I don't have the details characterized yet; it'll take some work to separate out these interacting pieces.

schveiguy commented 6 years ago

I have to spend some more time focusing on how output works. I have been mainly focusing on input, as it's the more interesting and difficult problem. Output should be very straightforward -- write to a buffer; when it gets full, flush it and reset it.
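Roughly, the flush-when-full idea could look like the sketch below. This is a hypothetical illustration, not iopipe's actual API; the type and names are made up, and it uses a plain File plus rawWrite as the sink:

    import std.stdio : File, stdout;

    struct TinyBufferedWriter
    {
        File sink;       // e.g. stdout
        char[] buffer;   // fixed-size scratch space
        size_t used;     // bytes currently held in the buffer

        void put(const(char)[] data)
        {
            // Flush first if the incoming data won't fit.
            if (used + data.length > buffer.length)
                flush();

            if (data.length > buffer.length)
                sink.rawWrite(data);   // larger than the buffer: write it directly
            else
            {
                buffer[used .. used + data.length] = data[];
                used += data.length;
            }
        }

        void flush()
        {
            if (used > 0)
            {
                sink.rawWrite(buffer[0 .. used]);
                used = 0;   // reset and reuse the same buffer
            }
        }
    }

    void main()
    {
        auto w = TinyBufferedWriter(stdout, new char[](64 * 1024));
        foreach (i; 0 .. 1_000_000)
            w.put("a short line\n");
        w.flush();   // write out whatever is left at the end
    }

The win comes from the flush being one large write call rather than one opaque call per byte or per small chunk.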

On top of that, I think std.format has some benefits to gain from having direct buffer access as well. So I may even need a clone of it that uses iopipes more efficiently :)

jondegenhardt commented 6 years ago

Ah, yes, makes sense that std.format might benefit as well. There are a couple in-progress PRs that look related: phobos 5991, phobos 6060.

I wrote a simple version of a buffered stream for my tools: BufferedOutputRange. From a performance perspective it's quite effective.

JackStouffer commented 6 years ago

@jondegenhardt Please let me know of any areas in Phobos where you think OutputRange support is lacking/could be improved

JackStouffer commented 6 years ago

Also, @jondegenhardt, I'd be curious to see your performance numbers for BufferedOutputRange. If it makes a significant difference, I'd want to consider getting it into Phobos.

jondegenhardt commented 6 years ago

@JackStouffer Sure, I've got a few thoughts on output ranges at this point. Nothing that will be surprising or new, but I can provide my thoughts. Is Slack a good venue? @schveiguy might not want the discussion here.

As to performance - I've got some numbers from my own tools. Not as methodical as what I normally publish, but good enough to be clear about the value. For tsv-select (essentially unix cut with field reordering), on narrow files, run time is cut in half. I ran against a set of 10 google ngram files, 200M lines, about 10-15 bytes per line. tsv-select -f 2,4 goes from 52 seconds to 25 seconds. On the wider (longer line) files in my standard benchmark suite it's more like a 5-10% gain. Note though that this is not a measurement of output time by itself, but of the full app. So, on wider files input dominates, and output gains as a percentage are smaller.

Another example is csv2tsv, which has its run time cut by 1/3 on my standard benchmarks. I've tried buffering on a number of my tools at this point and am seeing similar patterns. The percentage gains depend on the nature of what's being output and what else the tool is doing, but these are not isolated cases.

Rather than using my tools, a standalone benchmark could be created; that would provide a better characterization. But I've seen enough to conclude there is an opportunity, whether it's addressed via a buffered output range or via improvements inside the I/O layers themselves.

Update: Regarding the performance numbers: these were done on MacOS using ldc 1.5.0 with -O -release -boundscheck=off, with ThinLTO applied to druntime, phobos, and the app code, and using PGO.

schveiguy commented 6 years ago

I would note that this is really a consequence of the fact that Phobos' File type uses a C FILE * underneath. If you look at LockingTextWriter.put, it basically calls fputc for each byte you give it. This is an opaque call that isn't inlineable, can cause cache problems, and incurs numerous checks/etc. on every byte. Whereas if you are just writing a byte to a buffer, the cost is vastly less. The flush still makes an opaque call (ironically, just copying from your local buffer to the FILE * buffer), but for an entire buffer instead of a single byte.

I think iopipe can be faster than Phobos, simply because you aren't double-buffering. But as @jondegenhardt knows, I've been wrong in my predictions before :)

schveiguy commented 6 years ago

schveiguy might not want the discussion here.

Meh, I like reading these kinds of conversations :) I don't mind if you have it here, but certainly it may be more productive for you if you move it somewhere more live.

jondegenhardt commented 6 years ago

If you look at LockingTextWriter.put, it basically calls fputc for each byte you give it.

How about writeln and friends? Until a couple days ago, my tools were using a mix of write/writef/writeln/writefln and LockingTextWriter.put. When I converted to BufferedOutputRange, I was converting some of each. I saw gains across the board, but didn't try to distinguish between the two.

schveiguy commented 6 years ago

writeln should be fast if you are writing a string. If you are writing a range other than a string (e.g. a joiner of strings for instance), then writeln I think decays to fputc (and BTW I'm pretty sure it auto-decodes).

It comes from the limitations of the FILE * interface. It only accepts either a char array or a char.

jondegenhardt commented 6 years ago

Okay, that's consistent with the behavior I'm seeing. It also explains why forms like:

      writeln(inputRange.joiner(delim));

are slow. My preliminary numbers say joiner is slow compared to a hand-written loop (perhaps 2x), but it gets even worse (perhaps 4x) when combined with writeln. This is unfortunate because it should be the canonical way to write this.
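For reference, the three forms being compared look roughly like this (an illustrative sketch, not the actual benchmark code; the ratios quoted above will of course vary by data, platform, and compiler):

    import std.algorithm : joiner;
    import std.array : appender, join;
    import std.stdio : writeln;

    void main()
    {
        auto fields = ["alpha", "beta", "gamma"];
        auto delim = "\t";

        // 1. join: allocates one GC string, so writeln receives a single array.
        writeln(fields.join(delim));

        // 2. joiner: lazy, no allocation, but writeln walks it element by element.
        writeln(fields.joiner(delim));

        // 3. hand-written loop: build one buffer, then a single writeln.
        auto buf = appender!(char[]);
        foreach (i, f; fields)
        {
            if (i > 0) buf.put(delim);
            buf.put(f);
        }
        writeln(buf.data);
    }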

jondegenhardt commented 6 years ago

@JackStouffer General comments on OutputRanges - First, OutputRanges are great. I've been using them as described in a number of D articles on "component programming". The core data processing routines take an OutputRange rather than writing directly to stdout. The main program typically passes stdout.lockingTextWriter to these routines. Unit tests, however, often pass an Appender, or sometimes a LockingTextWriter pointing to a temporary file.
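A minimal sketch of that pattern (hypothetical names, not code from my tools): the processing routine is templated on any char output range, production code hands it stdout.lockingTextWriter, and the unit test hands it an appender.

    import std.range.primitives : isOutputRange, put;

    void emitFields(OR)(auto ref OR output, string[] fields, char delim)
        if (isOutputRange!(OR, char))
    {
        foreach (i, f; fields)
        {
            if (i > 0) put(output, delim);
            put(output, f);
        }
        put(output, '\n');
    }

    unittest
    {
        import std.array : appender;
        auto app = appender!(char[]);
        emitFields(app, ["a", "b", "c"], '\t');
        assert(app.data == "a\tb\tc\n");
    }

    void main()
    {
        import std.stdio : stdout;
        auto writer = stdout.lockingTextWriter;
        emitFields(writer, ["col1", "col2"], '\t');
    }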

The immediate benefit is that unit tests are far easier to write. The routines are more reusable as well. The difference is easy to see comparing my first D programs with later ones. However, there are a couple places that could be improved:

(A) put has limited functionality compared to the write / writeln / writef / writefln family. Code often ends up with a pair of calls like put(line); put('\n');, or makes a direct call to format, etc.

This is more verbose. The other part is that if you start out quickly prototyping a program using the write family, converting to put is a little annoying.

This is strictly an inconvenience, but my personal view is that adding some equivalent functionality would increase the value of OutputRanges quite a bit. The equivalent of putf, putln, putfln, or something like that. It'd probably be necessary to create a new flavor of OutputRange, but there are already many flavors of ranges. Also, clearly I'm talking here about character-oriented OutputRanges. char and ubyte are the only types of OutputRange I've used, but this could mark a meaningful differentiation between types of OutputRanges.

(B) If the OutputRange is actually pointing to a real external device, like stdout, it's sometimes necessary to run an operation against that device. In my tools, the one that comes up is being able to flush the output at certain intervals. As an example, a command line tool that is filtering input lines and writing to stdout, a la grep, might want to ensure matched lines are written as soon as they are matched. Now, LockingTextWriter doesn't expose such functionality, and it's the only way I know to turn a File object into an OutputRange, so it's not just about the OutputRange API. But it would be useful to be able to flush an OutputRange if it is actually attached to an external output device. There may be other useful operations, but flush is the one that has surfaced for me.

I've had to design around this limitation in my tools. For example, write the core data processing algorithm at the level of a single line, then have the caller create and pass a stdout.lockingTextWriter for every line. This has the effect of flushing to stdout every line.

If you look at my BufferedOutputRange struct, you'll see I've designed some flushing behavior into it. By default, it checks the buffer size at newline boundaries and flushes if the buffer has reached a certain size. However, the caller can also call flush directly. This is used for the filtering case, so the caller can ensure that a flush to stdout occurs at specific intervals.
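Roughly, the default policy is along these lines (an illustrative sketch only, not the actual BufferedOutputRange code; the type name and threshold are made up):

    import std.stdio : File, stdout;

    struct NewlineFlushWriter
    {
        File sink;
        size_t flushThreshold;
        char[] buf;

        void put(const(char)[] data)
        {
            buf ~= data;
            // Only consider flushing at a newline boundary, and only once the
            // buffer has grown past the threshold.
            if (data.length > 0 && data[$ - 1] == '\n' && buf.length >= flushThreshold)
                flush();
        }

        // The caller can also force a flush, e.g. after each matched line in a
        // grep-like filter.
        void flush()
        {
            if (buf.length > 0)
            {
                sink.rawWrite(buf);
                buf.length = 0;
                buf.assumeSafeAppend();   // reuse the existing allocation
            }
        }
    }

    void main()
    {
        auto w = NewlineFlushWriter(stdout, 10 * 1024);
        foreach (i; 0 .. 100_000)
            w.put("a line of output\n");
        w.flush();   // push out the tail
    }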

wilzbach commented 6 years ago

it would be useful to be able to flush an OutputRange

FYI: something similar (clear) has been discussed before, but got stalled:

https://github.com/dlang/phobos/pull/3362

schveiguy commented 6 years ago

The equivalent of putf, putln, putfln, or something like that.

Admittedly a less verbose option could be good, but we do already have putf:

import std.format;
outrange.formattedWrite("line %s: %s\n", lineNum, line);

Note that write[f]ln is simply a wrapper for write[f](args, '\n'). You could whip up all those functions in a few lines.
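For example, hypothetical putln/putfln helpers (names made up; these are not existing Phobos functions) can be layered on put and formattedWrite in a few lines:

    import std.format : formattedWrite;
    import std.range.primitives : put;

    // Write a value followed by a newline to any output range.
    void putln(OR, T)(ref OR r, T value)
    {
        put(r, value);
        put(r, '\n');
    }

    // Formatted write followed by a newline.
    void putfln(OR, Args...)(ref OR r, string fmt, Args args)
    {
        r.formattedWrite(fmt, args);
        put(r, '\n');
    }

    unittest
    {
        import std.array : appender;
        auto app = appender!string;
        app.putln("hello");
        app.putfln("line %s: %s", 1, "world");
        assert(app.data == "hello\nline 1: world\n");
    }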

schveiguy commented 6 years ago

I will also note that I can vouch for the idea of using a more generic interface for functions. It's awesome in iopipe that all my functionality can be tested by pushing strings as iopipes instead of having to generate files.

schveiguy commented 6 years ago

being able to flush the output at certain intervals

I added the flush function to iopipe output loops, but it's not defined as "official" -- there's no isOutputIopipe trait requiring the method whenever you have an output buffer. I probably should formalize it...

JackStouffer commented 6 years ago

The immediate benefit is that unit tests are far easier to write.

I never considered that. It's always been a pet peeve of mine how annoying it can be to restructure code in other languages to unit test console output.

put has limited functionality compared to the write / writeln / writef / writefln family. Code often ends up with a pair of calls like put(line); put('\n');, or makes a direct call to format, etc. The equivalent of putf, putln, putfln, or something like that. It'd probably be necessary to create a new flavor of OutputRange, but there are already many flavors of ranges.

Can't you use std.format.formattedWrite?

Been thinking about the per-character put vs the string put in relation to lockingTextWriter and BufferedOutputRange. I think the right design for a possible inclusion into Phobos is a BufferedOutputRange that wraps any possible OutputRange. Its template parameters would look like BufferedOutputRange!(OR, Allocator, buffSize), such that an array of buffSize is allocated using an Allocator conforming to std.allocator, and it calls std.range.put(underlyingOutRange, buff) when buff is full. It should also use RAII to automatically deallocate the buffer. Thoughts?
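Something along those lines might look roughly like the following. This is a hypothetical sketch of the proposed shape, not a finished design; it assumes a stateless std.experimental.allocator allocator (such as Mallocator) and glosses over growth, error handling, and non-char element types:

    import std.experimental.allocator : makeArray, dispose;
    import std.experimental.allocator.mallocator : Mallocator;
    import std.range.primitives : isOutputRange, put;

    struct BufferedOutputRange(OR, Allocator = Mallocator, size_t buffSize = 16 * 1024)
        if (isOutputRange!(OR, char))
    {
        private OR sink;        // the wrapped output range
        private char[] buff;    // allocated from Allocator, freed in the destructor
        private size_t used;

        @disable this(this);    // no copies, so the buffer is freed exactly once

        this(OR sink)
        {
            this.sink = sink;
            buff = Allocator.instance.makeArray!char(buffSize);
        }

        ~this()
        {
            if (buff !is null)
            {
                flush();        // push out any tail before freeing
                Allocator.instance.dispose(buff);
                buff = null;
            }
        }

        void put(const(char)[] data)
        {
            foreach (c; data)
            {
                if (used == buff.length)
                    flush();
                buff[used++] = c;
            }
        }

        void flush()
        {
            if (used > 0)
            {
                // Forward the whole buffer to the wrapped range in one call.
                .put(sink, buff[0 .. used]);
                used = 0;
            }
        }
    }

    unittest
    {
        import std.array : appender;
        auto app = appender!(char[]);
        {
            auto bw = BufferedOutputRange!(typeof(app))(app);
            bw.put("hello, ");
            bw.put("world\n");
        }   // destructor flushes and releases the buffer here
        assert(app.data == "hello, world\n");
    }

One design note in the sketch: copying is disabled so the destructor frees the buffer exactly once, and flush forwards a whole slice, so the wrapped range gets one large put call per flush instead of many single-character ones.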

jondegenhardt commented 6 years ago

Admittedly a less verbose option could be good, but we do already have putf:

Oh. I had no idea. So perhaps part of this is a matter of documentation. And perhaps whipping up a few of the other combos. :)

jondegenhardt commented 6 years ago

Can't you use std.format.formattedWrite?

I think so. Just had never looked at it, didn't know it wrote to an output range.

jondegenhardt commented 6 years ago

I think the right design for a possible inclusion into Phobos is to design a BufferedOutputRange such that it wraps any possible OutputRange. Its template parameters would look like BufferedOutputRange!(OR, Allocator, buffSize) such that an array is allocated of buffSize using Allocator which conforms to std.allocator, and it calls std.range.put(underlyingOutRange, buff) when buff is full.

I had mine take a File as well, which needs to be open for write. If it's a File, write is used rather than put. I did this for two reasons. One is that LockingTextWriter holds a lock for its whole lifetime; I was concerned this could lead to unexpected complications. Directly calling write when flushing the internal buffer holds the lock only until the write completes. The second was something I'm unclear about: write on a File seems to flush the output, whereas it's not clear that put does. I'm not 100% sure about that though; it's more my recollection of the behavior.

It should also use RAII to auto deallocate the buffer.

Also to flush any content remaining in the buffer.

jondegenhardt commented 6 years ago

A couple other thoughts:

jondegenhardt commented 6 years ago

@MartinNowak - This is the thread on output buffering I mentioned.

jondegenhardt commented 5 years ago

I ran a benchmark using a simple buffering scheme on top of io output to standard output. Quite an improvement over the unbuffered version. I don't know if there's an iopipe angle to this or if it applies strictly to @MartinNowak's io library, but pairing iopipe with buffered io output is a winner, especially on skinny files.

Benchmark here: dcat-perf Issue 4