Open jondegenhardt opened 6 years ago
I have to do some more focusing on how output works. I have been mainly focusing on input, as it's a more interesting, and difficult problem. Output should be very straightforward -- write to a buffer, when it gets full, flush it and reset it.
On top of that, std.format I think has some benefits to gain to having direct buffer access as well. So I may even need a clone of that that uses iopipes more efficiently :)
Ah, yes, makes sense that std.format
might benefit as well. There are a couple in-progress PRs that look related: phobos 5991, phobos 6060.
I wrote a simple version of a buffered stream for my tools: BufferedOutputRange. From a performance perspective it's quite effective.
@jondegenhardt Please let me know of any areas in Phobos where you think OutputRange
support is lacking/could be improved
Also, @jondegenhardt, I'd be curious to see your performance numbers for BufferedOutputRange
. If it makes a significant difference, I'd want to consider getting it into Phobos.
@JackStouffer Sure, I've got a few thoughts on output ranges at this point. Nothing that will be surprising or new, but I can provide my thoughts. Is Slack a good venue? @schveiguy might not want the discussion here.
As to performance - I've got some numbers from my own tools. Not as methodical as what I normally publish, but good enough to be clear about value. For tsv-select
(essentially unix cut
with field reordering), on narrow files, run time is cut in half. I ran against a set of 10 google ngram files, 200M lines, about 10-15 bytes per line. tsv-select -f 2,4
goes from 52 seconds to 25 seconds. On the wider (longer line) files in my standard benchmark suite against it's more like 5-10% gain. Note though that this is not a measurement of output time by itself, but against the full app. So, on wider files what dominates is input, so output gains as a percentage are smaller.
Another example is csv2tsv
, which has its run time cut by 1/3 on my standard benchmarks. I've tried buffering on a number of my tools at this point and am seeing similar patterns. The percentage gains depend on the nature of what's being output and what else the tool is doing, but these are not isolated cases.
Rather than use my tools, a standalone benchmark could be created, this would provide a better characterization. But, I've seen enough to conclude there is an opportunity, whether done via a buffered output range or making improvements inside the I/O layers themselves.
Update: Regarding the performance numbers: These were done on MacOS using ldc 1.5.0, -O -release -boundscheck=off
, and using ThinLTO applied to druntime, phobos, and the app code, also using PGO instrumented code.
I would note that this is really a consequence of the fact Phobos' File
type uses C FILE *
underneath. If you look at LockingTextWriter.put
, it basically calls fputc
for each byte you give it. This is an opaque call, not inlineable, can cause cache problems, and causes numerous checks/etc on every byte. Whereas if you are just writing a byte to a buffer, the cost is vastly less. The flush
is still doing an opaque call (ironically, just copying from your local buffer to the FILE * buffer), but for an entire buffer instead of a single byte.
I think iopipe can be faster than Phobos, simply because you aren't double-buffering. But as @jondegenhardt knows, I've been wrong in my predictions before :)
schveiguy might not want the discussion here.
Meh, I like reading these kinds of conversations :) I don't mind if you have it here, but certainly it may be more productive for you if you move it somewhere more live.
If you look at LockingTextWriter.put, it basically calls fputc for each byte you give it.
How about writeln
and friends? Until a couple days ago, my tools were using a mix of write/f/ln/fln
and LockingTextWriter.put
. When I converted to BufferedOutputRange
, I was converting some of each. I saw gains across the board, but didn't try to distinguish between the two.
writeln
should be fast if you are writing a string. If you are writing a range other than a string (e.g. a joiner
of strings for instance), then writeln
I think decays to fputc
(and BTW I'm pretty sure it auto-decodes).
It comes from the limitations of the FILE *
interface. It only accepts either a char array or a char.
Okay, that's consistent with the behavior I'm seeing. It's also explains why forms like:
writeln(inputRange.joiner(delim));
are slow. My preliminary numbers say joiner
is slow compared to a hand written loop (perhaps 2x), but it get even worse (perhaps 4x) when combined with writeln
. This is unfortunate because it should be the canonical way to code this form.
@JackStouffer General comments on OutputRanges
- First, OutputRanges
are great. I've been using them as described in a number of D articles on "component programming". The core data processing routines take an OutputRange
rather than writing directly to stdout
. The main
program typically passes stdout.lockingTextWriter
to these routines. Unit tests, however often pass an Appender
, or sometimes a LockingTextWriter
pointing to a temporary file.
The immediate benefit is that unit tests are far easier to write. The routines are more reusable as well. The difference is easy to see comparing my first D programs with later ones. However, there are a couple places that could be improved:
(A) put
has limited functionality compared to the write
/ writeln
/ writef
/ writefln
family. Code often ends up with a pair for put(line); put('\n')
; makes a direct call to format
, etc.
This is more verbose. The other part is that if you start out quickly prototyping a program using the write
family, converting to put
is a little annoying.
This is strictly an inconvenience, but my personal view is that it would increase the value of OutputRange
s quite a bit to add some equivalent functionality. The equivalent of putf
, putln
, putfln
, or something like that. It'd probably be necessary to create a new flavor of OutputRange, but there are already many flavors of ranges. Also, clearly here I'm talking about character oriented OutputRanges
. char
and ubyte
are the only type of OutputRange
I've used, but this could mark a meaningful differentiation between types of OutputRanges
.
(B) If the OutputRange
is actually pointing to a real external device, like stdout
, it's sometimes necessary to run an operation against that device. In my tools, the one that comes up is being able to flush the output at certain intervals. As an example, a command line tool that is filtering input lines and writing to stdout
, ala grep
, might want to ensure matched lines are written when matched. Now, LockingTextWriter
doesn't expose such functionality, and it's the only way I know to turn a File
object into an OutputRange
, so it's not just about the OutputRange
API. But, it would be useful to be able flush an OutputRange
if is actually attached to an external output device. There may be other useful opertions, but flush is the one that has surfaced for me.
I've had to design around this limitation in my tools. For example, write the core data processing algorithm at the level of a single line, then have the caller create and pass a stdout.lockingTextWriter
for every line. This has the effect of flushing to stdout
every line.
If you look at my BufferedOutputRange
struct I've design some flushing behavior into it. By default, checking the buffer size at newline boundaries, and if the buffer has reached a certain size, flushing it. However, the caller can call flush
directly. This is used for the filtering case, so the caller can ensure that flush to stdout
occurs at specific intervals.
it would be useful to be able flush an OutputRange
FYI: something similar (clear
) has been discussed before, but got stalled:
The equivalent of putf, putln, putfln, or something like that.
Admittedly a less verbose option could be good, but we do already have putf
:
import std.format;
outrange.formattedWrite("line %s: %s\n", lineNum, line);
Note that write[f]ln
is simply a wrapper for write[f](args, '\n')
. You could whip up all those functions in a few lines.
I will note also, I can vouch for the idea of using a more generic interface for functions. It's awesome in iopipe that all my functionality can be tested by pushing strings as iopipes instead of having to generate files.
being able to flush the output at certain intervals
I added the flush
function to iopipe output loops, but It's not defined as "official" whenever you have an output buffer to require this method (i.e. there's no isOutputIopipe
trait). I probably should formalize it...
The immediate benefit is that unit tests are far easier to write.
I never considered that. It's always been a pet peeve of mine how annoying it can be to restructure code in other languages to unit test console output.
put has limited functionality compared to the write / writeln / writef / writefln family. Code often ends up with a pair for put(line); put('\n'); makes a direct call to format, etc. The equivalent of putf, putln, putfln, or something like that. It'd probably be necessary to create a new flavor of OutputRange, but there are already many flavors of ranges.
Can't you use std.format.formattedWrite
?
Been thinking about the per character put
vs the string
put in relation to lockingTextWriter
and BufferedOutputRange
. I think the right design for a possible inclusion into Phobos is to design a BufferedOutputRange
such that it wraps any possible OutputRange
. Its template parameters would look like BufferedOutputRange!(OR, Allocator, buffSize)
such that an array is allocated of buffSize
using Allocator
which conforms to std.allocator
, and it calls std.range.put(underlyingOutRange, buff)
when buff
is full. It should also use RAII to auto deallocate the buffer. Thoughts?
Admittedly a less verbose option could be good, but we do already have putf:
Oh. I had no idea. So perhaps part of this is a matter of documentation. And perhaps whipping up a few of the other combos. :)
Can't you use std.format.formattedWrite?
I think so. Just had never looked at it, didn't know it wrote to an output range.
I think the right design for a possible inclusion into Phobos is to design a BufferedOutputRange such that it wraps any possible OutputRange. Its template parameters would look like BufferedOutputRange!(OR, Allocator, buffSize) such that an array is allocated of buffSize using Allocator which conforms to std.allocator, and it calls std.range.put(underlyingOutRange, buff) when buff is full.
I had mine take a File
as well, which needs to be open for write. If it's a file,write
is used rather than put
. I did this for two reasons. One is that LockingTextWriter
holds a lock it's whole lifetime. I was concerned this could lead to unexpected complications. Directly calling write
when flushing the internal buffer holds the lock only until the write completes. Second was something I'm unclear about, but write
on a File
seems to flush the output, whereas it's not clear that put
does. I'm not 100% sure about that though, more my recollection of the behavior.
It should also use RAII to auto deallocate the buffer.
Also to flush any content remaining in the buffer.
A couple other thoughts:
@MartinNowak - This is the thread on output buffering I mentioned.
I ran a benchmark using a simple buffering scheme on top of io
output to standard output. Quite an improvement over the unbuffered version. Don't know if there's an iopipe
angle to this or if it applies strictly to @MartinNowak's io
library, but the pair of iopipe
with buffered io
output is a winner, especially on skinny files.
Benchmark here: dcat-perf Issue 4
I've discovered that output to
stdout
can be materially sped up by buffering and writing in blocks (not sure about files). This is even true when using aLockingTextWriter
. I don't have a full analysis yet or a complete characterization of when this occurs. However, at this point technique has materially sped up the majority of my command line tools (tested on MacOS). Largest speedups occur writing short lines, but also when writing short blocks within a line. On the surface, it appears that buffering can be done in a way consistent with system buffering of stdout, so that interactivity is not unduly lost.So, there might be an opportunity for IOPipe to support buffered output that results in a meaningful performance win compared to current Phobos. What I'm looking into for my tools is creating a buffered output stream/range, one that would hold an underlying output sink where buffered writes are written to at appropriate intervals. Some of the speeds gains are quite significant, up to 50% in some cases, though 10-30% is more common.
A bit more background - It's looking like
std.algorithm.joiner
is unnecessarily slow. What started me looking at this was noticing thatstd.array.join
withwrite
tostdout
is about 4x faster thanjoiner
, despite thatjoin
allocates GC memory. (e.g.writeln(data.[join|joiner](delimiter))
). One reason is thatjoin
sends a single array towriteln
.writeln
in turn may have unnecessary overhead operating on ranges. Butjoiner
is also looks like it is materially slower than a hand written loop anyway.I'm managing to address both these issues with buffering, plus additional enhancements by multi-line buffering. This is why I don't have the details characterized. It'll take some work to separate out these interacting pieces.