mkodekar / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

Performance tweak for CharSink.write(CharSequence) and related methods #1853

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi there. I set out to replace a little utility with:

  Files.write(CharSequence, File, Charset)

But a quick microbenchmark found Guava's method to be slower on sustained 
writes of large Strings to individual files (e.g. about 33% longer for 16 MB 
strings). Results targeting HDD and SSD were similar. Surprisingly, peak memory 
was heavily impacted too. (I guess Java uses a big in-process write buffer 
somewhere.)

Here's a tweak in CharSink which I think will give memory & speed improvements 
to the Files method above (and maybe others):

   public void write(CharSequence charSequence) throws IOException {
     checkNotNull(charSequence);

     Closer closer = Closer.create();
     try {
-      Writer out = closer.register(openStream());
+      Writer out = closer.register(
+          charSequence.length() > 8192 ? openBufferedStream() : openStream());
       out.append(charSequence);
     //...

In the case of OutputStreamWriter / FileWriter, this change ultimately calls 
into StreamEncoder.write(char[], int, int), which can use BufferedWriter's char 
buffer directly, rather than StreamEncoder.write(String, int, int), which makes 
a copy of the entire string.

The default implementation in Writer also gets the same benefit when wrapped by 
BufferedWriter, so I think this will benefit a range of Writer classes, not 
just file writers.
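The copy that the unbuffered path makes is easy to observe from outside. Here is a minimal, hypothetical sketch (the recording Writer below is mine, standing in for StreamEncoder; it is not part of Guava or the JDK) that measures the largest char[] chunk a downstream Writer receives, with and without a BufferedWriter in between:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;

public class CopyPathDemo {
    // Records the largest char[] chunk handed to write(char[],int,int).
    // Hypothetical class, standing in for StreamEncoder.
    static class ChunkRecordingWriter extends Writer {
        int largest = 0;
        @Override public void write(char[] cbuf, int off, int len) {
            largest = Math.max(largest, len);
        }
        @Override public void flush() {}
        @Override public void close() {}
    }

    // Returns the largest chunk seen when appending a 1M-char string,
    // either directly or through a BufferedWriter.
    static int largestChunk(boolean buffered) throws IOException {
        String big = new String(new char[1 << 20]); // 1,048,576 chars
        ChunkRecordingWriter sink = new ChunkRecordingWriter();
        try (Writer w = buffered ? new BufferedWriter(sink) : sink) {
            // Writer.append(CharSequence) funnels into write(String), which
            // copies the whole string into one char[] before delegating.
            w.append(big);
        }
        return sink.largest;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("direct:   " + largestChunk(false)); // whole string, one copy
        System.out.println("buffered: " + largestChunk(true));  // 8192-char chunks
    }
}
```

The direct path hands the sink a single chunk the size of the entire string (i.e. a full copy was made), while the buffered path never passes along more than BufferedWriter's 8192-char internal buffer.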

I couldn't test the proposed tweak directly, so here are my actual tests, which 
appear to support the theory:

Harness:

    String TEXT = Strings.repeat(
            "12345678901234567890123456789012345678901234567890123456789012345678901234567890",
            200_000);

    for (int i = 0; i < 40; i++) {
        long t0 = System.nanoTime();
        File out = new File("T:\\tmp\\i" + i);
        doWrite(out);
        System.out.println((System.nanoTime() - t0) / 1_000_000f);
    }

With these implementations of doWrite:

  Test 1: 44ms per file, 1590 MB peak memory**

        Files.write(TEXT, out, Charset.defaultCharset());

  Test 2: 44ms per file, 1590 MB peak memory

        try (Writer w = new FileWriter(out)) {
            w.append(TEXT);
            w.flush();
        }

  Test 3: 32ms per file, 366 MB peak memory

        try (Writer w = new BufferedWriter(new FileWriter(out))) {
            w.append(TEXT);
            w.flush();
        }

Notes: 
 * Tested using 32-bit Java 8 running on 64-bit Windows on an i7-860
 * 8192 matches BufferedWriter's default buffer size; maybe it can be tuned further. Or maybe always using a BufferedWriter would be sufficient.
 * A similar tweak may be applicable to ByteSink? (Not tried.)

** memory = "Peak Private Bytes" according to Sysinternals' Process Explorer.

Original issue reported on code.google.com by luke.ush...@gmail.com on 23 Sep 2014 at 1:37

GoogleCodeExporter commented 9 years ago
Interesting... my expectation would be that a Writer implementation, given a 
single large String, should be able to handle it as efficiently as possible 
since it has all the cards so to speak. I imagine BufferedWriter to be more 
useful for collecting many small writes.

But it looks like, regardless of how large a chunk of characters you hand the 
Writer, an OutputStreamWriter will only write out 8192 bytes at a time, so 
copying the large String's char array is basically just a big waste of time. 
It's actually really surprising to me that StreamEncoder doesn't handle this 
differently... it could just wrap the String in a CharBuffer and use that with 
CharsetEncoder without needing to copy the char array at all. Maybe there's 
some good reason for that I'm not aware of.
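The copy-free alternative mentioned above can be sketched roughly like this (a hypothetical standalone method, not StreamEncoder's actual code): CharBuffer.wrap(String) creates a read-only view over the string, so the encoder consumes characters without a char[] copy of the input ever being made, and only a small fixed-size byte buffer is allocated regardless of input size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class WrapEncodeDemo {
    // Hypothetical sketch: encode a String straight from a CharBuffer view,
    // draining through a small scratch buffer instead of copying the chars.
    static void encodeTo(String s, OutputStream os, CharsetEncoder enc)
            throws IOException {
        CharBuffer in = CharBuffer.wrap(s);         // a view, not a copy
        ByteBuffer out = ByteBuffer.allocate(8192); // fixed-size scratch buffer
        while (true) {
            CoderResult cr = enc.encode(in, out, true);
            if (cr.isUnderflow()) break;            // all input consumed
            if (cr.isOverflow()) {                  // scratch buffer full: drain it
                os.write(out.array(), 0, out.position());
                out.clear();
            } else {
                cr.throwException();                // malformed/unmappable input
            }
        }
        while (enc.flush(out).isOverflow()) {       // drain any pending encoder state
            os.write(out.array(), 0, out.position());
            out.clear();
        }
        os.write(out.array(), 0, out.position());   // final partial buffer
    }

    public static void main(String[] args) throws IOException {
        String text = "héllo ".repeat(10_000); // large enough to overflow the 8 KB buffer
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        encodeTo(text, baos, StandardCharsets.UTF_8.newEncoder());
        System.out.println(baos.size() == text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```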

Anyway, this does look like a good potential solution.

> A similar tweak may be applicable to ByteSink? (Not tried.)

This seems less likely since byte arrays don't need to go through encoding but 
can just be passed directly to a native method to be written.

Original comment by cgdecker@google.com on 23 Sep 2014 at 4:02

GoogleCodeExporter commented 9 years ago
> an OutputStreamWriter will only write out 8192 bytes at a time (...)

Huh, so it does. I missed that. 

> byte arrays don't need to go through encoding but can just be passed directly 
to a native method to be written

Indeed. Sorry, I should have taken a few minutes to check that.

> It's actually really surprising to me that StreamEncoder doesn't handle this 
differently (...) Maybe there's some good reason for that I'm not aware of.

I agree. Possibly because StreamEncoder is itself a Writer, and the JDK Writers 
seem to follow this (undocumented?) pattern:
 1. call write(char[], int, int) to do the real work
 2. only call it once

(1) makes some sense when there are lots of overridden and overridable methods 
in a deep hierarchy. (But maybe it hints at a yearning for a better design?)

(2) could be concern about per-call costs in derived classes (multiply-wrapped 
Writers, synchronisation, class CarrierPidgeonIpWriter {handy in the days of 
NSA snooping}, etc.), or just trying to be friendlier to overriding classes?
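That two-part pattern is easy to verify with a toy Writer (a hypothetical class, for illustration only): with the JDK's default Writer implementations, append(CharSequence) and write(String) each funnel into exactly one write(char[], int, int) call, no matter how large the input is.

```java
import java.io.IOException;
import java.io.Writer;

public class WritePatternDemo {
    // Counts how many times the "real work" method is invoked.
    static class CountingWriter extends Writer {
        int calls = 0;
        @Override public void write(char[] cbuf, int off, int len) { calls++; }
        @Override public void flush() {}
        @Override public void close() {}
    }

    public static void main(String[] args) throws IOException {
        CountingWriter w = new CountingWriter();
        w.append("a CharSequence of arbitrary size");  // default Writer.append
        w.write("a String, also of arbitrary size");   // default Writer.write(String)
        // Each high-level call above became exactly one write(char[],int,int).
        System.out.println(w.calls); // prints 2
    }
}
```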

Anyway, I'd still rate performance as a higher concern for StreamEncoder than 
these other things... but no point crying over Java's spilt milk, I guess. It 
ain't going to help ;-)

Original comment by luke.ush...@gmail.com on 23 Sep 2014 at 7:35

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<issue id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:08
