google / log4jscanner

A log4j vulnerability filesystem scanner and Go package for analyzing JAR files.
Apache License 2.0

jar: reduce redundant allocations and copies (2x-3x speedup) #57

Closed aktau closed 2 years ago

aktau commented 2 years ago

The io.ReadAll() call is responsible for a bit more than half the allocations observed in a very highly scaled up instance of this [1]. On the local testdata, these commits had quite the positive effect. I ran go test inside the jar/ subdirectory too, and it passed.

Benchmarks:

$ for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
    git checkout $commit
    go build
    hyperfine -i './log4jscanner jar/testdata'
    rm log4jscanner
  done

HEAD is now at https://github.com/aktau/log4jscanner/commit/9d3aceeff48dc332d314d938ef6f3ed7837dfc1e Document commit that removed InitialContext
  Time (mean ± σ):      1.754 s ±  0.260 s    [User: 1.711 s, System: 1.264 s]
  Range (min … max):    1.411 s …  2.182 s    10 runs

HEAD is now at https://github.com/aktau/log4jscanner/commit/2932093ea7873a0df400ec12980ed6206b120ad7 jar: close the zip.File reader before recursing
  Time (mean ± σ):      1.761 s ±  0.312 s    [User: 1.778 s, System: 1.124 s]
  Range (min … max):    1.271 s …  2.239 s    10 runs

HEAD is now at https://github.com/aktau/log4jscanner/commit/f6c1ebcfb928f724f174e9c8ccd818534840a10a jar: prefer io.ReadFull over io.ReadAll
  Time (mean ± σ):     444.8 ms ±  95.3 ms    [User: 353.5 ms, System: 122.3 ms]
  Range (min … max):   337.2 ms … 683.2 ms    10 runs

HEAD is now at 5a28952 jar: reuse buffers for nested .jar's
  Time (mean ± σ):     398.0 ms ±  44.7 ms    [User: 330.6 ms, System: 89.9 ms]
  Range (min … max):   339.7 ms … 471.1 ms    10 runs

[1]:

(pprof) list checkJAR
Total: 213.06GB
         ...
         .     9.89MB    334:           f, err := zf.Open()
         .          .    335:           if err != nil {
         .          .    336:                   return fmt.Errorf("open file %s: %v", p, err)
         .          .    337:           }
         .          .    338:           defer f.Close()
         .   169.96GB    339:           data, err := io.ReadAll(f)
         .          .    340:           if err != nil {
         .          .    341:                   return fmt.Errorf("read file %s: %v", p, err)
         .          .    342:           }
         .          .    343:           br := bytes.NewReader(data)
         .     3.40GB    344:           r2, err := zip.NewReader(br, br.Size())
         ...

(pprof) list ReadAll
Total: 213.06GB
ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go
  113.40GB   169.96GB (flat, cum) 79.77% of Total
         .          .    638:func ReadAll(r Reader) ([]byte, error) {
         .          .    639:   b := make([]byte, 0, 512)
         .          .    640:   for {
         .          .    641:           if len(b) == cap(b) {
         .          .    642:                   // Add more capacity (let append pick how much).
  113.40GB   113.40GB    643:                   b = append(b, 0)[:len(b)]
         .          .    644:           }
         .    56.56GB    645:           n, err := r.Read(b[len(b):cap(b)])
         .          .    646:           b = b[:len(b)+n]
         .          .    647:           if err != nil {
         .          .    648:                   if err == EOF {
         .          .    649:                           err = nil
         .          .    650:                   }
google-cla[bot] commented 2 years ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

aktau commented 2 years ago

I'm not sure why this wants me to sign a CLA. This is from aktau@google.com.

UPDATE: I added my corp email to my GH account now. I wonder how I can make this check run again.

ericchiang commented 2 years ago

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

$ git checkout upstream/main
$ go test -run=NONE -bench=. ./... > old.txt
$ git checkout aktau/main
$ go test -run=NONE -bench=. ./... > new.txt
$ benchstat old.txt new.txt 
name             old time/op    new time/op    delta
Parse-8            16.7µs ± 0%    16.8µs ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.97µs ± 0%    4.97µs ± 0%   ~     (p=1.000 n=1+1)

name             old alloc/op   new alloc/op   delta
Parse-8            4.84kB ± 0%    4.84kB ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.92kB ± 0%    4.93kB ± 0%   ~     (p=1.000 n=1+1)

name             old allocs/op  new allocs/op  delta
Parse-8              31.0 ± 0%      31.0 ± 0%   ~     (all equal)
ParseParallel-8      31.0 ± 0%      31.0 ± 0%   ~     (all equal)
aktau commented 2 years ago

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

Indeed. Even with 10 runs per test we get this:

$ git checkout main ; for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do git checkout $commit && ( cd ./jar && go test -run=NONE -bench=. -count=10 | tee ~/tmp/$commit.bench ) ; done
$ benchstat 9d3aceeff48dc332d314d938ef6f3ed7837dfc1e.bench 61ade2d03f307bde6b98ed513f68f01afb9421c9.bench
name             old time/op    new time/op    delta
Parse-4            38.5µs ± 9%    39.5µs ± 9%     ~     (p=0.408 n=7+9)
ParseParallel-4    36.4µs ±23%    31.6µs ±12%  -13.42%  (p=0.035 n=10+9)

name             old alloc/op   new alloc/op   delta
Parse-4            4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.671 n=8+10)
ParseParallel-4    4.94kB ± 0%    4.93kB ± 0%     ~     (p=0.381 n=10+10)

name             old allocs/op  new allocs/op  delta
Parse-4              33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel-4      33.0 ± 0%      33.0 ± 0%     ~     (all equal)

However, since we do see consistent runtime improvements when running ./log4jscanner jar/testdata, this suggests the benchmarks are not representative of the data inside testdata, or of the data we see in production.

Indeed, given the codepath I'm changing, I'd expect only nested jars to benefit (and it seems like this is fairly common in production).

To verify, I ran all the jars in testdata individually:

$ git checkout main
TMPDIR="$HOME/tmp/tmpdir"
mkdir "$TMPDIR" || true
for file in jar/testdata/* ; do
  RTMPDIR="$TMPDIR/$(basename $file)"
  mkdir "$RTMPDIR" || true
  ln -fv "$PWD/$file" "$RTMPDIR"
done
for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
  git checkout $commit
  go build
  dirs="$(echo $TMPDIR/* | tr ' ' ',')"
  hyperfine --ignore-failure --warmup 1 --parameter-list dir "$dirs" './log4jscanner {dir}'
  rm log4jscanner
done
 ...
Summary
  './log4jscanner /home/aktau/tmp/tmpdir/400mb.jar' ran
    1.12 ± 1.12 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar.patched'
    1.13 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/notarealjar.jar'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydir.zip'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.signed.jar'
    1.17 ± 1.18 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corruptjar.go'
    1.18 ± 1.15 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar.patched'
    1.18 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar.patched'
    1.18 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class-executable'
    1.19 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar.patched'
    1.20 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt.jar'
    1.21 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar.patched'
    1.23 ± 1.44 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar.patched'
    1.24 ± 1.39 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar.patched'
    1.24 ± 1.22 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.signed.jar'
    1.24 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar.patched'
    1.25 ± 1.25 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.jar'
    1.25 ± 1.30 times faster than './log4jscanner /home/aktau/tmp/tmpdir/generate.sh'
    1.28 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld-executable'
    1.34 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar.jar'
    1.38 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt_jar_in_jar.jar'
    1.41 ± 1.35 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar'
    1.43 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_with_invalid_jar.jar'
    1.48 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar'
    1.52 ± 1.52 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.jar'
    1.52 ± 1.42 times faster than './log4jscanner /home/aktau/tmp/tmpdir/zipbombs'
    1.55 ± 1.50 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar_in_jar.jar'
    1.59 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.0-beta9.jar'
    1.62 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar'
    1.63 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/selenium-api-3.141.59.jar'
    1.76 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar'
    1.78 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/similarbutnotvuln.jar'
    1.86 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.2.jar'
    1.86 ± 1.63 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar'
    1.94 ± 2.04 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar'
    1.97 ± 1.73 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar'
    2.01 ± 1.72 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.16.0.jar'
    2.76 ± 2.28 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar'
    3.33 ± 2.80 times faster than './log4jscanner /home/aktau/tmp/tmpdir/shadow-6.1.0.jar'
    7.64 ± 7.06 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar'
    9.34 ± 7.40 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar'
   13.78 ± 10.56 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydirs.zip'
  326.03 ± 262.82 times faster than './log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar'

This is somewhat curious: I'd not have expected 400mb.jar to be the fastest thing to run. But individually re-running it confirms the result.

Most of these invocations take so little time that it's hard to show statistically significant improvements. So we'll just focus on the four slowest ones (next comment).

aktau commented 2 years ago

(Actually, let's just focus on the slowest one. I've verified that the others don't change in speed over the 4 last commits.)

The results for 400mb_jar_in_jar, which is the only case where runtime is noticeably changed:

Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.484 s ±  0.326 s    [User: 1.448 s, System: 0.999 s]
  Range (min … max):    1.119 s …  2.087 s    10 runs

Previous HEAD position was 9d3acee Document commit that removed InitialContext
HEAD is now at 2932093 jar: close the zip.File reader before recursing
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.542 s ±  0.174 s    [User: 1.387 s, System: 1.076 s]
  Range (min … max):    1.245 s …  1.739 s    10 runs

Previous HEAD position was 2932093 jar: close the zip.File reader before recursing
HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     441.6 ms ±  85.7 ms    [User: 355.1 ms, System: 112.7 ms]
  Range (min … max):   318.3 ms … 578.1 ms    10 runs

Previous HEAD position was f6c1ebc jar: prefer io.ReadFull over io.ReadAll
HEAD is now at 61ade2d jar: reuse buffers for nested .jar's
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     340.7 ms ±  57.9 ms    [User: 269.7 ms, System: 78.8 ms]
  Range (min … max):   298.8 ms … 481.4 ms    10 runs

So I added a benchmark to test 400mb_jar_in_jar specifically, and the results now do show an improvement (first versus last commit):

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 37376ef3d7fcb898fd5e192d12a2afe4718ffc33.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.27s ±10%  -78.51%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       38.0µs ± 9%    43.4µs ±40%  +14.47%  (p=0.006 n=9+9)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.20s ±32%  -85.39%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    61.4µs ±11%  +86.03%  (p=0.000 n=9+9)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%   0.05GB ±108%  -97.86%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.810 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%   0.02GB ±336%  -99.18%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%   -0.08%  (p=0.032 n=9+9)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        28 ± 8%  -72.43%  (p=0.000 n=10+10)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        39 ± 9%  -78.53%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

And versus just the io.ReadAll -> io.ReadFull CL:

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 4b23cd3db5400b191c5310c07415848bc0a2bb6d.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.30s ±17%  -76.18%  (p=0.000 n=10+9)
Parse/safe1.jar-4                       38.0µs ± 9%    40.3µs ± 8%   +6.15%  (p=0.028 n=9+10)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.35s ±45%  -74.81%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    32.2µs ±15%     ~     (p=0.243 n=9+10)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.463 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%     ~     (p=0.159 n=9+10)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        30 ± 0%  -70.67%  (p=0.000 n=10+6)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        52 ±17%  -71.10%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

This indicates that the speed regression in the safe1.jar benchmark may be due to the garbage collector dealing with allocations from the other benchmarks [1].

[1]: Notice how some runs after the sync.Pool introduction allocate ~nothing:

BenchmarkParse/400mb_jar_in_jar.jar-4                  4         299414355 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         249801048 ns/op           11801 B/op         26 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         269324904 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         262044230 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         259753494 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         277134852 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         279076490 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         263421624 ns/op        83908649 B/op         28 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         264917986 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         286430956 ns/op        104883246 B/op        30 allocs/op

The pool is not cleared between benchmark runs.

ericchiang commented 2 years ago

Okay, taking a step back...

A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term. Particularly since our internal response has wound down, and we have fewer cycles to work on this project :) We already hit performance oddities with this parser in the past, and I don't want to have to hunt down weird behavior or bugs.

What's the motivation here for this performance work? Are you hitting a bottleneck with some use case? Or is this strictly about improving performance for its own sake? If it's the latter, I'm inclined to leave this as is.

aktau commented 2 years ago

A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term.

Excluding comments, it's a tiny amount of LoC. At first I had it as a simple if-check on the "put" operation; that achieves the same thing, but I thought it looked more ergonomic with Get/Put. Most of the code is in the test.
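For reference, the shape of such a wrapper is roughly the following. This is a sketch of the pattern under discussion, not the PR's actual code: the type name, method names, and the 1 MiB cap are all illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// bufferPool wraps sync.Pool with the "if-check on put" mentioned above:
// Put declines to retain oversized buffers, so one huge nested jar
// doesn't pin a giant allocation in the pool forever.
type bufferPool struct {
	pool sync.Pool
	max  int // largest buffer worth keeping around
}

func newBufferPool(max int) *bufferPool {
	return &bufferPool{
		pool: sync.Pool{New: func() any { return []byte(nil) }},
		max:  max,
	}
}

// Get returns a slice of length n, reusing pooled capacity when possible.
func (p *bufferPool) Get(n int) []byte {
	b := p.pool.Get().([]byte)
	if cap(b) < n {
		b = make([]byte, n)
	}
	return b[:n]
}

// Put returns a buffer to the pool unless it exceeds the retention cap.
func (p *bufferPool) Put(b []byte) {
	if cap(b) > p.max {
		return // too large: let the GC reclaim it
	}
	p.pool.Put(b[:0])
}

func main() {
	p := newBufferPool(1 << 20)
	b := p.Get(512)
	fmt.Println(len(b)) // 512
	p.Put(b)
}
```

The interesting part is only the single `cap(b) > p.max` guard; everything else is plumbing, which is why most of the diff's weight is in its test.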

Particularly since our internal response has wound down, and we have fewer cycles to work on this project :)

True, but this could well improve throughput for other users, too. For us internally, it was a bottleneck. See below.

We already hit performance oddities with this parser in the past, ...

Yes, performance seemed noticeably worse after the switchover to this version, though I wasn't immediately able to pin down why, even when comparing profiles.

...and I don't want to have to hunt down weird behavior or bugs.

All new code carries the risk of bugs, that's true. If there's another test you'd like that would provide stronger guarantees (also for future contributions), I can look into adding it.

Otherwise, I'll just leave this code here for others to potentially pull in.

What's the motivation here for this performance work? Are you hitting a bottleneck with some use case? Or is this strictly about improving performance for its own sake? If it's the latter, I'm inclined to leave this as is.

In the last run we did internally, this was a noticeable bottleneck: this codepath was responsible for >50% of all allocations. Allocations are costly in Go by themselves, but the more processors (GOMAXPROCS) a Go program is given, the more of a bottleneck they become, as some global data structures are involved. For us, this effect was very large.
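To make the GOMAXPROCS point concrete, here's a toy measurement (not the scanner's code; the helper and buffer size are illustrative) comparing per-operation allocation under parallel load, with and without reuse:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var sink byte // keeps the allocations observable to the compiler

// bytesPerOp measures allocated bytes per operation of f when run from
// many goroutines at once, modeling a scanner running with a high
// GOMAXPROCS.
func bytesPerOp(f func()) int64 {
	r := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				f()
			}
		})
	})
	return r.AllocedBytesPerOp()
}

func main() {
	const size = 1 << 20 // 1 MiB, roughly "one nested-jar read"

	// Fresh allocation on every operation.
	fresh := bytesPerOp(func() {
		buf := make([]byte, size)
		sink = buf[0]
	})

	// Reuse via sync.Pool instead of allocating each time.
	pool := sync.Pool{New: func() any { return make([]byte, size) }}
	pooled := bytesPerOp(func() {
		buf := pool.Get().([]byte)
		sink = buf[0]
		pool.Put(buf)
	})

	fmt.Println("fresh  B/op:", fresh)
	fmt.Println("pooled B/op:", pooled)
}
```

The absolute numbers vary by machine, but the pooled variant allocates orders of magnitude less per operation, which is exactly the pressure on the shared allocator and GC state that grows with core count.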

You're right that the response has wound down. But I assume others can benefit from this all the same.

ericchiang commented 2 years ago

Okay thanks for the reply! This lgtm then

ericchiang commented 2 years ago

Actually, I see that the commits make sense on their own. Will merge as is