aktau closed this pull request 2 years ago
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
For more information, open the CLA check for this pull request.
I'm not sure why this wants me to sign a CLA. This is from aktau@google.com.
UPDATE: I added my corp email to my GH account now. I wonder how I can make this check run again.
Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks
$ git checkout upstream/main
$ go test -run=NONE -bench=. ./... > old.txt
$ git checkout aktau/main
$ go test -run=NONE -bench=. ./... > new.txt
$ benchstat old.txt new.txt
name old time/op new time/op delta
Parse-8 16.7µs ± 0% 16.8µs ± 0% ~ (p=1.000 n=1+1)
ParseParallel-8 4.97µs ± 0% 4.97µs ± 0% ~ (p=1.000 n=1+1)
name old alloc/op new alloc/op delta
Parse-8 4.84kB ± 0% 4.84kB ± 0% ~ (p=1.000 n=1+1)
ParseParallel-8 4.92kB ± 0% 4.93kB ± 0% ~ (p=1.000 n=1+1)
name old allocs/op new allocs/op delta
Parse-8 31.0 ± 0% 31.0 ± 0% ~ (all equal)
ParseParallel-8 31.0 ± 0% 31.0 ± 0% ~ (all equal)
Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks
Indeed. Even with 10 runs per test we get this:
$ git checkout main ; for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do git checkout $commit && ( cd ./jar && go test -run=NONE -bench=. -count=10 | tee ~/tmp/$commit.bench ) ; done
$ benchstat 9d3aceeff48dc332d314d938ef6f3ed7837dfc1e.bench 61ade2d03f307bde6b98ed513f68f01afb9421c9.bench
name old time/op new time/op delta
Parse-4 38.5µs ± 9% 39.5µs ± 9% ~ (p=0.408 n=7+9)
ParseParallel-4 36.4µs ±23% 31.6µs ±12% -13.42% (p=0.035 n=10+9)
name old alloc/op new alloc/op delta
Parse-4 4.92kB ± 0% 4.92kB ± 0% ~ (p=0.671 n=8+10)
ParseParallel-4 4.94kB ± 0% 4.93kB ± 0% ~ (p=0.381 n=10+10)
name old allocs/op new allocs/op delta
Parse-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
ParseParallel-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
However, since we do see consistent runtime improvements on the jars in log4jscanner/jar/testdata, that may indicate that the benchmarks are not representative of the data inside testdata, or of the data we see in production.
Indeed, given the codepath I'm changing, I'd expect only nested jars to benefit (and those seem to be fairly common in production).
To verify, I ran all the jars in testdata individually:
$ git checkout main
TMPDIR="$HOME/tmp/tmpdir"
mkdir "$TMPDIR" || true
for file in jar/testdata/* ; do
RTMPDIR="$TMPDIR/$(basename $file)"
mkdir "$RTMPDIR" || true
ln -fv "$PWD/$file" "$RTMPDIR"
done
for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
git checkout $commit
go build
dirs="$(echo $TMPDIR/* | tr ' ' ',')"
hyperfine --ignore-failure --warmup 1 --parameter-list dir "$dirs" './log4jscanner {dir}'
rm log4jscanner
done
...
Summary
'./log4jscanner /home/aktau/tmp/tmpdir/400mb.jar' ran
1.12 ± 1.12 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar.patched'
1.13 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar.patched'
1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/notarealjar.jar'
1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydir.zip'
1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar.patched'
1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.signed.jar'
1.17 ± 1.18 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corruptjar.go'
1.18 ± 1.15 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar.patched'
1.18 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar.patched'
1.18 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class-executable'
1.19 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar.patched'
1.20 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt.jar'
1.21 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar.patched'
1.23 ± 1.44 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar.patched'
1.24 ± 1.39 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar.patched'
1.24 ± 1.22 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.signed.jar'
1.24 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar.patched'
1.25 ± 1.25 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.jar'
1.25 ± 1.30 times faster than './log4jscanner /home/aktau/tmp/tmpdir/generate.sh'
1.28 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld-executable'
1.34 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar.jar'
1.38 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt_jar_in_jar.jar'
1.41 ± 1.35 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar'
1.43 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_with_invalid_jar.jar'
1.48 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar'
1.52 ± 1.52 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.jar'
1.52 ± 1.42 times faster than './log4jscanner /home/aktau/tmp/tmpdir/zipbombs'
1.55 ± 1.50 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar_in_jar.jar'
1.59 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.0-beta9.jar'
1.62 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar'
1.63 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/selenium-api-3.141.59.jar'
1.76 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar'
1.78 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/similarbutnotvuln.jar'
1.86 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.2.jar'
1.86 ± 1.63 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar'
1.94 ± 2.04 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar'
1.97 ± 1.73 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar'
2.01 ± 1.72 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.16.0.jar'
2.76 ± 2.28 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar'
3.33 ± 2.80 times faster than './log4jscanner /home/aktau/tmp/tmpdir/shadow-6.1.0.jar'
7.64 ± 7.06 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar'
9.34 ± 7.40 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar'
13.78 ± 10.56 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydirs.zip'
326.03 ± 262.82 times faster than './log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar'
This is somewhat curious: I'd have expected 400mb.jar not to be the fastest case. But re-running it individually confirms the result.
Most of these invocations take so little time that it's hard to show statistically significant improvements. So we'll just focus on the four slowest ones (next comment).
(Actually, let's just focus on the slowest one. I've verified that the others don't change in speed over the 4 last commits.)
The results for 400mb_jar_in_jar, which is the only case where runtime is noticeably changed:
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
Time (mean ± σ): 1.484 s ± 0.326 s [User: 1.448 s, System: 0.999 s]
Range (min … max): 1.119 s … 2.087 s 10 runs
Previous HEAD position was 9d3acee Document commit that removed InitialContext
HEAD is now at 2932093 jar: close the zip.File reader before recursing
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
Time (mean ± σ): 1.542 s ± 0.174 s [User: 1.387 s, System: 1.076 s]
Range (min … max): 1.245 s … 1.739 s 10 runs
Previous HEAD position was 2932093 jar: close the zip.File reader before recursing
HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
Time (mean ± σ): 441.6 ms ± 85.7 ms [User: 355.1 ms, System: 112.7 ms]
Range (min … max): 318.3 ms … 578.1 ms 10 runs
Previous HEAD position was f6c1ebc jar: prefer io.ReadFull over io.ReadAll
HEAD is now at 61ade2d jar: reuse buffers for nested .jar's
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
Time (mean ± σ): 340.7 ms ± 57.9 ms [User: 269.7 ms, System: 78.8 ms]
Range (min … max): 298.8 ms … 481.4 ms 10 runs
So I added a benchmark to test 400mb_jar_in_jar specifically, and the results now do show an improvement (first versus last commit):
benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 37376ef3d7fcb898fd5e192d12a2afe4718ffc33.bench
name old time/op new time/op delta
Parse/400mb_jar_in_jar.jar-4 1.26s ±27% 0.27s ±10% -78.51% (p=0.000 n=10+10)
Parse/safe1.jar-4 38.0µs ± 9% 43.4µs ±40% +14.47% (p=0.006 n=9+9)
ParseParallel/400mb_jar_in_jar.jar-4 1.40s ±26% 0.20s ±32% -85.39% (p=0.000 n=10+9)
ParseParallel/safe1.jar-4 33.0µs ± 8% 61.4µs ±11% +86.03% (p=0.000 n=9+9)
name old alloc/op new alloc/op delta
Parse/400mb_jar_in_jar.jar-4 2.35GB ± 0% 0.05GB ±108% -97.86% (p=0.000 n=10+10)
Parse/safe1.jar-4 4.92kB ± 0% 4.92kB ± 0% ~ (p=0.810 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4 2.35GB ± 0% 0.02GB ±336% -99.18% (p=0.000 n=10+8)
ParseParallel/safe1.jar-4 4.93kB ± 0% 4.93kB ± 0% -0.08% (p=0.032 n=9+9)
name old allocs/op new allocs/op delta
Parse/400mb_jar_in_jar.jar-4 102 ±11% 28 ± 8% -72.43% (p=0.000 n=10+10)
Parse/safe1.jar-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
ParseParallel/400mb_jar_in_jar.jar-4 180 ± 3% 39 ± 9% -78.53% (p=0.000 n=10+8)
ParseParallel/safe1.jar-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
And versus just the io.ReadAll -> io.ReadFull CL:
benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 4b23cd3db5400b191c5310c07415848bc0a2bb6d.bench
name old time/op new time/op delta
Parse/400mb_jar_in_jar.jar-4 1.26s ±27% 0.30s ±17% -76.18% (p=0.000 n=10+9)
Parse/safe1.jar-4 38.0µs ± 9% 40.3µs ± 8% +6.15% (p=0.028 n=9+10)
ParseParallel/400mb_jar_in_jar.jar-4 1.40s ±26% 0.35s ±45% -74.81% (p=0.000 n=10+9)
ParseParallel/safe1.jar-4 33.0µs ± 8% 32.2µs ±15% ~ (p=0.243 n=9+10)
name old alloc/op new alloc/op delta
Parse/400mb_jar_in_jar.jar-4 2.35GB ± 0% 0.42GB ± 0% -82.14% (p=0.000 n=10+10)
Parse/safe1.jar-4 4.92kB ± 0% 4.92kB ± 0% ~ (p=0.463 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4 2.35GB ± 0% 0.42GB ± 0% -82.14% (p=0.000 n=10+10)
ParseParallel/safe1.jar-4 4.93kB ± 0% 4.93kB ± 0% ~ (p=0.159 n=9+10)
name old allocs/op new allocs/op delta
Parse/400mb_jar_in_jar.jar-4 102 ±11% 30 ± 0% -70.67% (p=0.000 n=10+6)
Parse/safe1.jar-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
ParseParallel/400mb_jar_in_jar.jar-4 180 ± 3% 52 ±17% -71.10% (p=0.000 n=10+10)
ParseParallel/safe1.jar-4 33.0 ± 0% 33.0 ± 0% ~ (all equal)
This indicates that the speed regression in the safe1.jar benchmark may be due to the garbage collector dealing with allocations from the other benchmarks [1].
[1]: Notice how some runs after the sync.Pool introduction allocate ~nothing:
BenchmarkParse/400mb_jar_in_jar.jar-4 4 299414355 ns/op 104883246 B/op 30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 5 249801048 ns/op 11801 B/op 26 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 269324904 ns/op 12186 B/op 27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 262044230 ns/op 12186 B/op 27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 259753494 ns/op 12186 B/op 27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 277134852 ns/op 104883246 B/op 30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 279076490 ns/op 104883246 B/op 30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 5 263421624 ns/op 83908649 B/op 28 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 264917986 ns/op 12186 B/op 27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4 4 286430956 ns/op 104883246 B/op 30 allocs/op
The pool is not cleared between benchmark runs.
Okay, taking a step back...
A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term. Particularly since our internal response has wound down, and we have fewer cycles to work on this project :) We already hit performance oddities with this parser in the past, and I don't want to have to hunt down weird behavior or bugs.
What's the motivation here for this performance work? Are you hitting a bottleneck with some use case? Or is this strictly just trying to improve performance for its own sake? If it's the latter, I'm inclined to leave this as is.
A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term.
Excluding comments, it's a tiny amount of code. At first I had it as a simple if-check on the "put" operation; that amounts to the same thing, but I thought Get/Put looked more ergonomic. Most of the code is in the test.
Particularly since our internal response has wound down, and we have fewer cycles to work on this project :)
True, but this could well improve throughput for other users, too. For us internally, it was a bottleneck. See below.
We already hit performance oddities with this parser in the past, ...
Yes, performance seemed noticeably worse after the switchover to this version, though I wasn't able to pin down immediately why even when comparing profiles.
...and I don't want to have to hunt down weird behavior or bugs.
All new code carries the risk of bugs, that's true. If there's another test you'd like that provides stronger guarantees (also for future contributions), I can look into adding it.
Otherwise, I'll just leave this code here for others to potentially pull in.
What's the motivation here for this performance work? Are you hitting a bottleneck with some use case? Or is this strictly just trying to improve performance for its own sake? If it's the latter, I'm inclined to leave this as is.
In the last run we did internally, this was a noticeable bottleneck: this codepath was responsible for >50% of all allocations. Allocations are costly in Go by themselves, but the more processors (GOMAXPROCS) a Go program is given, the more of a bottleneck they become, since some global runtime data structures are involved. For us, the impact was significant.
You're right that the response has wound down. But I assume others can benefit from this all the same.
Okay, thanks for the reply! This LGTM then.
Actually, I see that the commits make sense on their own. Will merge as is.
The io.ReadAll() call is responsible for a bit more than half of the allocations observed in a very highly scaled-up instance of this scanner [1]. On the local testdata, these commits had quite a positive effect. I also ran go test inside the jar/ subdirectory, and it passed.

Benchmarks:
[1]: