flanglet opened this issue 8 years ago
cc @randall77 @josharian @dr2chase
I'm not seeing what you are seeing.
The performance of 1.6.2 and tip are nearly identical for decode (tip is quite a bit better for encode).
I also don't see any variance between 'val <= 1' and 'val&1 == val'.
That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary.
The indexing produces code like:
y = x + 1
load from p + x + 1
The latter is a single complex-addressing mode load, and the x+1 was folded into it. In between those two instructions, both y and x are live. So the y=x+1 can't be done in place and a bunch of moves need to fix things up afterward.
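A purely illustrative Go sketch of the pattern described above (the function and names are hypothetical, not from kanzi): the compiler folds x+1 into the load's addressing mode, but the separate y = x+1 result is still live across the load, so both values need registers and the allocator inserts extra copies to fix things up.

```go
package main

import "fmt"

// indexPair keeps both x+1 and the value loaded at p[x+1] live at once,
// mirroring the "y = x + 1; load from p + x + 1" shape described above.
func indexPair(p []byte, x int) (byte, int) {
	y := x + 1  // y = x + 1
	v := p[x+1] // load from p + x + 1; the x+1 is folded into the addressing mode
	return v, y // y and x are both live between the two instructions
}

func main() {
	v, y := indexPair([]byte{10, 20, 30}, 0)
	fmt.Println(v, y)
}
```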
I'll try to fix this up during the 1.8 cycle.
OK. I suppose the issue has been addressed between 1.7beta1 and tip then. What remains is an optimization.
@randall77, anything left here for Go 1.8?
No, I did not get to fixing anything else here. Punting to 1.9.
I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640.
This shows a steady regression over time (tip at b53acd89db5847b9ddcba076df89bef8788dd348):
name \ time/op 17 18 tip
Speed/encode-8 314µs ± 1% 406µs ± 1% 438µs ± 2%
Speed/decode-8 314µs ± 1% 333µs ± 1% 356µs ± 1%
CL 43491 applied to tip helps a fair amount:
name old time/op new time/op delta
Speed/encode-8 438µs ± 2% 374µs ± 0% -14.71% (p=0.000 n=10+9)
Speed/decode-8 356µs ± 1% 320µs ± 1% -9.93% (p=0.000 n=10+10)
That brings it back below 1.8 levels, although still not as good as 1.7.
CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.
I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there are families of loops where we can arrange to have "optimal" answers, and we should probably arrange to do that in the block ordering code and then see what's left for heuristics.
For example, if there is a block in the loop that dominates all the exits and itself is conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop.
If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B.
In "layout", we'd do this by detecting transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models, and I think that the rest just falls out -- i.e., if you start with C, then the only logical successor is D then A, then B. This might subsume the loop rotation phase.
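A rough Go analogue of that diamond shape (purely illustrative; the compiler works on SSA basic blocks, not source, and the function below is hypothetical): A is the loop-header test, B is the arm with the exit X, C is the other arm, and D is the join that takes the back edge to A. The proposed order C D A B puts the exit-dominating block at the bottom.

```go
package main

import "fmt"

// diamond has the CFG A -> (B,C), B -> (X,D), C -> D, D -> A:
// A tests parity, B's arm contains the exit X, C is the plain arm,
// and the end of the loop body is the join D with a back edge to A.
func diamond(xs []int) int {
	sum := 0
	for i := 0; ; i++ { // back edge D -> A
		if i&1 == 0 { // A -> (B, C)
			if i >= len(xs) { // B -> (X, D)
				return sum // X: the loop exit
			}
			sum += xs[i] // rest of B, then falls through to D
		} else {
			sum++ // C, falls through to D
		}
	}
}

func main() {
	fmt.Println(diamond([]int{5, 7, 9}))
}
```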
go1.10beta1 shows a 15% performance regression on this code vs. go 1.9.2 (see bench at https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/benchmark/Functions_test.go).
go1.9.2 windows/amd64
go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof
BenchmarkZRLT-8 10000 1032659 ns/op
PASS
ok command-line-arguments 10.680s
Duration: 10.60s, Total samples = 10.42s (98.28%)
Showing nodes accounting for 10.41s, 99.90% of 10.42s total
Dropped 2 nodes (cum <= 0.05s)
flat flat% sum% cum cum%
6.39s 61.32% 61.32% 6.39s 61.32% kanzi/function.(*ZRLT).Forward
4.02s 38.58% 99.90% 4.02s 38.58% kanzi/function.(*ZRLT).Inverse <===
0 0% 99.90% 10.41s 99.90% command-line-arguments.BenchmarkZRLT
0 0% 99.90% 10.41s 99.90% testing.(*B).launch
0 0% 99.90% 10.41s 99.90% testing.(*B).runN
go1.10beta1 windows/amd64
go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof
BenchmarkZRLT-8 10000 1185367 ns/op
PASS
ok command-line-arguments 12.179s
Duration: 12.10s, Total samples = 12.61s (104.20%)
Showing nodes accounting for 12.27s, 97.30% of 12.61s total
Dropped 46 nodes (cum <= 0.06s)
Showing top 10 nodes out of 24
flat flat% sum% cum cum%
6.36s 50.44% 50.44% 6.36s 50.44% kanzi/function.(*ZRLT).Forward
5.58s 44.25% 94.69% 5.58s 44.25% kanzi/function.(*ZRLT).Inverse <===
0.08s 0.63% 95.32% 0.08s 0.63% runtime.stdcall1
0.06s 0.48% 95.80% 0.13s 1.03% runtime/pprof.(*profMap).lookup
0.04s 0.32% 96.11% 0.07s 0.56% runtime.mapaccess1_fast64
0.04s 0.32% 96.43% 0.13s 1.03% runtime.ready
0.04s 0.32% 96.75% 0.17s 1.35% runtime/pprof.(*profileBuilder).addCPUData
0.03s 0.24% 96.99% 0.07s 0.56% runtime/pprof.readProfile
0.02s 0.16% 97.15% 0.08s 0.63% runtime.findrunnable
0.02s 0.16% 97.30% 0.12s 0.95% runtime.park_m
Problematic lines: 166 and 194
ROUTINE ======================== kanzi/function.(*ZRLT).Inverse
5.58s 5.58s (flat, cum) 44.25% of Total
. . 147: runLength := 1
. . 148: srcIdx, dstIdx := 0, 0
. . 149: var err error
. . 150:
. . 151: if srcIdx < srcEnd {
410ms 410ms 152: for dstIdx < dstEnd {
90ms 90ms 153: if runLength > 1 {
260ms 260ms 154: runLength--
60ms 60ms 155: dst[dstIdx] = 0
. . 156: dstIdx++
. . 157: continue
. . 158: }
. . 159:
340ms 340ms 160: if srcIdx >= srcEnd {
. . 161: break
. . 162: }
. . 163:
770ms 770ms 164: val := src[srcIdx]
. . 165:
1.14s 1.14s 166: if val <= 1 {
. . 167: // Generate the run length bit by bit (but force MSB)
. . 168: runLength = 1
. . 169:
100ms 100ms 170: for val <= 1 {
50ms 50ms 171: runLength = (runLength << 1) | int(val)
40ms 40ms 172: srcIdx++
. . 173:
. . 174: if srcIdx >= srcEnd {
. . 175: break
. . 176: }
. . 177:
30ms 30ms 178: val = src[srcIdx]
. . 179: }
. . 180:
50ms 50ms 181: continue
. . 182: }
. . 183:
. . 184: // Regular data processing
340ms 340ms 185: if val == 0xFF {
. . 186: srcIdx++
. . 187:
. . 188: if srcIdx >= srcEnd {
. . 189: break
. . 190: }
. . 191:
. . 192: dst[dstIdx] = 0xFE + src[srcIdx]
. . 193: } else {
1.62s 1.62s 194: dst[dstIdx] = val - 1
. . 195: }
. . 196:
. . 197: srcIdx++
280ms 280ms 198: dstIdx++
. . 199: }
. . 200: }
. . 201:
. . 202: // If runLength is not 1, add trailing 0s
. . 203: end := dstIdx + runLength - 1
@flanglet I don't see the same performance issues you do. (I'm on linux/amd64, in case that makes a difference.)
$ ~/go1.9.2/bin/go test -test.bench=ZRLT -test.count=10 > log192
$ ~/goTip/bin/go test -test.bench=ZRLT -test.count=10 > logTip
$ benchstat log192 logTip
name old time/op new time/op delta
ZRLT 541µs ± 9% 547µs ± 9% ~ (p=0.393 n=10+10)
There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.
I don't see the same performance issues you do. (I'm on linux/amd64, in case that makes a difference.)
I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.
There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.
Increase the test count. I will work on reducing the noise.
@flanglet: Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue. Are your linux and windows boxes two separate machines? Might it be the CPU, not the OS?
The Windows 7 machine has an Intel Core i7-2600 (Sandy Bridge) CPU. The Linux machine has an Intel Core i7-7700K (Kaby Lake) CPU. So, yeah, it might be CPU related.
@TocarIP , might be up your alley.
Couldn't reproduce on Sandy Bridge
model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
I see:
ZRLT-24 935µs ± 0% 931µs ± 1% -0.40% (p=0.009 n=10+10)
Well, that is weird. It happens every time on my machine. FWIW, I downloaded the go package from https://golang.org/dl/#go1.10beta1. Maybe you tested a slightly different version? Anyway, I will wait for newer releases and keep on testing this case.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go1.6 windows/amd64 and go1.7beta1 windows/amd64

What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
set GORACE=
set GOROOT=E:\Program Files\go
set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1

Output for 1.6:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 10694  Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419   Throughput [MB/s]: 321
ZRLT encoding [ms]: 10753  Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472   Throughput [MB/s]: 319
ZRLT encoding [ms]: 10724  Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393   Throughput [MB/s]: 322
Output for 1.7beta1:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6834  Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560 Throughput [MB/s]: 206
ZRLT encoding [ms]: 6828  Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589 Throughput [MB/s]: 205
ZRLT encoding [ms]: 6790  Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558 Throughput [MB/s]: 206
I narrowed down the issue to the run length decoding loop:
If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6).
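A minimal sketch of the two equivalent loop conditions, assuming byte input (the helper function is illustrative, extracted loosely from the run-length loop in the profile listing above, not kanzi's actual code): for byte values, 'val <= 1' and 'val&1 == val' both mean "val is 0 or 1", but they can compile to different branch sequences.

```go
package main

import "fmt"

// runLengthBits decodes a run length bit by bit from the leading 0/1
// bytes of src. The condition val&1 == val is equivalent to val <= 1
// for byte values, but was observed to compile to faster code.
func runLengthBits(src []byte) int {
	runLength := 1
	srcIdx := 0
	for srcIdx < len(src) && src[srcIdx]&1 == src[srcIdx] { // same as src[srcIdx] <= 1
		runLength = (runLength << 1) | int(src[srcIdx])
		srcIdx++
	}
	return runLength
}

func main() {
	fmt.Println(runLengthBits([]byte{1, 0, 5})) // bits 1,0 appended below the forced MSB
}
```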
Output for 1.7beta1 with code change:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6800  Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669  Throughput [MB/s]: 310
ZRLT encoding [ms]: 6813  Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689  Throughput [MB/s]: 310
ZRLT encoding [ms]: 6775  Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662  Throughput [MB/s]: 311