flanglet opened this issue 8 years ago
cc @randall77 @josharian @dr2chase
I'm not seeing what you are seeing.
The performance of 1.6.2 and tip are nearly identical for decode (tip is quite a bit better for encode).
I also don't see any variance between 'val <= 1' and 'val&1 == val'.
That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary.
The indexing produces code like:
y = x + 1
load from p + x + 1
The latter is a single complex-addressing mode load, and the x+1 was folded into it. In between those two instructions, both y and x are live. So the y=x+1 can't be done in place and a bunch of moves need to fix things up afterward.
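A purely illustrative Go sketch of the pattern described above (the function and names are hypothetical, not from kanzi): the compiler folds x+1 into the load's addressing mode, but the separate y = x+1 result is still live across the load, so both values need registers and the allocator inserts extra copies to fix things up.

```go
package main

import "fmt"

// indexPair keeps both x+1 and the value loaded at p[x+1] live at once,
// mirroring the "y = x + 1; load from p + x + 1" shape described above.
func indexPair(p []byte, x int) (byte, int) {
	y := x + 1  // y = x + 1
	v := p[x+1] // load from p + x + 1; the x+1 is folded into the addressing mode
	return v, y // y and x are both live between the two instructions
}

func main() {
	v, y := indexPair([]byte{10, 20, 30}, 0)
	fmt.Println(v, y)
}
```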
I'll try to fix this up during the 1.8 cycle.
OK. I suppose the issue has been addressed between 1.7beta1 and tip then. What remains is an optimization.
@randall77, anything left here for Go 1.8?
No, I did not get to fixing anything else here. Punting to 1.9.
I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640.
This shows a steady regression over time (tip at b53acd89db5847b9ddcba076df89bef8788dd348):
name \ time/op 17 18 tip
Speed/encode-8 314µs ± 1% 406µs ± 1% 438µs ± 2%
Speed/decode-8 314µs ± 1% 333µs ± 1% 356µs ± 1%
CL 43491 applied to tip helps a fair amount:
name old time/op new time/op delta
Speed/encode-8 438µs ± 2% 374µs ± 0% -14.71% (p=0.000 n=10+9)
Speed/decode-8 356µs ± 1% 320µs ± 1% -9.93% (p=0.000 n=10+10)
That brings it back below 1.8 levels, although still not as good as 1.7.
CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.
I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there are families of loops where we can arrange to have "optimal" answers, and we should probably arrange to do that in the block ordering code and then see what's left for heuristics.
For example, if there is a block in the loop that dominates all the exits and itself is conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop.
If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B.
In "layout", we'd do this by detecting transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models, and I think that the rest just falls out -- i.e., if you start with C, then the only logical successor is D then A, then B. This might subsume the loop rotation phase.
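A rough Go analogue of that diamond shape (purely illustrative; the compiler works on SSA basic blocks, not source, and the function below is hypothetical): A is the loop-header test, B is the arm with the exit X, C is the other arm, and D is the join that takes the back edge to A. The proposed order C D A B puts the exit-dominating block at the bottom.

```go
package main

import "fmt"

// diamond has the CFG A -> (B,C), B -> (X,D), C -> D, D -> A:
// A tests parity, B's arm contains the exit X, C is the plain arm,
// and the end of the loop body is the join D with a back edge to A.
func diamond(xs []int) int {
	sum := 0
	for i := 0; ; i++ { // back edge D -> A
		if i&1 == 0 { // A -> (B, C)
			if i >= len(xs) { // B -> (X, D)
				return sum // X: the loop exit
			}
			sum += xs[i] // rest of B, then falls through to D
		} else {
			sum++ // C, falls through to D
		}
	}
}

func main() {
	fmt.Println(diamond([]int{5, 7, 9}))
}
```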
go1.10beta1 shows a 15% performance regression on this code vs. go 1.9.2 (see bench at https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/benchmark/Functions_test.go).
go1.9.2 windows/amd64
go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof
BenchmarkZRLT-8 10000 1032659 ns/op
PASS
ok command-line-arguments 10.680s
Duration: 10.60s, Total samples = 10.42s (98.28%)
Showing nodes accounting for 10.41s, 99.90% of 10.42s total
Dropped 2 nodes (cum <= 0.05s)
flat flat% sum% cum cum%
6.39s 61.32% 61.32% 6.39s 61.32% kanzi/function.(*ZRLT).Forward
4.02s 38.58% 99.90% 4.02s 38.58% kanzi/function.(*ZRLT).Inverse <===
0 0% 99.90% 10.41s 99.90% command-line-arguments.BenchmarkZRLT
0 0% 99.90% 10.41s 99.90% testing.(*B).launch
0 0% 99.90% 10.41s 99.90% testing.(*B).runN
go1.10beta1 windows/amd64
go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof
BenchmarkZRLT-8 10000 1185367 ns/op
PASS
ok command-line-arguments 12.179s
Duration: 12.10s, Total samples = 12.61s (104.20%)
Showing nodes accounting for 12.27s, 97.30% of 12.61s total
Dropped 46 nodes (cum <= 0.06s)
Showing top 10 nodes out of 24
flat flat% sum% cum cum%
6.36s 50.44% 50.44% 6.36s 50.44% kanzi/function.(*ZRLT).Forward
5.58s 44.25% 94.69% 5.58s 44.25% kanzi/function.(*ZRLT).Inverse <===
0.08s 0.63% 95.32% 0.08s 0.63% runtime.stdcall1
0.06s 0.48% 95.80% 0.13s 1.03% runtime/pprof.(*profMap).lookup
0.04s 0.32% 96.11% 0.07s 0.56% runtime.mapaccess1_fast64
0.04s 0.32% 96.43% 0.13s 1.03% runtime.ready
0.04s 0.32% 96.75% 0.17s 1.35% runtime/pprof.(*profileBuilder).addCPUData
0.03s 0.24% 96.99% 0.07s 0.56% runtime/pprof.readProfile
0.02s 0.16% 97.15% 0.08s 0.63% runtime.findrunnable
0.02s 0.16% 97.30% 0.12s 0.95% runtime.park_m
Problematic lines: 166 and 194
ROUTINE ======================== kanzi/function.(*ZRLT).Inverse
5.58s 5.58s (flat, cum) 44.25% of Total
. . 147: runLength := 1
. . 148: srcIdx, dstIdx := 0, 0
. . 149: var err error
. . 150:
. . 151: if srcIdx < srcEnd {
410ms 410ms 152: for dstIdx < dstEnd {
90ms 90ms 153: if runLength > 1 {
260ms 260ms 154: runLength--
60ms 60ms 155: dst[dstIdx] = 0
. . 156: dstIdx++
. . 157: continue
. . 158: }
. . 159:
340ms 340ms 160: if srcIdx >= srcEnd {
. . 161: break
. . 162: }
. . 163:
770ms 770ms 164: val := src[srcIdx]
. . 165:
1.14s 1.14s 166: if val <= 1 {
. . 167: // Generate the run length bit by bit (but force MSB)
. . 168: runLength = 1
. . 169:
100ms 100ms 170: for val <= 1 {
50ms 50ms 171: runLength = (runLength << 1) | int(val)
40ms 40ms 172: srcIdx++
. . 173:
. . 174: if srcIdx >= srcEnd {
. . 175: break
. . 176: }
. . 177:
30ms 30ms 178: val = src[srcIdx]
. . 179: }
. . 180:
50ms 50ms 181: continue
. . 182: }
. . 183:
. . 184: // Regular data processing
340ms 340ms 185: if val == 0xFF {
. . 186: srcIdx++
. . 187:
. . 188: if srcIdx >= srcEnd {
. . 189: break
. . 190: }
. . 191:
. . 192: dst[dstIdx] = 0xFE + src[srcIdx]
. . 193: } else {
1.62s 1.62s 194: dst[dstIdx] = val - 1
. . 195: }
. . 196:
. . 197: srcIdx++
280ms 280ms 198: dstIdx++
. . 199: }
. . 200: }
. . 201:
. . 202: // If runLength is not 1, add trailing 0s
. . 203: end := dstIdx + runLength - 1
@flanglet I don't see the same performance issues you do. (I'm on linux/amd64, in case that makes a difference.)
$ ~/go1.9.2/bin/go test -test.bench=ZRLT -test.count=10 > log192
$ ~/goTip/bin/go test -test.bench=ZRLT -test.count=10 > logTip
$ benchstat log192 logTip
name old time/op new time/op delta
ZRLT 541µs ± 9% 547µs ± 9% ~ (p=0.393 n=10+10)
There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.
I don't see the same performance issues you do. (I'm on linux/amd64, in case that makes a difference.)
I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.
There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.
Increase the test count. I will work on reducing the noise.
@flanglet: Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue. Are your linux and windows boxes two separate machines? Might it be the CPU, not the OS?
The Windows 7 machine has an Intel Core i7-2600 (Sandy Bridge) CPU. The Linux machine has an Intel Core i7-7700K (Kaby Lake) CPU. So, yeah, it might be CPU related.
@TocarIP , might be up your alley.
Couldn't reproduce on Sandy Bridge
model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
I see:
ZRLT-24 935µs ± 0% 931µs ± 1% -0.40% (p=0.009 n=10+10)
Well, that is weird. It happens every time on my machine. FWIW, I downloaded the go package from https://golang.org/dl/#go1.10beta1. Maybe you tested a slightly different version? Anyway, I will wait for newer releases and keep on testing this case.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go1.6 windows/amd64 and go1.7beta1 windows/amd64

What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
set GORACE=
set GOROOT=E:\Program Files\go
set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1

Output for 1.6:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 10694  Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419   Throughput [MB/s]: 321
ZRLT encoding [ms]: 10753  Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472   Throughput [MB/s]: 319
ZRLT encoding [ms]: 10724  Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393   Throughput [MB/s]: 322
Output for 1.7beta1:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6834  Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560 Throughput [MB/s]: 206
ZRLT encoding [ms]: 6828  Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589 Throughput [MB/s]: 205
ZRLT encoding [ms]: 6790  Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558 Throughput [MB/s]: 206
I narrowed down the issue to the run length decoding loop:
If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6).
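A minimal sketch of the two equivalent loop conditions, assuming byte input (the helper function is illustrative, extracted loosely from the run-length loop in the profile listing above, not kanzi's actual code): for byte values, 'val <= 1' and 'val&1 == val' both mean "val is 0 or 1", but they can compile to different branch sequences.

```go
package main

import "fmt"

// runLengthBits decodes a run length bit by bit from the leading 0/1
// bytes of src. The condition val&1 == val is equivalent to val <= 1
// for byte values, but was observed to compile to faster code.
func runLengthBits(src []byte) int {
	runLength := 1
	srcIdx := 0
	for srcIdx < len(src) && src[srcIdx]&1 == src[srcIdx] { // same as src[srcIdx] <= 1
		runLength = (runLength << 1) | int(src[srcIdx])
		srcIdx++
	}
	return runLength
}

func main() {
	fmt.Println(runLengthBits([]byte{1, 0, 5})) // bits 1,0 appended below the forced MSB
}
```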
Output for 1.7beta1 with code change:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6800  Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669  Throughput [MB/s]: 310
ZRLT encoding [ms]: 6813  Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689  Throughput [MB/s]: 310
ZRLT encoding [ms]: 6775  Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662  Throughput [MB/s]: 311