less forward speed-up when batch size is larger

Hi, first of all, thanks for the great work!

I used benchmark_score.py to evaluate the forward latency of Resnet-18 and Resnet-18-binary.

Although Resnet-18-binary is about 1.5x faster at batch size 1, the speed-up decreases as the batch size grows. At batch size 32, the two models have almost the same latency. Do you know why that happens?
The GPU performance of Resnet-18-binary is much worse than that of the floating-point model. I understand that your optimization focuses on CPU rather than GPU, but I expected the binary model to have at least similar GPU performance to the FP model. Why is it so much worse?

Here are my benchmark results:

INFO:root:network: resnet-18-binary
INFO:root:device: gpu(0)
INFO:root:batch size 1, image/sec: 16.735898
INFO:root:batch size 2, image/sec: 25.027532
INFO:root:batch size 4, image/sec: 33.737085
INFO:root:batch size 8, image/sec: 41.273390
INFO:root:batch size 16, image/sec: 47.007433
INFO:root:batch size 32, image/sec: 50.493328

INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 6.693615
INFO:root:batch size 2, image/sec: 8.799900
INFO:root:batch size 4, image/sec: 11.307120
INFO:root:batch size 8, image/sec: 12.709365
INFO:root:batch size 16, image/sec: 12.371296
INFO:root:batch size 32, image/sec: 13.402594

INFO:root:network: resnet-18
INFO:root:device: gpu(0)
INFO:root:batch size 1, image/sec: 130.296734
INFO:root:batch size 2, image/sec: 192.971986
INFO:root:batch size 4, image/sec: 271.567828
INFO:root:batch size 8, image/sec: 338.648713
INFO:root:batch size 16, image/sec: 461.010049
INFO:root:batch size 32, image/sec: 486.325190

INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 4.363451
INFO:root:batch size 2, image/sec: 6.357484
INFO:root:batch size 4, image/sec: 8.384733
INFO:root:batch size 8, image/sec: 10.529395
INFO:root:batch size 16, image/sec: 11.955591
INFO:root:batch size 32, image/sec: 13.027583
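
To make the trend easier to see, here is a small sketch that computes the binary-over-FP throughput ratio for each batch size. The throughput values are copied from the logs above; the variable and function names are my own for illustration and are not part of benchmark_score.py.

```python
# Throughput in images/sec, copied from the benchmark logs above.
batch_sizes = [1, 2, 4, 8, 16, 32]

binary = {  # resnet-18-binary
    "gpu": [16.735898, 25.027532, 33.737085, 41.273390, 47.007433, 50.493328],
    "cpu": [6.693615, 8.799900, 11.307120, 12.709365, 12.371296, 13.402594],
}
fp = {  # resnet-18 (floating point)
    "gpu": [130.296734, 192.971986, 271.567828, 338.648713, 461.010049, 486.325190],
    "cpu": [4.363451, 6.357484, 8.384733, 10.529395, 11.955591, 13.027583],
}


def speedup(device):
    """Return {batch size: binary throughput / FP throughput} for one device."""
    return {
        b: bin_ips / fp_ips
        for b, bin_ips, fp_ips in zip(batch_sizes, binary[device], fp[device])
    }


cpu_speedup = speedup("cpu")
gpu_speedup = speedup("gpu")

for device, ratios in (("cpu", cpu_speedup), ("gpu", gpu_speedup)):
    for b, r in ratios.items():
        print(f"{device} batch {b:2d}: binary/FP = {r:.2f}x")
```

A ratio above 1 means the binary model is faster. On CPU the ratio starts near 1.5 at batch size 1 and falls close to 1 at batch size 32, while on GPU it stays well below 1 for every batch size.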
