hpi-xnor / BMXNet-v2

BMXNet 2: An Open-Source Binary Neural Network Implementation Based on MXNet
Apache License 2.0

Query about the new update about inference on CPU and GPU. #5

Open kaivu1999 opened 5 years ago

kaivu1999 commented 5 years ago

Description

I am mainly interested in the speed-up I can get on CPU and GPU, especially for inference.

Following the answer by @yanghaojin, I tried BMXNet v1 for this and I get a CPU speed-up of about 1.4x to 1.7x on my PC for some models, but also a slowdown in some cases. I used Ubuntu 16.04 (64-bit) on an Intel(R) Core™ i5-8250U CPU @ 1.60GHz (supports SSE4.2).

Can you please elaborate on the update of 21st May 2019 (the one listed in the changelog) with respect to speed-up?

simonmaurer commented 5 years ago

@kaivu1999 that's a really good discussion you brought up, and it has been of interest since the release of BMXNet v1.

Picking up on the findings of @yanghaojin: for training you skipped the XNOR GEMM computation and used the cuDNN implementation on GPUs (as per your justification). Theoretically, XNOR computations could be sped up on GPUs as well for inference (say, under real-time constraints on robotic platforms) - did you find the reasons for this (e.g. well-optimized cuDNN code by NVIDIA's teams)?

Now considering only CPU inference, which is the main focus of the whole research (the ultimate goal being to run DL models, preferably in real time, on CPU-only devices): the results of @kaivu1999 don't coincide with the results of the BMXNet papers and the preceding XNOR-Net paper (theoretically up to 32x); maybe we can verify this with BMXNet v2. Clearly there are other steps involved, like input binarization or patch2col, that add to the total processing time as part of the convolution operation. @yanghaojin, @jopyth I would be interested as well if you could elaborate a bit on that.
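
As an aside for readers, here is a minimal NumPy sketch of the xnor-popcount GEMM idea under discussion. It is purely illustrative (not BMXNet's actual kernel; names and shapes are made up) and makes visible the binarization and bit-packing steps that surround the core GEMM and contribute to total layer time:

```python
# Illustrative xnor-popcount GEMM in NumPy (not BMXNet's kernel).
import numpy as np

def binarize(x):
    # sign binarization to {-1, +1}; zero maps to +1
    return np.where(x >= 0, 1.0, -1.0)

def pack_rows(b):
    # map {-1, +1} -> {0, 1} and pack each row into uint8 words
    return np.packbits((b > 0).astype(np.uint8), axis=1)

def xnor_gemm(a_packed, b_packed, k):
    # differing bits = popcount(a XOR b); dot product = k - 2 * differing
    x = np.bitwise_xor(a_packed[:, None, :], b_packed[None, :, :])
    differing = np.unpackbits(x, axis=-1).sum(axis=-1)
    return k - 2.0 * differing

M, K, N = 4, 64, 3   # K must be a multiple of 8 for this toy packing
A, B = np.random.randn(M, K), np.random.randn(K, N)
Ab, Bb = binarize(A), binarize(B)

reference = Ab @ Bb                                    # float GEMM on +/-1 values
result = xnor_gemm(pack_rows(Ab), pack_rows(Bb.T), K)  # binarize + pack + xnor-gemm
assert np.allclose(reference, result)
```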

kaivu1999 commented 5 years ago

@simonmaurer I think in the XNOR-Net paper they mention 32x memory savings and a 52x computation speed-up. I would also like to ask which specific files I should look at in the repository to understand the optimization part.
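
For reference, the 32x memory figure follows directly from the weight precision, while the computation figure depends heavily on what is measured against what (see the replies below). A rough back-of-the-envelope view, assuming 64-bit words and ignoring binarization, packing, and im2col overheads:

```latex
% Memory: full-precision vs. binary weights
\frac{32~\text{bits (float32 weight)}}{1~\text{bit (binary weight)}} = 32\times

% Compute (very rough upper bound): 64 multiply-accumulates replaced by
% roughly one XNOR and one popcount on a 64-bit word
\frac{64~\text{MAC ops}}{\approx 2~\text{word ops}} \approx 32\times
```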

yanghaojin commented 5 years ago

@simonmaurer We use cuDNN for training because it is highly optimized and much faster than our preliminary CUDA implementation of the XNOR kernel. I believe an optimized XNOR CUDA kernel would indeed improve the speed a lot, but we are not experts in that area, so we leave it to the community for now. In our first paper, the evaluation mainly compared the xnor-gemm function with traditional dot operators on the CPU. So the XNOR CPU kernel is indeed much faster than a standard dot engine like CBLAS, but as I mentioned, there is also huge optimization potential in the rest of the convolution layer. In BMXNet v1 our evaluation showed that the xnor-gemm computation is only a small portion of the overall computation of a single QConv layer (see this screenshot; the result might be outdated, but I think the ratio of each part doesn't change much).
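
To put this observation in formula form (a sketch of the reasoning, not a measurement): if the xnor-gemm accounts for a fraction p of a QConv layer's runtime and is accelerated by a factor s, the layer-level speed-up is bounded by Amdahl's law:

```latex
S_{\text{layer}} = \frac{1}{(1 - p) + p / s}
% e.g. with hypothetical values p = 0.5 and s = 13:
S_{\text{layer}} = \frac{1}{0.5 + 0.5 / 13} \approx 1.86\times
% so binarization, packing and im2col dominate once the GEMM itself is fast
```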

Regarding the 52x speed-up mentioned in the XNOR-Net paper: I think this number is also based on the xnor-gemm function alone, not on the whole convolution layer. And they only compared against a naive implementation of a dot engine, without even reporting a comparison with CBLAS (ATLAS or others). I saw their code in Darknet before they removed it years ago (they launched the startup XNOR.ai and thus removed the code from Darknet).
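
To illustrate why the baseline matters for such speed-up numbers, a small timing sketch (the Python-level loop grossly exaggerates the gap, but the point is only that a naive dot implementation is a far weaker baseline than a tuned BLAS):

```python
# Illustration only: reported GEMM speed-ups depend heavily on the baseline.
import time
import numpy as np

def naive_gemm(A, B):
    # straightforward triple loop, no blocking, no vectorization
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            s = 0.0
            for k in range(K):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

A, B = np.random.randn(128, 128), np.random.randn(128, 128)

t0 = time.perf_counter(); C1 = naive_gemm(A, B); t1 = time.perf_counter()
C2 = A @ B;                                      t2 = time.perf_counter()

assert np.allclose(C1, C2)
print(f"naive: {t1 - t0:.3f}s   BLAS-backed: {t2 - t1:.6f}s")
```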

kaivu1999 commented 5 years ago

Thank you very much, @yanghaojin, for the detailed explanation.

Also, I got these numbers for BMXNet v1, considering only inference.

[Plots: inference speed-up for NIN-Cifar10 and VGG-11-Cifar10]

The blue line represents the experiment in which I tried to use the accelerated layers as much as possible. I accelerated the last layer (not recommended for accuracy), and for NIN I had to approximate some of the layers in between so that the input channels are a multiple of 64, since the accelerated layers (QActi, QConv, QFullyC) seem to support only input sizes that are multiples of 64.
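
In case it helps others, a small sketch of where the multiple-of-64 requirement plausibly comes from (assumption: binary values along the input-channel axis are packed into 64-bit words, so channel counts get padded up to the word size):

```python
# Why input channels in multiples of 64: binary activations along the channel
# axis are packed into 64-bit words, so any remainder must be padded.
WORD_BITS = 64  # assumed packing word size

def padded_channels(c):
    # next multiple of 64 that can hold c packed binary channels
    return ((c + WORD_BITS - 1) // WORD_BITS) * WORD_BITS

for c in (3, 64, 100, 192):
    print(c, "->", padded_channels(c))
# 3 -> 64, 64 -> 64, 100 -> 128, 192 -> 192
```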

Also, for a network consisting only of convolution-activation layer pairs: in a network with 8 such pairs, where I accelerated 7 of them in the binary version, I get a speed-up of 6.8x with a batch size of 256.
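
Purely as a back-of-the-envelope check, and assuming all eight pairs take roughly equal time (which is rarely exact), the Amdahl bound above can be inverted to see what per-pair speed-up the observed 6.8x would imply:

```latex
\frac{1}{(1 - p) + p / s} = 6.8, \quad p = \tfrac{7}{8}
\;\Rightarrow\; \frac{7/8}{s} = \frac{1}{6.8} - \frac{1}{8} \approx 0.022
\;\Rightarrow\; s \approx 40\times \text{ per accelerated pair (under these assumptions)}
```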

On a similar note, I would also really like to know about the update. Thanks in advance, I would be really happy if you could help me. To restate my open questions:

Which specific files should I look at in the repository to understand the optimization part?

Can you please elaborate on the update of 21st May 2019 (the one listed in the changelog) with respect to speed-up?