hpi-xnor / BMXNet

(New version is out: https://github.com/hpi-xnor/BMXNet-v2) BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet

Question about the speed of binary_mnist #17

Closed campos537 closed 5 years ago

campos537 commented 6 years ago

Hey guys, I trained a binary model and a normal one (LeNet) using my own dataset, adapting your code, and everything went fine. But when I try to measure speed using the classify method in the train_val.py file, it shows that LeNet takes less time to classify the images than BinaryLeNet. Do you know why? Shouldn't the binary model be a lot faster than the normal one?
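For reference, here is a minimal sketch of the kind of timing measurement in question (the checkpoint prefix, epoch, and input shape are placeholders, not the actual train_val.py code). MXNet executes asynchronously, so the output has to be waited on before stopping the timer:

```python
import time
import mxnet as mx
import numpy as np

# Placeholder checkpoint prefix/epoch -- substitute your own trained model.
sym, arg_params, aux_params = mx.model.load_checkpoint('lenet', 10)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 1, 28, 28))])
mod.set_params(arg_params, aux_params, allow_missing=True)

img = mx.nd.array(np.random.rand(1, 1, 28, 28))
mod.forward(mx.io.DataBatch([img]), is_train=False)   # warm-up pass
mod.get_outputs()[0].wait_to_read()

runs = 100
start = time.time()
for _ in range(runs):
    mod.forward(mx.io.DataBatch([img]), is_train=False)
    mod.get_outputs()[0].wait_to_read()                # sync: MXNet is asynchronous
print('avg latency: %.3f ms' % ((time.time() - start) / runs * 1000))
```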

yanghaojin commented 6 years ago

Hi @campos537, concerning the speed issue, please refer to my answer here. Another reason is the update of MXNet from 0.0.93 to 0.1.0: the convolution layer was re-implemented, which resulted in a large speedup of the dot engine. But I think the main reason is still that the GEMM computation accounts for only a small part of the total processing time of the convolution layer (about 13%, based on our profiling of the binary ResNet-18 model). So although the binary GEMM engine is much faster than CBLAS, it cannot have much effect on the overall inference time.
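To make the arithmetic concrete, here is a rough Amdahl's-law style estimate. Only the ~13% GEMM fraction comes from the profiling mentioned above; the 10x binary-GEMM speedup is an assumed, illustrative figure:

```python
# Even if binary GEMM were 10x faster than the float GEMM (assumed figure),
# a ~13% GEMM share caps the achievable convolution-layer speedup.
gemm_fraction = 0.13
assumed_gemm_speedup = 10.0
layer_speedup = 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / assumed_gemm_speedup)
print('upper bound on layer speedup: %.2fx' % layer_speedup)   # ~1.13x
```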

mengwanguc commented 5 years ago

Hi @yanghaojin, I understand that a binary model may not be much faster than the FP model, but why is it slower? In my experiment with binary MNIST, the inference latency of the FP model is 1.055 ms, while for the binary model it is 1.426 ms, i.e. roughly 35% more latency. So what is the extra time cost here?

yanghaojin commented 5 years ago

So basically, as posted in this thread (link), we don't apply binary convolution with XNOR and popcount when using the GPU, since our initial experimental results showed that a CUDA implementation with XNOR and popcount is not faster than cuDNN (which is highly optimized at the hardware level); it is actually much slower. Therefore, it makes little sense to test the binary model on the GPU. With this implementation, the GPU should be used for training (taking advantage of cuDNN, while all binary weights are still stored as floating-point parameters), and the real application scenario is CPU-based devices (where the model converter and the binary convolution implementation can be used).

More technical details: if you test the model in GPU mode, then in addition to the standard dot-product based convolution operation, we have to do two more things, namely binarize the input matrix and the weight matrix, i.e. convert the FP values of those matrices to binary values (+1, -1). So we pay for the extra memory accesses and the binarization itself. For the binarization of the input matrix in particular, the memory access does not follow the standard sequential order but is column-wise (a useful post about memory access can be found here: link), which slows the process down further.
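To illustrate the idea, here is a simplified sketch of sign-binarization and bit-packing (not the BMXNet implementation; word size and layout are simplified):

```python
import numpy as np

def binarize_and_pack(mat, word_bits=64):
    """Sign-binarize a float matrix to {+1, -1} and pack the sign bits of each
    row into unsigned words (bit = 1 for non-negative, 0 for negative)."""
    signs = (mat >= 0).astype(np.uint64)
    rows, cols = signs.shape
    words_per_row = (cols + word_bits - 1) // word_bits
    packed = np.zeros((rows, words_per_row), dtype=np.uint64)
    for r in range(rows):
        for c in range(cols):
            packed[r, c // word_bits] |= signs[r, c] << np.uint64(c % word_bits)
    return packed

# Packing the input matrix needs the same bits in column order, e.g.
# binarize_and_pack(mat.T) -- the transposed, strided access pattern is what
# makes the input binarization comparatively expensive.
```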

But if you apply the binarized model produced by the model converter in CPU inference mode, the binarization of the weights has already been done by the converter. So at inference time we only have to binarize the input matrix (which is still time-consuming, but we made some optimizations based on OpenMP and caching), and we can take advantage of the binary convolution operation (XNOR + popcount).
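A minimal sketch of the XNOR + popcount idea on bit-packed sign vectors (illustrative only, not the optimized BMXNet kernel):

```python
def binary_dot(packed_a, packed_b, n_bits, word_bits=64):
    """XNOR + popcount dot product of two bit-packed sign vectors; equals the
    dot product of the corresponding {+1, -1} vectors."""
    mask = (1 << word_bits) - 1
    matches = 0
    for wa, wb in zip(packed_a, packed_b):
        matches += bin(~(int(wa) ^ int(wb)) & mask).count('1')  # bits where signs agree
    matches -= len(packed_a) * word_bits - n_bits  # discount zero-padding bits
    return 2 * matches - n_bits                    # agreements minus disagreements
```

Rows packed by the earlier sketch can be passed directly as packed_a / packed_b.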

One more thing: if you want to increase the inference speed further, check whether the device has a 64-bit CPU. If so, you can compile the BMXNet source with the option set(BINARY_WORD_TYPE "uint64" CACHE STRING "Selected BINARY_WORD_TYPE") (in code). With the 64-bit binary word setting, the CPU can process 64 bit-operations (XNOR, popcount) in a single execution step, which gives a huge speedup compared to the standard arithmetic operations on the CPU.
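As a back-of-the-envelope illustration of the effect of the binary word size (the vector length below is an arbitrary example):

```python
# Word operations needed for one binary dot product of length n
# (n = 4608 is just an example, e.g. a 3x3x512 convolution patch).
n = 4608
for word_bits in (32, 64):
    ops = (n + word_bits - 1) // word_bits
    print('%d-bit words -> %d xnor/popcount ops' % (word_bits, ops))
# 32-bit words -> 144 xnor/popcount ops
# 64-bit words -> 72 xnor/popcount ops
```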

Further in-depth discussion can be found in our papers: https://arxiv.org/abs/1705.09864 (more implementation details) and https://arxiv.org/abs/1809.10463 (practical lessons learned, which may offer some insights for architecture design).

mengwanguc commented 5 years ago

Hi @yanghaojin, thanks for the detailed reply. Your explanation makes sense.

Also, after applying the binarized model produced by the model converter in CPU inference mode, I do see a 1.5x speedup.

Good luck with your future research! Looking forward to seeing more progress on the BNN problem!

kaivu1999 commented 5 years ago

Hi @yanghaojin,

So the input binarization has to be done before every binarized layer? There are also normal layers in between the binarized layers; if all binarized layers were consecutive, would that make a difference? And is the output of a binarized layer binary or FP, i.e. what does the next layer receive as input?

Can someone point out this part in the implementation?