hpi-xnor / BMXNet

(New version is out: https://github.com/hpi-xnor/BMXNet-v2) BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet
Apache License 2.0

No speedup or even slower on CPU with 1-bit QConvolution and QActivation #56

Closed bohanzhuang closed 5 years ago

bohanzhuang commented 5 years ago

Recently I tried to run the "binary_mnist" example on CPU following the instructions, but I failed to observe any speedup: the binary LeNet runs at 680 fps versus 747 fps for the floating-point LeNet. I think the reason is that QConvolution and QActivation still perform floating-point operations even though weights and activations are 1-bit. How can I test with run-time XNOR operations? Moreover, after converting the output xxxx.params file to binary, the testing speed dropped further (345 fps), and I'm not sure what the exact problem is. My environment is Python 3.6.8, and my CPU is an Intel(R) Xeon(R) CPU E5-2630 v3 with 8 cores.
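For context, below is a minimal sketch of the kind of throughput measurement described above. The checkpoint prefix, epoch, and shapes are placeholders, not the actual files produced by the example; only standard MXNet Module APIs are used.

```python
import time
from collections import namedtuple

import mxnet as mx

# Placeholder checkpoint names; the binary_mnist example writes its own files.
PREFIX, EPOCH = "lenet-binary", 10
BATCH, SHAPE = 64, (64, 1, 28, 28)

sym, arg_params, aux_params = mx.model.load_checkpoint(PREFIX, EPOCH)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', SHAPE)])
mod.set_params(arg_params, aux_params, allow_missing=True)

Batch = namedtuple('Batch', ['data'])
batch = Batch([mx.nd.ones(SHAPE)])

# Warm up, then time forward passes and report images per second.
for _ in range(10):
    mod.forward(batch)
mx.nd.waitall()

iters = 200
start = time.time()
for _ in range(iters):
    mod.forward(batch)
    mod.get_outputs()[0].wait_to_read()  # force the async engine to finish
fps = iters * BATCH / (time.time() - start)
print("%.1f images/s" % fps)
```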

blueardour commented 5 years ago

Same problem here. I currently only want to test the inference speedup. Here is the test code.

```python
import time
from collections import namedtuple

import mxnet as mx  # BMXNet build, so mx.sym.QConvolution etc. are available

ctx = mx.cpu()
phases = ['k3s1', 'k3s2', 'k1s2', 'sign']

def get_sym(phase="k3s1", bitA=32, bitW=32):
    data = mx.sym.Variable(name='data')
    output = data
    if phase == 'k3s1':
        # Stack of 20 3x3/stride-1 quantized convolutions.
        for i in range(20):
            output = mx.sym.QConvolution_v1(output, num_filter=64, kernel=(3, 3),
                                            stride=(1, 1), pad=(1, 1), no_bias=True,
                                            name="k3s1%d" % i, act_bit=bitA,
                                            weight_bit=bitW,
                                            binarized_weights_only=True,
                                            cudnn_off=True)
    elif phase == 'k3s2':
        output = mx.sym.QConvolution(data, num_filter=64, kernel=(3, 3),
                                     stride=(2, 2), pad=(1, 1), no_bias=True,
                                     name="k3s2", act_bit=bitA, weight_bit=bitW)
    elif phase == 'k1s2':
        output = mx.sym.QConvolution(data, num_filter=64, kernel=(1, 1),
                                     stride=(2, 2), pad=(0, 0), no_bias=True,
                                     name="k1s2", act_bit=bitA, weight_bit=bitW)
    else:
        output = mx.sym.QActivation(data, name="sign")
    return output

image_shape = (10, 64, 224, 224)
iterations = 100
Batch = namedtuple('Batch', ['data'])

def test(model):
    start = time.time()
    for i in range(iterations):
        data = [mx.nd.ones(image_shape)]
        model.forward(Batch(data))
    end = time.time()
    print(end - start)

# FP32 baseline
sym = get_sym()
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', image_shape)],
         label_shapes=mod._label_shapes)
mod.init_params()
test(mod)

# 1-bit weights and activations
sym = get_sym(phase="k3s1", bitA=1, bitW=1)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', image_shape)],
         label_shapes=mod._label_shapes)
mod.init_params()
test(mod)
```

The binary version is slower than the FP32 version. I don't know whether some switch is left disabled. @ry @pluskid @darxriggs Could you please shed some light on this?

blueardour commented 5 years ago

After adding debug output in smd_hpi/src/q_convolution.cc, it seems MXNet only enters the XNOR code paths when the input size (N * C * W * H) is below a threshold. Otherwise, the functions in q_convolution-inl.h and q_convolution.cc are never triggered; see the probe sketch below.
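One way to locate that threshold empirically, without reading the dispatch code, is to time a single 1-bit QConvolution at growing spatial sizes and watch for a jump in per-iteration cost. This is only a sketch reusing the Module setup from the test code above; the crossover point is machine- and build-dependent.

```python
import time
from collections import namedtuple

import mxnet as mx

Batch = namedtuple('Batch', ['data'])

def time_qconv(shape, iters=50):
    """Time one 1-bit QConvolution forward pass for the given input shape."""
    data = mx.sym.Variable('data')
    sym = mx.sym.QConvolution(data, num_filter=64, kernel=(3, 3), stride=(1, 1),
                              pad=(1, 1), no_bias=True, act_bit=1, weight_bit=1,
                              name='probe')
    mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
    mod.bind(for_training=False, data_shapes=[('data', shape)])
    mod.init_params()
    batch = Batch([mx.nd.ones(shape)])
    mod.forward(batch)                      # warm-up
    mod.get_outputs()[0].wait_to_read()
    start = time.time()
    for _ in range(iters):
        mod.forward(batch)
        mod.get_outputs()[0].wait_to_read()
    return (time.time() - start) / iters

# Grow N*C*H*W and look for a discontinuity where the implementation switches.
for side in (14, 28, 56, 112, 224):
    shape = (1, 64, side, side)
    print(shape, "%.4f s/iter" % time_qconv(shape))
```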

yanghaojin commented 5 years ago

Please see my answer in this thread: https://github.com/hpi-xnor/BMXNet/issues/6

yanghaojin commented 5 years ago

Please check our new version, BMXNet v2: https://github.com/hpi-xnor/BMXNet-v2

BMXNet v2 supports Gluon for training. After training, you can convert your model to binary with model_converter and then use the XNOR forward pass for inference, roughly as sketched below.
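A rough sketch of that workflow, assuming the QConv2D/QActivation Gluon blocks that BMXNet v2 adds; the exact argument names (e.g. `bits`) are taken from the v2 examples and may differ in your checkout.

```python
import mxnet as mx
from mxnet.gluon import nn

# Binary blocks provided by a BMXNet v2 build of Gluon (assumed API).
net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Conv2D(channels=64, kernel_size=3, padding=1))  # first layer full-precision
    net.add(nn.BatchNorm())
    net.add(nn.QActivation(bits=1))                            # binarize activations
    net.add(nn.QConv2D(channels=64, kernel_size=3, padding=1, bits=1))
    net.add(nn.GlobalAvgPool2D())
    net.add(nn.Dense(10))
net.initialize(mx.init.Xavier())
net.hybridize()

# ... train with a standard Gluon loop, then export the symbol/params:
net.export("model", epoch=0)
# Running the exported files through BMXNet's model_converter produces the
# binarized model that uses the XNOR forward path at inference time.
```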