hpi-xnor / BMXNet-v2

BMXNet 2: An Open-Source Binary Neural Network Implementation Based on MXNet
Apache License 2.0

BMXNet transition and Gluon hybridization for inference #1

Closed simonmaurer closed 5 years ago

simonmaurer commented 5 years ago

@Jopyth In the FAQ of the new v2 repo you mention the transition to the Gluon API. Does that mean the underlying C/C++ implementation (i.e. the backend operators that are also used by the Python frontend) from BMXNet is not usable anymore? Say I have created a new model with Gluon (using HybridBlocks and the QConv2D layers, for example) and hybridized it to a Symbol: can we still do the inference with the Python API but not with C/C++? In BMXNet there was a script to convert these models (symbolic execution graph) into real binary models that can be loaded (using amalgamation.cc and/or the C++ package) for faster inference.

Jopyth commented 5 years ago

After we hybridize to a Symbol, we can also do the inference with C++, since we can export it to the usual symbol.json and .params files. We implemented the necessary functions in C++, but in a more modular way: instead of using one monolithic C++ QConvolution operator, we now use the normal convolution and apply the functions needed for binarization before/after the default convolution operator. This is also visible in the symbolic graph now. For example, it contains the det_sign functions as additional ops when directly exporting (you could quickly test this with the mnist example).
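For reference, a minimal sketch (not from the repository; plain Gluon layers stand in for the binary blocks) of how a hybridized model can be exported and how the resulting graph can be inspected for extra ops such as det_sign:

```python
import mxnet as mx
from mxnet import gluon, nd

# Stand-in network; with the binary blocks (e.g. QConv2D) the exported graph
# would additionally contain the binarization ops such as det_sign.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(16, kernel_size=3), gluon.nn.Dense(10))
net.initialize()
net.hybridize()
net(nd.zeros((1, 1, 28, 28)))   # one forward pass builds the cached graph
net.export("model", epoch=0)    # writes model-symbol.json and model-0000.params

# Inspect the exported graph for the ops it contains.
sym = mx.sym.load("model-symbol.json")
print([op for op in sym.get_internals().list_outputs() if "det_sign" in op])
```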

As for the conversion script, we are currently working on this, but it is not yet finished. It will remove unnecessary operators from the symbol.json and convert/compress the param file similar to the previous version.

Jopyth commented 5 years ago

Also, we are implementing a different custom operator, which allows for the fast inference again (but independent of the HybridBlocks used for training). This operator is going to replace our Gluon convolution blocks during conversion with our script.

simonmaurer commented 5 years ago

@Jopyth Ok, thanks. In other words, for fast inference (that is, the custom implementation of GEMM kernels as found in https://github.com/hpi-xnor/BMXNet/tree/master/smd_hpi/src) you are still in the process of rewriting that part? For now the binary weights are still treated and saved as float32 throughout the Gluon code, and the code for approximated multiplications (using XNOR and bitcount operations) is yet to be reimplemented from BMXNet v1 - is that what your comment

> We do not yet support deployment and inference with binary operations and models (please use the first version of BMXNet instead if you need this).

in the README refers to?

Jopyth commented 5 years ago

@simonmaurer That is correct.

simonmaurer commented 5 years ago

@Jopyth Overall great job and findings in your paper. I am really interested in your work/BMXNet v1, and for realtime applications I'd like to dig into binarized networks and timing analysis (which is why I'm so eager to be able to run it in C++ including faster inference ;) ). Any news regarding the conversion script? Also, could you elaborate a bit on what is actually happening during the conversion? I still don't quite get why you need to convert the symbol.json and param file when you have already implemented the underlying C/C++ operators (or is the C++ API using different operators? That might be the reason why even vanilla MXNet 1.4.0 still doesn't support reduced precision, i.e. float16, in the C++ API). Maybe because you created custom operators, but only in Python?

Jopyth commented 5 years ago

@simonmaurer Sorry for the long wait on the reply: conversion and execution with the C++ API now work for our tested models, but we still have a little bit of cleaning up to do regarding building and CI. Good news is we also upgraded the underlying MXNet to 1.4.0, and we should be able to make the release this or next week.

Jopyth commented 5 years ago

Basically we need the conversion script for two reasons. The first one is the same as in the first BMXNet: we need to compress the binary weights with bit-packing. The second one is the one you mentioned: we use different operators for training with Python and inference with C++. Previously we had the functionality for training and inference (sped up on CPU) in the same layer and chose which version to execute based on the inference setting and device. Now we have split up training and inference: training is done with multiple layers (in Gluon/Hybrid mode), but during inference we only use one layer, our (sped-up) custom convolution.
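To illustrate the bit-packing part, here is a minimal NumPy sketch (not the converter's actual code; the real tool packs the weights inside the exported param file):

```python
import numpy as np

def pack_binary_weights(w):
    # Pack float32 weights with values in {-1, +1} into a uint8 bit array:
    # 1 bit per weight instead of 32, i.e. a 32x smaller parameter blob.
    bits = (w.flatten() > 0).astype(np.uint8)   # +1 -> 1, -1 -> 0
    return np.packbits(bits)

def unpack_binary_weights(packed, shape):
    # Restore the float32 {-1, +1} tensor from the packed bits.
    bits = np.unpackbits(packed)[: np.prod(shape)]
    return (bits.astype(np.float32) * 2 - 1).reshape(shape)

w = np.where(np.random.randn(64, 128, 3, 3) >= 0, 1.0, -1.0).astype(np.float32)
packed = pack_binary_weights(w)
assert np.array_equal(unpack_binary_weights(packed, w.shape), w)
```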

simonmaurer commented 5 years ago

@Jopyth Thanks a lot for pointing that out. Looking forward to this useful addition and the upgrade to 1.4.0 - very nice! Also, there's an interesting discussion regarding the C vs C++ API in the official MXNet GitHub repo. The C++ API is just a frontend implementation, like Python, but according to the discussion it's missing some modules to make use of the fast float16 inference, see https://github.com/apache/incubator-mxnet/issues/14159#issuecomment-483883108. So <mxnet/c_predict_api.h> refers to the C API that is able to do the fast inference, whereas this is not yet true for the C++ API <mxnet-cpp/MxNetCpp.h>.

Jopyth commented 5 years ago

@simonmaurer Just letting you know that BMXNet with our converter is now available. If you want to use it, please look at the Example/Test, especially the dummy forward pass before training (otherwise the model needs additional changes, namely retraining the BatchNorm layers).
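For illustration, a minimal sketch of such a dummy forward pass (plain Gluon layers as stand-ins, not the repository's exact example code):

```python
from mxnet import gluon, nd

net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(32, kernel_size=3),
        gluon.nn.BatchNorm(),
        gluon.nn.Dense(10))
net.initialize()
net.hybridize()

# Dummy forward pass before any training: this creates all parameters and the
# cached graph, so the trained model can later be exported and converted
# without further changes (such as retraining the BatchNorm layers).
net(nd.zeros((1, 3, 32, 32)))
```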

simonmaurer commented 5 years ago

@Jopyth That is great! Also noteworthy that you keep things updated (i.e. MXNet 1.4.1) - very much appreciated.

Closing questions I still have:

1. When you build your models, why does QActivation come before QConvolution? Is it a special case that you use `**qconv_kwargs` in QConv2D - maybe for debugging purposes, as used in the code?
2. You mentioned the Example/Test: do we just convert the model by using subprocess inside Python code (model conversion is done transparently with export when using QActivation/QConv2D/QDense -> `output = subprocess.check_output(["build/tools/binary_converter/model-converter", param_file])`), or do we use the binary converter as a standalone tool?
3. How do you handle your input matrices/images (Python AND C++)? Keeping them as NDArray uint8 from OpenCV (or equivalent) or converting to float32/float16?
4. Is the fast inference (backend operators with fast GEMM) also used when we deploy hybridized models with Python, or only if we use a model output by the new converter?
5. We never talked about this: a hint on how one can correctly load the converted model in C/C++, i.e. which API to use for fast inference?

Jopyth commented 5 years ago
  1. QActivation makes the input binary and always needs to come before a QConvolution (unless the input is already binary for some reason). Since they belong together so closely, we also added a BinaryConvolution block and, for easier parameterization (e.g. clip_threshold, scaling methods, ...), added activated_conf, which uses a previously stored configuration to create such BinaryConvolution blocks. qconv_kwargs is just for testing different configurations of the binary convolution (with and without padding).
  2. As you like; so far we mostly use it as a standalone tool. I only added it for the test case (basically all lines after 62 are just for testing purposes).
  3. We have not yet implemented a complete example with C++ for this new version, but conversion to float32 would be the way to go.
  4. The model converter currently needs to be used to get the faster inference (note: it replaces the layers used for training with those optimized for inference and also compresses and transforms the weights). However, you can load the deployment model in Python with a SymbolBlock (this is basically done in the test case; see the sketch after this list).
  5. Basically the default way to do C++ inference in MXNet should still apply to our framework, except of course you need to load the converted binarized model (not yet tested - if you encounter problems, please create issues as needed).
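
A minimal sketch of the conversion-and-loading flow in Python, based on the paths quoted in this thread (the converter path, the output file names, and the input name "data" are assumptions, not verified against the repository):

```python
import subprocess
import mxnet as mx
from mxnet import gluon

# Run the standalone converter on the exported params file (path as quoted
# above; adjust to your build directory). It rewrites the symbol/params pair
# into the binarized deployment model.
param_file = "model-0000.params"
subprocess.check_output(["build/tools/binary_converter/model-converter", param_file])

# Load the converted deployment model with a SymbolBlock for inference in
# Python (file names after conversion are assumed unchanged here).
deploy_net = gluon.nn.SymbolBlock.imports(
    "model-symbol.json", ["data"], "model-0000.params", ctx=mx.cpu())
out = deploy_net(mx.nd.zeros((1, 3, 32, 32)))
```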
simonmaurer commented 5 years ago

Alright, pretty enlightening! 1) Thanks for pointing it out; I am pretty used to introducing non-linearities after linear combinations. Does that also mean that if I have multiple QConvolutions, I actually wouldn't need an activation layer in front anymore, because the output of the preceding layer (say QConv2D and QDense) is already binarized? 4) So you tested the converted model with faster inference in Python, I guess? I will gladly provide you with information regarding C inference; not sure yet if the C++ API (which is also only a wrapper) will work.