After we hybridize to a Symbol, we can also do the inference with C++, since we can export it to the usual symbol.json and .params files. We implemented the necessary functions in C++, but in a more modular way: instead of using a whole C++ QConvolution operator, we now use the normal convolution and perform the steps needed for binarization before/after the default convolution operator. This is also visible in the symbolic graph now; for example, it contains the det_sign functions as additional ops when exporting directly (you could quickly test this with the MNIST example).
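(As a quick illustration of that export step, here is a minimal Gluon sketch; it uses plain layers as placeholders, since I'm not asserting the exact import path of the binary layers here:)

```python
import json
from mxnet import gluon, nd

# Minimal sketch of the export path described above (plain Gluon layers here; a real
# BMXNet v2 model would use the binary layers such as QActivation/QConv2D instead).
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(channels=16, kernel_size=3))
net.add(gluon.nn.Dense(10))
net.initialize()
net.hybridize()

net(nd.zeros((1, 1, 28, 28)))         # one forward pass so the cached graph exists
net.export("binary_mnist", epoch=0)   # writes binary_mnist-symbol.json / binary_mnist-0000.params

# With binary layers, the exported graph would still contain the extra binarization ops
# (e.g. det_sign) as separate nodes until the conversion script removes them:
with open("binary_mnist-symbol.json") as f:
    ops = sorted({node["op"] for node in json.load(f)["nodes"]})
print(ops)
```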
As for the conversion script, we are currently working on it, but it is not yet finished. It will remove unnecessary operators from the symbol.json and convert/compress the .params file, similar to the previous version.
Also, we are implementing a different custom operator, which allows for fast inference again (but is independent of the HybridBlocks used for training). This operator is going to replace our Gluon convolution blocks during the conversion.
@Jopyth ok, thanks. In other words, for fast inference (that is, the custom implementation of GEMM kernels as found in https://github.com/hpi-xnor/BMXNet/tree/master/smd_hpi/src) you are still in the process of rewriting that part? For now the binary weights are still treated and saved as float32 throughout the Gluon code, and the code for approximated multiplications (using XNOR and bitcount operations) has yet to be reimplemented from BMXNet v1. Is that what your comment

"We do not yet support deployment and inference with binary operations and models (please use the first version of BMXNet instead if you need this)."

in the README refers to?
@simonmaurer That is correct.
@Jopyth Overall, great job and findings in your paper. I am really interested in your work/BMXNet v1, and for real-time applications I'd like to dig into binarized networks and timing analysis (which is why I'm so eager to be able to run it in C++, including faster inference ;) ). Any news regarding the conversion script? Also, could you elaborate a bit on what actually happens during conversion? I still don't quite get why you need to convert the symbol.json and params file when you have already implemented the underlying C/C++ operators. Or is the C++ API using different operators? That might be the reason why even vanilla MXNet 1.4.0 still doesn't support reduced precision, i.e. float16, in the C++ API. Or is it because you created custom operators, but only in Python?
@simonmaurer Sorry for the long wait on the reply: the conversion and execution with the C++ API now work for our tested models, but we still have a bit of cleaning up to do regarding building and CI. Good news: we also upgraded the underlying MXNet to 1.4.0, and we should be able to make the release this or next week.
Basically we need the conversion script for two reasons. The first is the same as in the first BMXNet: we need to compress the binary weights with bit-packing. The second is the one you mentioned: we use different operators for training with Python and for inference with C++. Previously we had the functionality for training and inference (sped up on CPU) in the same layer and chose which version to execute based on the inference setting and device. Now we have split up training and inference: training is done with multiple layers (in Gluon/Hybrid mode), but during inference we only use our single (sped-up) custom convolution layer.
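(To make the bit-packing argument concrete, here is a rough NumPy illustration of the idea; this is my own sketch, not BMXNet's converter, which does the packing in C++ and may use a different bit order or word size:)

```python
import numpy as np

# Rough illustration of bit-packing binary weights: the signs of 8 float32 weights
# fit into a single byte, i.e. roughly a 32x reduction in parameter size compared
# to storing them as float32.
def pack_binary_weights(w_float32):
    """Pack the signs of float32 weights into a uint8 bitfield (1 bit per weight)."""
    bits = (w_float32.ravel() >= 0).astype(np.uint8)   # sign -> {0, 1}
    return np.packbits(bits)                            # 8 weights per byte

def unpack_binary_weights(packed, shape):
    """Recover {-1, +1} float32 weights from the packed bitfield."""
    bits = np.unpackbits(packed)[: int(np.prod(shape))]
    return (bits.astype(np.float32) * 2.0 - 1.0).reshape(shape)

# Fake binary weights, stored as float32 the way they come out of Gluon training.
w = np.where(np.random.randn(64, 3, 3, 3) >= 0, 1.0, -1.0).astype(np.float32)
packed = pack_binary_weights(w)
print(w.nbytes, "->", packed.nbytes, "bytes")           # 6912 -> 216
assert np.array_equal(w, unpack_binary_weights(packed, w.shape))
```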
@Jopyth Thanks a lot for pointing that out. Looking forward to this useful addition and the upgrade to 1.4.0 - very nice!
Also, there's an interesting discussion regarding the C vs C++ API in the official MXNet GitHub repo. The C++ API is just a frontend implementation, like Python, but according to the discussion it's missing some modules needed to make use of fast float16 inference, see https://github.com/apache/incubator-mxnet/issues/14159#issuecomment-483883108. So <mxnet/c_predict_api.h> refers to the C API that is able to do the fast inference, whereas this is not yet true for the C++ API <mxnet-cpp/MxNetCpp.h>.
@simonmaurer Just letting you know that BMXNet with our converter is now available. If you want to use it, please look at the Example/Test, especially the dummy forward pass before training (otherwise the model needs additional changes, namely retraining the BatchNorm layers).
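(Condensing that workflow into a hedged sketch: the model below is a plain Gluon placeholder, and the converter path is the one quoted later in this thread; none of this is claimed to match the example verbatim:)

```python
import subprocess
from mxnet import gluon, nd

# Hedged sketch of the workflow: a real BMXNet v2 model would use the binary layers
# (QActivation/QConv2D/QDense) instead of these plain Gluon placeholders.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(64, activation='relu'))
net.add(gluon.nn.Dense(10))
net.initialize()
net.hybridize()

# Dummy forward pass *before* training, as recommended above; without it the model
# would need additional changes later (retraining the BatchNorm layers).
net(nd.zeros((1, 28 * 28)))

# ... normal Gluon training loop goes here ...

# Export the trained model, then compress/convert it with the standalone converter tool.
net.export("model", epoch=0)   # writes model-symbol.json and model-0000.params
subprocess.check_output(["build/tools/binary_converter/model-converter", "model-0000.params"])
```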
@Jopyth That is great! Also noteworthy that you keep things updated (i.e. MXNet 1.4.1) - very appreciated.
Closing questions I still have:
1) When you build your models, why does the QActivation come before the QConvolution? Is it a special case that you use **qconv_kwargs in QConv2D, maybe for debugging purposes as used in the code?
2) You mentioned the Example/Test: do we just convert the model by using subprocess inside Python code (model conversion done transparently after export when using QActivation/QConv2D/QDense), i.e.
output = subprocess.check_output(["build/tools/binary_converter/model-converter", param_file])
or do we use the binary converter as a standalone tool?
3) How do you handle your input matrices/images (in Python AND C++)? Do you keep them as uint8 NDArrays from OpenCV (or equivalent), or convert them to float32/float16...?
4) Is the fast inference (backend operators with fast GEMM) also used when we deploy hybridized models with Python, or only if we use a model produced by the new converter?
5) We never talked about this: a hint on how one can correctly load the converted model in C/C++, i.e. which API to use for fast inference?
[...] BinaryConvolution block, and for easier parameterization (e.g. clip_threshold, scaling methods, ...) we added activated_conf, which uses a previously stored configuration to create such BinaryConvolution blocks. qconv_kwargs is just for testing different configurations of the binary convolution (with and without padding).

@Jopyth Alright, pretty enlightening! 1) Thanks for pointing it out; I'm pretty used to introducing non-linearities after linear combinations. Does that also mean that if I have multiple QConvolutions I actually wouldn't need an activation layer in front anymore, because the output of the preceding layer (say QConv2D or QDense) is already binarized? 4) So you tested the converted model with faster inference in Python, I guess? I will gladly provide you with information regarding C inference; not sure yet if the C++ API (which is also only a wrapper) will work.
@Jopyth In the FAQ of the new repo (v2) you mention the transition to the Gluon API. Does that mean the underlying C/C++ implementation (i.e. the backend operators that are also used by the Python frontend) from BMXNet is not usable anymore? Say I have created a new model with Gluon (using HybridBlocks and the QConv2D layers, for example) and hybridized it to a Symbol: can we still do the inference with the Python API but not with C/C++? In BMXNet there was a script to convert these models (symbolic execution graph) to real binary models that can be loaded (using amalgamation.cc and/or the C++ package) for faster inference.