tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Binary ops #1592

Closed: bhack closed this issue 6 years ago

bhack commented 8 years ago

Is there already a plan to add binary ops like bitcount for XNOR-NET?

vrv commented 8 years ago

I think this would be awesome to have, and contributions are definitely welcome here :).

bhack commented 8 years ago

@vrv Probably we need an XnorGemm in Eigen first. /cc @benoitsteiner What do you think?

vrv commented 8 years ago

Or we could implement them as individual OpKernels if it is too difficult to get them into Eigen.

bhack commented 8 years ago

This is a reference XnorGemm (with a custom kernel) under BSD related to a previous paper.

Edit: The kernel is here

bhack commented 8 years ago

@scott-gray is working on an improved version with the upstream author. Scott, will you release the code under BSD or Apache? The Eigen library is currently under BSD and TF under Apache, but TF contributions require a CLA signature.

bhack commented 8 years ago

/cc @mrastegari if interested

bhack commented 8 years ago

@benoitsteiner Do you think these operations could be added in Eigen first?

benoitsteiner commented 8 years ago

@bhack we have a set of Eigen extensions to better support quantized operations on tensors in https://github.com/tensorflow/tensorflow/tree/master/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint. It's definitely possible to use the same approach to package an XnorGemm operation. I can also talk to the other maintainers of Eigen to check whether it makes sense to add the code into core Eigen and make it more widely available.

bhack commented 8 years ago

@benoitsteiner Yes, it could be useful if you can collect some upstream opinions.

bhack commented 8 years ago

8-bit quantization is available now. See the merged PR: https://github.com/tensorflow/tensorflow/pull/2230

kofd commented 8 years ago

Has there been any progress on this? It would be useful for some embedded applications where an NVIDIA GPU isn't an option.

bhack commented 8 years ago

@kofd You can also start to read https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/quantization/index.md

bhack commented 8 years ago

/cc @petewarden

petewarden commented 8 years ago

I have been looking at 'popcount' (as bitcount is often known) for binary networks, since that seems to be the trickiest part to map to processor instructions. There is some BSD-licensed work here: https://github.com/WojciechMula/sse-popcount Interestingly, the x86 CPU instruction seems to be competitive with the SSE implementations. It looks like ARM requires a multi-instruction macro though: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0081b/CHDJJGAJ.html
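
For reference, a minimal sketch of what that multi-instruction ARM sequence looks like with NEON intrinsics, assuming a 64-bit input word (popcount64_neon is a made-up helper name, not from any of the linked repos):

  #include <arm_neon.h>
  #include <cstdint>

  // Popcount of a 64-bit word on ARM: NEON has no single full-register popcount,
  // so count bits per byte (vcnt) and then sum the bytes with pairwise widening adds.
  static inline uint32_t popcount64_neon(uint64_t x) {
      uint8x8_t  per_byte = vcnt_u8(vcreate_u8(x));   // 8 per-byte bit counts
      uint16x4_t sum16    = vpaddl_u8(per_byte);      // pairwise add to 16-bit lanes
      uint32x2_t sum32    = vpaddl_u16(sum16);        // pairwise add to 32-bit lanes
      uint64x1_t sum64    = vpaddl_u32(sum32);        // pairwise add to a 64-bit lane
      return (uint32_t)vget_lane_u64(sum64, 0);
  }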

bhack commented 8 years ago

@petewarden There are also the GCC built-ins and the LLVM intrinsic. How many compilers does TF want to support?

kofd commented 8 years ago

@bhack: I was talking about the bit-count convolutions used in XNOR-Net.

bhack commented 8 years ago

@petewarden There is a simple test of the built-ins with GCC and MSVC at https://github.com/hanji/popcnt/blob/master/populationcount.cpp I think it is easy to also add the LLVM intrinsic for popcount.
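
For illustration, a minimal sketch of a portable wrapper over those built-ins (popcount32 is a hypothetical helper name, not taken from the linked test):

  #include <cstdint>
  #if defined(_MSC_VER)
  #include <intrin.h>
  #endif

  // Portable 32-bit population count: use the compiler built-in/intrinsic where
  // available, otherwise fall back to a plain bit-twiddling (SWAR) version.
  static inline uint32_t popcount32(uint32_t x) {
  #if defined(__GNUC__) || defined(__clang__)
      return (uint32_t)__builtin_popcount(x);   // GCC/Clang built-in
  #elif defined(_MSC_VER)
      return __popcnt(x);                       // MSVC intrinsic (needs POPCNT support)
  #else
      x = x - ((x >> 1) & 0x55555555u);                   // 2-bit partial sums
      x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   // 4-bit partial sums
      x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   // 8-bit partial sums
      return (x * 0x01010101u) >> 24;                     // add the four bytes
  #endif
  }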

zhengwy888 commented 8 years ago

I have implemented a primitive op on CPU for the XOR + bitcount, but it's too slow right now. Does anyone know how to speed this up? If this ever gets to the same speed as tf.matmul, then I will provide a patch. Note this is not for convolution; it simply replaces matmul() with an XNOR + bit count.

  // Pack the sign bits of each column of `array` into 32-bit masks:
  // bit i of out(c, r) is 1 iff array(r*32 + i, c) >= 0.
  void concatenate_col(
          typename MatMulTypes<T>::in_type array,
          MaskMatrix &out)
  {
      int rowSize = int((array.dimension(0) + 31) / 32);
      out.resize(array.dimension(1), rowSize);

      for (int c = 0; c < array.dimension(1); ++c)
      {
          for (int r = 0; r < rowSize; ++r)
          {
              uint32_t rvalue = 0;
              uint32_t sign;
              for (int i = 0; i < 32; ++i) {
                  int rowIdx = r * 32 + i;
                  if (rowIdx > array.dimension(0) - 1) {
                      break;  // Past the last row: leave the remaining bits as 0.
                  }
                  sign = (array(rowIdx, c) >= 0);
                  rvalue = rvalue | (sign << i);
              }
              out(c, r) = rvalue;
          }
      }
  }

  // Pack the sign bits of each row of `array` into 32-bit masks:
  // bit i of out(r, c) is 1 iff array(r, c*32 + i) >= 0.
  void concatenate_row(
          typename MatMulTypes<T>::in_type array,
          MaskMatrix &out)
  {
      int colSize = int((array.dimension(1) + 31) / 32);
      out.resize(array.dimension(0), colSize);
      for (int r = 0; r < array.dimension(0); ++r)
      {
          for (int c = 0; c < colSize; ++c)
          {
              uint32_t rvalue = 0;
              uint32_t sign;
              for (int i = 0; i < 32; ++i) {
                  int colIdx = c * 32 + i;
                  if (colIdx > array.dimension(1) - 1) {
                      break;  // Past the last column: leave the remaining bits as 0.
                  }
                  sign = (array(r, colIdx) >= 0);
                  rvalue = rvalue | (sign << i);
              }
              out(r, c) = rvalue;
          }
      }
  }

  // Binarize both operands and compute the matrix product via XOR + popcount:
  // for +/-1 rows/columns of length N, dot = N - 2 * popcount(xor of the masks).
  void concatenate_and_compute(
          const CPUDevice &d,
          typename MatMulTypes<T>::in_type a,
          typename MatMulTypes<T>::in_type b,
          typename MatMulTypes<T>::out_type out)
  {
      MaskMatrix a_;
      MaskMatrix b_;

      concatenate_row(a, a_);
      concatenate_col(b, b_);

      for (int ar = 0; ar < a_.rows(); ar++)
      {
          for (int br = 0; br < b_.rows(); br++) {
              unsigned int Cvalue = 0;
              for (int c = 0; c < a_.cols(); c++)
              {
                  // Number of positions where the two sign masks differ.
                  unsigned int value = popcnt(a_(ar, c) ^ b_(br, c));
                  Cvalue += value;
              }
              // dot = N - 2 * (#differing signs), with N = a.dimension(1).
              out(ar, br) = -(2 * (float)Cvalue - a.dimension(1));
          }
      }
  }
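
As a sanity check of the XOR + popcount trick used above, here is a small standalone sketch (standard C++ only; all names are made up for illustration) showing that for +/-1 vectors of length N, the dot product equals N - 2 * popcount(xor of the packed sign masks):

  #include <bitset>
  #include <cstdint>
  #include <cstdio>

  // Pack the signs of 32 floats into a 32-bit mask (bit i set when x[i] >= 0).
  static uint32_t pack_signs(const float* x) {
      uint32_t m = 0;
      for (int i = 0; i < 32; ++i) m |= uint32_t(x[i] >= 0.0f) << i;
      return m;
  }

  int main() {
      float a[32], b[32];
      for (int i = 0; i < 32; ++i) { a[i] = float(i % 3) - 1.0f; b[i] = float(i % 5) - 2.0f; }

      // Reference: dot product of the sign (+1/-1) vectors.
      int ref = 0;
      for (int i = 0; i < 32; ++i)
          ref += (a[i] >= 0.0f ? 1 : -1) * (b[i] >= 0.0f ? 1 : -1);

      // XNOR form: N - 2 * popcount(xor) over the packed masks.
      uint32_t x = pack_signs(a) ^ pack_signs(b);
      int xnor = 32 - 2 * int(std::bitset<32>(x).count());

      std::printf("reference=%d xnor=%d\n", ref, xnor);   // the two values match
      return 0;
  }
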
ppwwyyxx commented 8 years ago

From my experience, the best approach for popcnt on AVX2 is this one: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp, but that code needs a small fix for a counter overflow. The XOR also needs to be done in AVX2. For TF, I guess there needs to be a generic non-AVX2 fallback as well. There are some references in that repo too.
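
This is not the lookup kernel referenced above, but as a simpler sketch of combining an AVX2 XOR with the scalar POPCNT instruction (hypothetical helper; compile with -mavx2 -mpopcnt or /arch:AVX2):

  #include <immintrin.h>
  #include <cstdint>

  // XOR two 256-bit blocks with AVX2 and popcount the result by running the
  // scalar POPCNT instruction over the four 64-bit lanes.
  static inline uint64_t xor_popcount_256(__m256i a, __m256i b) {
      const __m256i x = _mm256_xor_si256(a, b);
      return _mm_popcnt_u64((uint64_t)_mm256_extract_epi64(x, 0)) +
             _mm_popcnt_u64((uint64_t)_mm256_extract_epi64(x, 1)) +
             _mm_popcnt_u64((uint64_t)_mm256_extract_epi64(x, 2)) +
             _mm_popcnt_u64((uint64_t)_mm256_extract_epi64(x, 3));
  }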

bhack commented 8 years ago

@ppwwyyxx Have you benchmarked against recent GCC, MSVC, and LLVM/Clang intrinsics?

ppwwyyxx commented 8 years ago

@bhack I don't think there will be a big difference between the different intrinsics. They all end up as AVX2 instructions anyway.

bhack commented 8 years ago

Are you sure that every compiler's generated code uses AVX2? I think the built-ins are also supported on ARM/NEON.

ppwwyyxx commented 8 years ago

Oh right, if you are talking about compatibility then compiler builtins may be a good choice. But I have never tried them.

isabel-schwende commented 8 years ago

Has there been any movement on this issue? I'm very interested in seeing how binary networks can be trained using TensorFlow. I have studied the work of Courbariaux and played a bit with his implementations (specifically BinaryConnect), but my final goal is to have XNOR-Net running in TensorFlow.

lorenlugosch commented 8 years ago

would also be interested in this!

zhengwy888 commented 8 years ago

Binarization can be done with tf.sign(), though the tricky part is getting the gradient backprop to work after binarizing the input. For now this requires a separate op, which I implemented in TensorFlow: https://github.com/zhengwy888/binary_ops. With this code you can implement your own XNOR-Net on GPU. Comments/suggestions welcome.

bhack commented 8 years ago

For who is interested there is also https://arxiv.org/abs/1606.06160

ppwwyyxx commented 8 years ago

Thanks @bhack for the mention. We have a DoReFa-Net training implementation available at dorefa.net, which doesn't make use of any custom C++ op. Since DoReFa-Net is a generalization of XNOR-Net, XNOR-Net can be built in TF in a similar manner (without binary op acceleration). I'm also releasing a trainable DoReFa(1,2,6)-AlexNet later today.

bhack commented 8 years ago

@ppwwyyxx Nice! /cc @wangyida

isabel-schwende commented 8 years ago

Thanks for sharing, everyone. But as I understand the DoReFa implementation so far, it still uses standard 32-bit tensors. I've been thinking about ways to use the official TensorFlow quantization methods to reduce the memory to at least a quarter, since the 8-bit datatype is available. Of course the question is how to use it with the least hassle.

wangyida commented 8 years ago

From my point of view, the released DoReFa pipeline is still float, and the quantization module in TF could give an 8-bit representation with little performance drop, but that isn't the aim of DoReFa, which is already a quantized model, just held in a float representation in the code.

isabel-schwende commented 8 years ago

@wangyida Yes, I agree that the intention of the DoReFa authors was mostly to reduce training and inference time by using low bitwidths, and I don't expect them to release an 8-bit version. However, they also mention in their paper the idea of using this kind of network on embedded devices. If you want to use AlexNet on a very small device, I don't see the reason to use 32-bit floats if there is no information held in them anyway. To me, quantization to a smaller datatype is just the next logical step.

ppwwyyxx commented 8 years ago

@isabel-schwende Yes, the released implementation uses float32 to hold the low-bitwidth numbers, because there are simply no low-bitwidth operations available in TF. And we never planned to build such operations into TF, because we already have our own low-bitwidth run-time implementation working smoothly on ARM.
The released model is similar: it uses tf.float32 to hold all the binary weights as well as to run all the computation. But anyone who would like to implement those binary operations can use our pretrained model directly and gain a speedup.

isabel-schwende commented 8 years ago

@ppwwyyxx Thank you for your clarification, but I think I have to disagree at this point. Yes, there is no datatype in TensorFlow for 1, 2, or 6 bits, but there is the tf.quint8 datatype for tensors; @petewarden and his team introduced it in their tutorial here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/quantization/index.md Sure, the method of quantization is different from what DoReFa does, as they keep minimums and maximums as floats. I've played around with the tutorial and also used a customised AlexNet saved in a protobuf file to quantize down to 8 bits, and I observed that the protobuf file of the quantized network was indeed much smaller than the 32-bit float original. For now, the way low-bitwidth weights/activations are used is not compatible, so I was wondering whether there is a way to, let's say, create a customized version of the existing quantization tool to also reduce the memory of the DoReFa AlexNet model for inference tasks on small devices. But I guess that would still be too much work at the moment.

ppwwyyxx commented 8 years ago

I totally agree; the code we released was never intended for running on mobiles, but for showing how to train such networks in TensorFlow, as supplementary material and a proof of the paper. After all, I've never seen a public release of binary-weight ImageNet models before.

I can certainly compress our models to about 30x smaller, because they are essentially binary, and maybe use tf.quint8 for computation. But that doesn't make sense to me: I think anyone who really wants 1- or 2-bit levels of performance and compression should use a much more compact and tiny run-time toolchain, as we did, instead of using TensorFlow. If the speed and storage of 8-bit models are good enough for the use case, then I would certainly suggest trying the TF quantization tutorial instead of DoReFa-Net, because 8-bit models will have better accuracy.

isabel-schwende commented 8 years ago

@ppwwyyxx Thanks a lot for your answer. I definitely agree; for that reason I started with the quantization available in TensorFlow. It seems that using it for inference is currently very slow. As several others have described in https://github.com/tensorflow/tensorflow/issues/2807, the memory for saving the model may be much lower, but inference time is approximately doubled, which is undesirable in our project. I've been searching for something that is fast and has low memory requirements, but it seems the available code is still in the research phase. I guess we will do some hacking to get something customised for our project.

petewarden commented 8 years ago

The current slow speed of quantization is due to reference implementations for some of the ops, which we're actively working on optimizing. The goal is that quantization should be faster than float.

AaronYKing commented 8 years ago

Hi all, has XNOR-Net been implemented in code anywhere? And could it be implemented in Caffe?

bhack commented 8 years ago

See also https://github.com/NervanaSystems/neon/commit/caf0aaaaa1438b09c905e0780ba1120c6fd25f1c

AaronYKing commented 8 years ago

@bhack That is BinaryNet, not XNOR-Net! Anyway, thank you for the info.

isabel-schwende commented 8 years ago

@AaronYKing I guess certain companies might have their own working versions of XNOR-Net, but as far as I know there is no complete public implementation so far. It's too bad that the original authors didn't release their code for XNOR-Net for the sake of reproducibility. If I remember correctly, the authors of DoReFa-Net tried to reproduce the results of the original paper but failed to obtain comparable numbers. As the description in the paper is a bit thin at times, it might take a while for a really comparable, complete implementation of XNOR-Net to be released (whether in TensorFlow or Caffe).

mrastegari commented 8 years ago

We have released our Torch code and trained models for XNOR-Net. This is not the fast implementation.

mrastegari commented 8 years ago

Here is the link: https://github.com/mrastegari/XNOR-Net

AaronYKing commented 8 years ago

@isabel-schwende Thank you very much for your patient reply. Now the authors have released the Torch code.

AaronYKing commented 8 years ago

@mrastegari Thank you for your contribution. I wonder whether it could be implemented in Caffe, and whether anybody will implement it in the near future.

isabel-schwende commented 8 years ago

@mrastegari Thanks a lot for sharing this information with us. I think this is going to be really helpful. Do you have any numbers on how much slower this Torch version is compared to the original implementation using Darknet? @AaronYKing I'm always glad when I'm positively surprised by code being shared publicly.

bhack commented 8 years ago

Is this XNOR GEMM kernel useful: https://github.com/NervanaSystems/neon/blob/master/neon/backends/kernels/cuda/binary.py?

ppwwyyxx commented 8 years ago

@bhack This kernel works, but it is only about 3.4x faster than the best fp32 kernels (cuBLAS). It's mentioned in the BNN paper and in https://github.com/MatthieuCourbariaux/BinaryNet/pull/1.

ghost commented 8 years ago

Just curious, is anyone seriously pursuing this? I've been working on a fast C++ implementation of XNOR-Nets here at AI2 with @mrastegari and others, for Intel and ARM CPUs. We've achieved a modicum of success, with much headroom still untapped. We're kicking around the idea of producing a fast reference CPU implementation, so it'd be good to know if someone else is already close to releasing one.

csyking commented 8 years ago

@dmitryb-ai2 @mrastegari It would be great progress if XNOR-Nets could be implemented in fast C++ for Intel and ARM CPUs, especially achieving the 58× faster convolution operations and 32× memory savings mentioned in the paper. Sadly, I can't code it myself, so I'm looking forward to your good news!