hpi-xnor / BMXNet

(New version is out: https://github.com/hpi-xnor/BMXNet-v2) BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet

incorrect application of tanh-based weight quantization #36

Closed analog-cbarber closed 5 years ago

analog-cbarber commented 6 years ago

For multi-bit weight quantization, you have implemented the tanh-based squashing function described in the DoReFa-Net paper. However, instead of incorporating its derivative into the weight updates, you simply apply the squashing and quantization to the weights in place and ignore the derivative of the squashing operation entirely.
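
To make the missing factor concrete, here is a rough NumPy sketch (my own notation, not code from either project) of the forward pass and of the weight gradient when only the rounding is treated as a straight-through estimator:

    import numpy as np

    def fw_forward(w, k):
        """DoReFa-style multi-bit weight quantization (forward pass)."""
        n = 2 ** k - 1
        t = np.tanh(w)
        m = np.max(np.abs(t))                 # scale so the largest weight maps to +/-1
        x = t / m * 0.5 + 0.5                 # squash into [0, 1]
        return 2 * np.round(x * n) / n - 1    # quantize, map back to [-1, 1]

    def fw_backward(w, grad_out):
        """Gradient w.r.t. the real-valued weights: the rounding is treated as
        identity (straight-through estimator), but the tanh squashing and the
        1/m scaling stay in the chain rule.  (The dependence of m itself on w
        is ignored here for simplicity.)"""
        t = np.tanh(w)
        m = np.max(np.abs(t))
        return grad_out * (1.0 - t ** 2) / m  # d(out)/dw = 2 * 0.5 * (1 - tanh(w)^2) / m

    # Applying the squashing to the stored weights in place and reusing the
    # incoming gradient unchanged, as described above, silently drops the
    # (1 - tanh(w)^2) / m factor from every weight update.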

For comparison, here is the DoReFa-Net quantization code. Note how it replaces the gradient of the quantization rounding with identity but does not modify the gradient of the squashing operations:

    # In tensorpack's dorefa.py these functions are defined inside
    # get_dorefa(bitW, bitA, bitG), so G = tf.get_default_graph() and the
    # bit widths bitW/bitA are captured from that enclosing scope.
    def quantize(x, k):
        n = float(2**k - 1)
        with G.gradient_override_map({"Round": "Identity"}):
            return tf.round(x * n) / n

    def fw(x):
        if bitW == 32:
            return x
        if bitW == 1:   # BWN
            with G.gradient_override_map({"Sign": "Identity"}):
                E = tf.stop_gradient(tf.reduce_mean(tf.abs(x)))
                return tf.sign(x / E) * E
        x = tf.tanh(x)
        x = x / tf.reduce_max(tf.abs(x)) * 0.5 + 0.5
        return 2 * quantize(x, bitW) - 1

    def fa(x):
        if bitA == 32:
            return x
        return quantize(x, bitA)

analog-cbarber commented 6 years ago

DoReFa-Net code is from http://dorefa.net/

analog-cbarber commented 6 years ago

My recommendation would be to save yourself the trouble of implementing the derivative explicitly: remove the tanh squashing operation from the QFullyConnected and QConvolution operators and just do clipping and rounding. If desired, the squashing can be applied external to the Q operators and provided as an option in the gluon Q* blocks.
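
A rough Gluon-flavored sketch of what that could look like (the names and signatures here are hypothetical, not BMXNet's actual API): the quantizer itself only clips and rounds, with BlockGrad acting as the straight-through estimator, and the tanh squashing is an opt-in preprocessing step that stays in the autograd graph:

    from mxnet import nd
    from mxnet.gluon import nn

    def quantize_k(x, bits):
        """Clip to [0, 1] and round to 2**bits - 1 levels.  BlockGrad makes the
        rounding a straight-through estimator, so only the clipping is
        differentiated."""
        n = 2 ** bits - 1
        x = nd.clip(x, 0.0, 1.0)
        return x + nd.BlockGrad(nd.round(x * n) / n - x)

    def tanh_squash(w):
        """Optional DoReFa-style squashing into [0, 1], applied outside the
        quantizer so autograd keeps its derivative.  The max is treated as a
        constant scale here for simplicity."""
        t = nd.tanh(w)
        scale = nd.max(nd.abs(t)).asscalar()
        return t / (2.0 * scale) + 0.5

    class QDense(nn.Block):
        """Hypothetical quantized fully-connected block: the operator only
        clips and rounds; squashing is an opt-in preprocessing step."""
        def __init__(self, units, in_units, bits=2, squash=False, **kwargs):
            super(QDense, self).__init__(**kwargs)
            self._bits = bits
            self._squash = squash
            with self.name_scope():
                self.weight = self.params.get('weight', shape=(units, in_units))

        def forward(self, x):
            w = self.weight.data()
            w01 = tanh_squash(w) if self._squash else nd.clip(w * 0.5 + 0.5, 0.0, 1.0)
            wq = 2 * quantize_k(w01, self._bits) - 1
            return nd.dot(x, wq.T)

    # Usage sketch:
    #   layer = QDense(units=10, in_units=20, bits=2, squash=True)
    #   layer.initialize()
    #   out = layer(nd.random.uniform(shape=(4, 20)))

Keeping the squashing outside the operator means its derivative is handled by autograd for free, so nothing has to be implemented by hand in the C++ operators.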