How to perform inference on hardware with alpha='auto'

I'm trying to implement inference on hardware using the Xilinx ap_fixed type for a model quantized with alpha='auto'. With alpha=1 it is straightforward: the weights (after applying the quantizer) can be exported directly to the hardware. With alpha='auto' it is more challenging. I have not found an explanation of how to compute the weights and the scale, so I have analyzed the code. This is an extract of quantized_bits for alpha='auto':

m = K.pow(2.0, K.cast_to_floatx(unsigned_bits))
m_i = K.pow(2.0, K.cast_to_floatx(self.integer))
x = x / m_i
levels = (2**(self.bits-1)-1) * 2 if self.symmetric else (2**self.bits)-1
scale = (K.max(abs(x), axis=axis, keepdims=True) * 2) / levels
v = tf.floor(tf.abs(x) / scale + 0.5)
mask = v < levels / 2
z = tf.sign(x) * tf.where(mask, v, tf.ones_like(v) * levels / 2)
xq = m_i * z / m
xq2 = scale * xq

My understanding is that z contains the integer representation of the weights, which uses the entire range of the type; in other words, the scale is optimal. xq is the floating-point representation of z, and xq2 holds the quantized weights, in floating point, that are actually used in the convolution during training. These can exceed the range of the type.
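To check this reading, here is a minimal NumPy sketch that mirrors the extract above line by line. It is not a QKeras API: quantize_auto is a hypothetical helper, and it assumes unsigned_bits = bits - 1 (one sign bit), integer = 0, and one scale per output channel, i.e. a reduction over every axis except the last of a tensor with at least two dimensions. Any step of the library that the extract omits is also absent here.

```python
import numpy as np

def quantize_auto(x, bits=8, integer=0, symmetric=True):
    """NumPy mirror of the extract above; hypothetical helper, not a QKeras API."""
    # Assumption: one bit is reserved for the sign, so unsigned_bits = bits - 1.
    m = 2.0 ** (bits - 1)
    m_i = 2.0 ** integer
    # Assumption: one scale per output channel (reduce over all axes but the last).
    axis = tuple(range(x.ndim - 1))
    x = x / m_i
    levels = (2 ** (bits - 1) - 1) * 2 if symmetric else 2 ** bits - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) * 2 / levels
    v = np.floor(np.abs(x) / scale + 0.5)
    # Clip the rounded magnitude at levels / 2, then restore the sign.
    z = np.sign(x) * np.where(v < levels / 2, v, levels / 2)
    xq = m_i * z / m
    xq2 = scale * xq
    return z, scale, xq, xq2

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3, 4, 8))      # made-up conv kernel: (kh, kw, in, out)
z, scale, xq, xq2 = quantize_auto(w)
print(scale.shape)                     # shape of the per-channel scale
print(np.abs(z).max(axis=(0, 1, 2)))   # largest integer code in each channel
print(np.abs(xq).max())                # largest magnitude of xq
```

With these assumptions z holds whole numbers whose magnitude reaches levels / 2 in every channel, and xq = z / m stays strictly below 1 in magnitude, so it would fit an ap_fixed type with 8 total bits and 1 integer (sign) bit.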

To implement this in hardware, I have to store z as the weights and compute scale, a constant that has to be applied after the convolution. For alpha='po2' it would be the same, but the scale can be applied as a bit shift.
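The reason the scale can be moved after the convolution is linearity: as long as scale is constant along the axes that are summed over, multiplying the weights by it is the same as multiplying the accumulated result. The sketch below illustrates this in NumPy under stated assumptions: a dense layer stands in for the convolution, the values of z, scale and x_in are made up, integer = 0, and k is a hypothetical per-channel exponent for the power-of-two case.

```python
import numpy as np

rng = np.random.default_rng(1)
bits = 8
m = 2.0 ** (bits - 1)   # assumption: integer = 0, so m_i = 1

# Made-up stand-ins for the exported data: integer codes, one scale per output.
z = rng.integers(-127, 128, size=(16, 4)).astype(np.float64)  # (inputs, outputs)
scale = rng.uniform(0.5, 2.0, size=(1, 4))
x_in = rng.normal(size=(5, 16))                               # batch of activations

# Training-time view: multiply by the float weights xq2 = scale * z / m.
y_float = x_in @ (scale * z / m)

# Hardware view: accumulate with the stored weights z / m, then apply the constant.
acc = x_in @ (z / m)
y_hw = acc * scale
print(np.allclose(y_float, y_hw))

# Power-of-two scale: multiplying by 2**k only moves the binary point,
# which is what a bit shift does on a fixed-point accumulator.
k = np.array([[-2, 0, 1, 3]])   # hypothetical per-channel exponents
y_po2 = np.ldexp(acc, k)
print(np.allclose(y_po2, acc * 2.0 ** k))
```

On real hardware the accumulator is fixed point rather than floating point, so the width of the accumulator and the rounding applied when the scale is multiplied in still have to be chosen separately.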

If this is correct, it would be nice to have a function that returns z and scale, since quantized_bits does not return them. Thanks

google / qkeras

How to perform inference on an hardware with alpha='auto' #134