tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

How to get quantized weights from QAT model? #626

Open Hackerman28 opened 3 years ago

Hackerman28 commented 3 years ago

Hi all. I've recently trained a Keras implementation of ssd-keras. I've managed to run QAT training on the model and got the desired accuracy. I wanted to get the quantized weights from the QAT model saved as an H5 model. There's no support or documentation regarding this on the TensorFlow website. How can I get the quantized weights from the saved model after QAT? I tried converting it to TFLite, but it does not convert due to a custom layer in the model definition. So it would be helpful if I could get the quantized weights alone from the saved model. Here's the code snippet for my QAT training. I am using TF 2.3.

```python
quantize_scope = tfmot.quantization.keras.quantize_scope

def apply_quantization_to_dense(layer):
    if 'priorbox' in layer.name:
        return layer
    if isinstance(layer, (tf.keras.layers.Concatenate, tf.keras.layers.Reshape, tf.keras.layers.Lambda)):
        return layer
    return tfmot.quantization.keras.quantize_annotate_layer(layer)

annotated_model = tf.keras.models.clone_model(model, clone_function=apply_quantization_to_dense)

with quantize_scope({'AnchorBoxes': AnchorBoxes}):
    quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)

quant_aware_model.summary()
quant_aware_model.compile(optimizer=adam, loss=ssd_loss.compute_loss)
quant_aware_model.fit_generator(train_generator, epochs=424, steps_per_epoch=1000,
                                callbacks=callbacks, validation_data=val_generator,
                                validation_steps=100, initial_epoch=414)
```

daverim commented 3 years ago

Assuming you have implemented the default 8-bit scheme:

for layer in keras_model.layers:
  if hasattr(layer, 'quantize_config'):
    for weight, quantizer, quantizer_vars in layer._weight_vars:
        # Run the quantizer in inference mode to get the fake-quantized
        # (quantized-then-dequantized) float weights.
        quantized_and_dequantized = quantizer(weight, training=False, weights=quantizer_vars)
        # The learned quantization range for this weight.
        min_var = quantizer_vars['min_var']
        max_var = quantizer_vars['max_var']
        quantized = dequantize(quantized_and_dequantized, min_var, max_var, quantizer)

Here `dequantize` is specific to the quantizer, based on its options. For example, for num_bits = 8, per_axis = False, symmetric = True, narrow_range = False:

# Symmetric signed 8-bit: the most negative representable value (-128).
quantized_min = -(1 << (quantizer.num_bits - 1))
# In the symmetric scheme the scale is derived from the min of the range.
scale = min_var / quantized_min
# Recover the integer weights from the fake-quantized float weights.
quantized = tf.math.round(quantized_and_dequantized / scale).numpy().astype(np.int8)
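Putting the two snippets together, a self-contained sketch might look like the following. This is not an official API: `_weight_vars` and `quantize_config` are internal attributes of the quantize-wrapped layers, the helper name `extract_int8_weights` is made up, and the per-tensor symmetric case from the example above is assumed (for per-channel quantizers, `min_var` is a vector and the division broadcasts per output channel).

```python
import numpy as np
import tensorflow as tf

def extract_int8_weights(qat_model):
    """Sketch: pull symmetric int8 weights out of a default-scheme QAT model."""
    int8_weights = {}
    for layer in qat_model.layers:
        # Only quantize-wrapped layers carry these internal attributes;
        # other layers (e.g. BatchNorm) are skipped here.
        if not hasattr(layer, 'quantize_config'):
            continue
        for weight, quantizer, quantizer_vars in layer._weight_vars:
            # Fake-quantized (quantize -> dequantize) float weights.
            qdq = quantizer(weight, training=False, weights=quantizer_vars)
            min_var = quantizer_vars['min_var'].numpy()
            # Default scheme: symmetric, 8-bit, not narrow-range.
            quantized_min = -(1 << (quantizer.num_bits - 1))  # -128
            scale = min_var / quantized_min
            int8_weights[weight.name] = tf.math.round(qdq / scale).numpy().astype(np.int8)
    return int8_weights
```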
LucasStromberg commented 3 years ago

@daverim does TensorFlow have any documentation on how one should define the dequantize operator for other cases? For example, if I'm using per-tensor, asymmetric, narrow-range, 3-bit quantization, how should I define my scale and zero-point variables? I currently have a makeshift implementation where the equations for scale/zero-point have no real motivation behind them; it would be nice to see how TF does it.

Further, when obtaining a quantized model this way by looping through all layers, what should one do with layers that don't have min/max properties, for example BatchNorm layers?

daverim commented 3 years ago

The documentation is basically the code -- in our case we use fake_quant_with_min_max_args:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/fake_quant_ops_functor.h

We currently default to an 8-bit scheme that matches TFLite: https://www.tensorflow.org/lite/performance/quantization_spec
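For the asymmetric case asked about above, a rough sketch of how a scale and zero point can be derived from a (min, max) range follows. This is a reading of the nudging logic in fake_quant_ops_functor.h and the TFLite quantization spec, not an official API, and the helper name `asymmetric_quant_params` is made up:

```python
def asymmetric_quant_params(min_val, max_val, num_bits, narrow_range=False):
    """Sketch of scale/zero-point derivation for asymmetric per-tensor quantization."""
    quant_min = 1 if narrow_range else 0
    quant_max = (1 << num_bits) - 1
    # The range must contain 0 so that 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (quant_max - quant_min)
    # Nudge the zero point to an integer, clamped to the representable range
    # (the fake_quant kernels do the same).
    zero_point = round(quant_min - min_val / scale)
    zero_point = int(max(quant_min, min(quant_max, zero_point)))
    return scale, zero_point
```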

I actually made a mistake here: it should be

quantized = quantize(quantized_and_dequantized, min_var, max_var, quantizer)

The basic idea is the same though. If you want to check your code: dequantize(quantized) == quantized_and_dequantized.
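As a concrete (hypothetical) check, assuming the symmetric scheme above where `scale = min_var / quantized_min`:

```python
import numpy as np

# Round-trip check: dequantizing the recovered integers should reproduce
# the fake-quantized float weights (up to float rounding).
dequantized = quantized.astype(np.float32) * scale
assert np.allclose(dequantized, quantized_and_dequantized.numpy(), atol=1e-6)
```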

For batchnorm, you should just return the non-quantized weights using layer.get_weights().

LucasStromberg commented 3 years ago

Great stuff, thank you.

Abhishekvats1997 commented 3 years ago

@daverim Thanks for the great explanation above. I had a follow-up question on this: suppose I wanted to fold all the batchnorms into the convolutions after QAT; this will alter the kernel weights. If I then use the stored min/max to quantize the folded weights, is that appropriate/similar to TFLite's implementation of batchnorm folding?

daverim commented 3 years ago

That is right, the stored min/max will be incorrect if it was collected before folding but used after folding the batch norms. It is probably simplest to just take the values after folding and recalculate the min and max (numpy.max(...)). However, if you really want to do this calculation beforehand, the code below shows essentially how the min and max should be adjusted to account for folding.

https://github.com/tensorflow/model-optimization/blob/v0.6.0/tensorflow_model_optimization/python/core/quantization/keras/layers/conv_batchnorm.py#L289
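For reference, a rough sketch of the simpler route (fold first, then recompute the range from the folded kernel) might look like the following. The helper name and the standard folding formula w_fold = w * gamma / sqrt(moving_variance + eps) are assumptions for illustration, not code taken from the linked file:

```python
import numpy as np

def fold_bn_and_recompute_range(kernel, bias, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Sketch: fold BatchNorm params into a conv kernel/bias and recompute min/max."""
    # Standard batchnorm folding: w_fold = w * gamma / sqrt(var + eps),
    #                             b_fold = beta + (b - mean) * gamma / sqrt(var + eps).
    bn_scale = gamma / np.sqrt(moving_var + eps)       # shape [cout]
    folded_kernel = kernel * bn_scale                   # broadcasts over the last axis of [kh, kw, cin, cout]
    folded_bias = beta + (bias - moving_mean) * bn_scale
    # Recompute the weight quantization range from the folded kernel
    # (per output channel here; take a plain min/max for a per-tensor scheme).
    new_min = folded_kernel.min(axis=(0, 1, 2))
    new_max = folded_kernel.max(axis=(0, 1, 2))
    return folded_kernel, folded_bias, new_min, new_max
```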

Abhishekvats1997 commented 3 years ago

@daverim Is this kind of folding emulation currently implemented during QAT? I was just wondering whether there will be a significant loss in accuracy if folding is not emulated during QAT but done post-training.

daverim commented 3 years ago

No, there is no folding emulation during training any more -- this is deprecated code, but it is still useful for understanding the calculation. The folding is now done by the TFLite converter, but as you can see in the code example, it is a straightforward float calculation. The folding itself should not lead to a loss of accuracy, as it is a linear operation.

Abhishekvats1997 commented 3 years ago

@daverim What a coincidence lol, the file you referenced in that code link was removed by TF today.

daverim commented 3 years ago

Sorry, I edited the last comment to point to the current release tag. As I mentioned, that folding is no longer done in TF; however, the calculation in the TFLite converter is essentially the same.

JonneryR commented 1 year ago

@daverim Hi, I want to know how to get the quantized biases, e.g. `biases_quantized = tf.math.round(layer.weights[1] / scale).numpy().astype(np.int32)`. Thank you very much.