scicafe / scicafe.github.io

scicafe blog
https://sci.cafe
GNU General Public License v3.0

Quantization 1 - Review #12

Open suriyadeepan opened 4 years ago

suriyadeepan commented 4 years ago

Review of Guide to Quantization and Quantization Aware Training using the TensorFlow Model Optimization Toolkit

TF's Model Optimization Toolkit (TFMOT) contains tools you can use to quantize and prune your model for faster inference on edge devices.

Quantization converts the weights of our model from high-precision floats to low-precision INT8 values.

In weight quantization, we quantize only the weights and then upconvert (dequantize) the saved weights back to float during inference.
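The float-to-INT8 conversion and the inference-time upconversion described above can be sketched in plain Python. This is a minimal illustration of per-tensor affine quantization, not TFMOT's actual implementation; the function names and the scale/zero-point scheme are assumptions for the example.

```python
def quantize(weights, qmin=-128, qmax=127):
    """Map float weights onto the INT8 range [qmin, qmax] (per-tensor affine)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Upconvert INT8 values back to float, as done at inference."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.9, -0.1, 0.0, 0.4, 1.2]
q, scale, zp = quantize(weights)
recovered = dequantize(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
```

The round trip is lossy: each weight is recovered only to within about half a quantization step (`scale / 2`), which is the error that quantization-aware training later learns to tolerate.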

post training quantization

  • Caps: the heading should be capitalized ("Post-Training Quantization")

On the other hand, quantization-aware training (QAT) emulates quantized weights during the training process.
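The emulation works via "fake quantization": weights stay float during training, but the forward pass routes them through a quantize-then-dequantize round trip so the network sees INT8 rounding error. A minimal sketch, assuming a fixed per-tensor scale and zero point (TFMOT inserts equivalent ops into the graph automatically; these names are illustrative):

```python
def fake_quant(w, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: output is float, but
    carries the rounding/clamping error of INT8 storage."""
    q = max(qmin, min(qmax, round(w / scale) + zero_point))
    return (q - zero_point) * scale

def forward(x, weights, scale, zero_point):
    """Dot product using fake-quantized weights (training-time emulation)."""
    return sum(xi * fake_quant(wi, scale, zero_point)
               for xi, wi in zip(x, weights))
```

Because the loss is computed against these perturbed weights, training adapts to the quantization error, which is why QAT usually loses less accuracy than post-training quantization.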

Instead, we will use the nightly version of TensorFlow (issue)

We will use the MobileNetV2 model in this example, so we need to import that.

We will use a Global Average Pooling layer after the MobileNet output. This will be followed by two fully connected layers and an output layer with 9 neurons and a softmax activation function.

More specifically, each layer in the model is replaced with its quantization-aware equivalent operation.

Note: We have not tested these reasons and there could be other causes.

soham96 commented 4 years ago

Issues needing to be fixed:

Note: We have not tested these reasons and there could be other causes. Shouldn't we check the literature for answers?

Will search for resources and push update when I find some. I don't think this is release blocking? @suriyadeepan @varchanaiyer