
Training Mixed-Precision Quantized Neural Networks for microcontroller deployments

Description

This project targets quantization-aware training methodologies in PyTorch for deploying quantized neural networks on microcontrollers. The featured mixed-precision quantization techniques aim at byte or sub-byte precision, i.e. INT8, INT4, or INT2. The generated deployment network uses integer arithmetic only. Optionally, the per-tensor bit precision is selected individually, driven by the memory constraints of the target device.
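
To make the memory trade-off concrete, here is a back-of-the-envelope sketch of the weight footprint of a single layer at different bit widths (the layer shape is hypothetical and this helper is not part of the repository):

def weight_memory_bytes(num_params, bits):
    # pack num_params values of `bits` bits each, rounded up to whole bytes
    return (num_params * bits + 7) // 8

params = 256 * 256 * 3 * 3  # a hypothetical 3x3 convolution with 256 input/output channels
for bits in (8, 4, 2):
    print("INT%d: %d bytes" % (bits, weight_memory_bytes(params, bits)))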

Reference

Please cite the following paper (arXiv:1905.13082) when using the code.

@article{rusci2019memory,
  title={Memory-Driven Mixed Low Precision Quantization For Enabling Deep Network Inference On Microcontrollers},
  author={Rusci, Manuele and Capotondi, Alessandro and Benini, Luca},
  journal={arXiv preprint arXiv:1905.13082},
  year={2019}
}

Questions

For any questions, just drop me an email.

Getting Started

Prerequisites

Setup

Set the correct dataset paths inside data.py. As an example:

_IMAGENET_MAIN_PATH = '/home/user/ImagenetDataset/'
_DATASETS_MAIN_PATH = './datasets/'

To download pretrained mobilenet weights:

$ cd models/mobilenet_tf/
$ source download_pretrained_mobilenet.sh

Quickstart

For quantization-aware retraining of an 8-bit, integer-only MobileNet model, run:

$ python3 main_binary.py -a mobilenet --mobilenet_width 1.0 --mobilenet_input 224 --save Imagenet/mobilenet_224_1.0_w8a8 --dataset imagenet --type_quant 'PerLayerAsymPACT' --weight_bits 8 --activ_bits 8 --activ_type learned --gpus 0,1,2,3 -j 8 --epochs 12 -b 128 --save_check --quantizer --batch_fold_delay 1 --batch_fold_type folding_weights
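
For intuition, the name 'PerLayerAsymPACT' suggests per-layer asymmetric weight quantization combined with PACT-style learned activation clipping. Below is a minimal sketch of asymmetric uniform fake quantization for a weight tensor (illustrative only, not the repository's implementation):

import torch

def fake_quant_asym(w, bits=8):
    # Map the range [w_min, w_max] onto the integer grid [0, 2^bits - 1],
    # round, and map back to real values (fake quantization).
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    return torch.round((w - w_min) / scale).clamp(0, qmax) * scale + w_min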

Quantization Options

Reproducing paper results

For any given MobileNet model, run the script with the desired width/input-resolution settings and the memory constraints of the target device. As an example:

$ python3 main_binary.py --model mobilenet --save Imagenet_ARM/mobilenet_128_0.75_quant_auto_tt --mobilenet_width 0.75 --mobilenet_input 128 --dataset imagenet -j 32 --epochs 10 -b 128 --save_check --gpus 0,1,2,3 --type_quant PerLayerAsymPACT --activ_type learned --quantizer --batch_fold_delay 1 --batch_fold_type folding_weights --mem_constraint [2048000,512000] --mixed_prec_quant MixPL
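
The --mem_constraint option is what drives the per-layer precision selection. As a rough illustration of the idea, assuming the two values are a weight (read-only) and an activation (read-write) memory budget in bytes, a greedy rule could lower the precision of the largest layers first until the weight footprint fits (a sketch of the concept only; the repository's selection procedure is more involved):

def assign_weight_bits(layer_sizes, budget_bytes, choices=(8, 4, 2)):
    # Start every layer at the highest precision, then lower the precision
    # of the largest layers until the total weight footprint fits the budget.
    bits = [choices[0]] * len(layer_sizes)
    footprint = lambda: sum(n * b // 8 for n, b in zip(layer_sizes, bits))
    for i in sorted(range(len(layer_sizes)), key=lambda i: -layer_sizes[i]):
        for b in choices[1:]:
            if footprint() <= budget_bytes:
                return bits
            bits[i] = b
    return bits  # may still exceed the budget if even the lowest precision does not fit

# hypothetical per-layer parameter counts and a 512 kB weight budget
print(assign_weight_bits([4608, 73728, 294912, 1179648], 512000))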

Quantization Strategy Guide

Overview

The quantization functions are located in quantization/quantop.py. The QuantOp operator wraps the full-precision model to handle weight quantization. As a usage example:

import quantization

quantizer = quantization.QuantOp(model, type_quant, weight_bits,
                                 batch_fold_type=batch_fold_type,
                                 batch_fold_delay=batch_fold_delay,
                                 act_bits=activ_bits,
                                 add_config=quant_add_config)

After wrapping a full-precision model, the QuantOp operator keeps track of both the real-valued parameters and their quantized counterparts. At training time, the quantizer works in combination with the optimizer:

  # weight quantization before the forward pass
  quantizer.store_and_quantize() # store the real-valued weights and quantize the working copies

  # forward pass
  output = model(input) # compute output
  loss = criterion(output, target) # compute loss

  if training:
      # backward pass
      optimizer.zero_grad()
      loss.backward()

      quantizer.restore_real_value() # restore the real-valued parameters
      quantizer.backprop_quant_gradients() # compute gradients w.r.t. the real-valued weights

      optimizer.step() # update the parameters

  else:
      quantizer.restore_real_value() # restore the real-valued weights after the forward pass
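
The store/restore pattern above is the standard quantization-aware-training recipe: quantized weights are used in the forward pass, while the gradient updates are applied to the real-valued copies. The same idea is often expressed as a straight-through estimator (STE); below is a minimal self-contained sketch of that estimator (illustrative only, not the repository's code):

import torch

class QuantizeSTE(torch.autograd.Function):
    # Round weights to a symmetric uniform grid in the forward pass and pass
    # the gradient through unchanged in the backward pass (straight-through).

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient for w, no gradient for bits

w_q = QuantizeSTE.apply(torch.randn(16, 16), 4)  # usage: 4-bit fake-quantized weights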

Weight Quantization

Currently, the following quantization schemes are supported:

Activation Quantization

At the present stage, the quantized activation layers must be part of the model definition itself; the input model is therefore already a fake-quantized model. See models/mobilenet.py as an example. This part will be improved with automatic graph analysis and parsing, to turn a full-precision input model into a fake-quantized one.
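
For illustration, a fake-quantized activation layer of the kind that could appear in such a model definition, here a PACT-style clipped ReLU with a learned clipping bound alpha (a sketch under assumed conventions, not the actual module from models/mobilenet.py):

import torch
import torch.nn as nn

class PACTQuantAct(nn.Module):
    # ReLU with a learned clipping bound alpha, fake-quantized to `bits` bits
    def __init__(self, bits=8, alpha=10.0):
        super().__init__()
        self.bits = bits
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x):
        y = torch.clamp(x, min=0.0)
        y = torch.min(y, self.alpha)  # PACT clipping; this path trains alpha
        scale = self.alpha / (2 ** self.bits - 1)
        y_q = torch.round(y / scale) * scale
        # straight-through estimator: forward the rounded value,
        # backpropagate through the unrounded one
        return y + (y_q - y).detach()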

Limitations

This project does not include any graph analysis tools. Hence, the graph parser (see the __init__ of the QuantOp operator) is specific to the tested model models/mobilenet.py, which already includes quantized activation layers. A rework of this part may be necessary to apply the implemented techniques to other models.