CC @mingmingtasd @miaobin.
The GNA plugin supports online and offline quantization.

The online approach quantizes the model with an input scale factor while the network is being loaded via GNAPlugin::LoadNetwork(ICNNNetwork &network); the section below quantizes the model to I16 / I8.
```cpp
switch (gnaPrecision) {
    case Precision::I16:
        ModelQuantizer<QuantI16> q16;
        newNet = q16.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    case Precision::I8:
        ModelQuantizer<QuantI8> q8;
        newNet = q8.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    default:
        THROW_GNA_EXCEPTION << "no mans land for GNA precision";
        break;
}
```
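
For reference, here is a minimal sketch of how an application could drive this online quantization path through the public Inference Engine API. The model and file names are placeholders, and the GNA config keys and values (GNA_DEVICE_MODE, GNA_PRECISION, GNA_SCALE_FACTOR) are assumptions based on the plugin's documented configuration, not something stated in this issue.

```cpp
#include <map>
#include <string>
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;

    // Placeholder IR model paths.
    auto network = core.ReadNetwork("model.xml", "model.bin");

    // Online quantization: the GNA plugin quantizes weights while the network
    // is being loaded, using the configured target precision and the input
    // scale factor.
    std::map<std::string, std::string> config = {
        {"GNA_DEVICE_MODE", "GNA_SW_EXACT"},  // software emulation with GNA-accurate math
        {"GNA_PRECISION", "I16"},             // or "I8"
        {"GNA_SCALE_FACTOR", "2048"}          // input scale factor used during quantization
    };

    // Core::LoadNetwork dispatches to GNAPlugin::LoadNetwork, which runs the
    // ModelQuantizer<QuantI16> / ModelQuantizer<QuantI8> switch shown above.
    auto executable = core.LoadNetwork(network, "GNA", config);
    return 0;
}
```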
The offline approach saves a GNA-optimized model (non-IR) with the -wg flag, which is quantized while running speech_sample; the quantized model is then loaded by calling GNAPlugin::ImportNetwork(const std::string &modelFileName). This is not suitable for our current design.
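
A sketch of that offline flow, under the same assumptions (the speech_sample command line and file names are placeholders):

```cpp
// Offline flow (sketch). First export a GNA-optimized, already-quantized model, e.g.:
//   ./speech_sample -m model.xml -i input.ark -d GNA_SW_EXACT -wg exported.gna
// Then load the exported blob directly, skipping the quantization step:
#include <map>
#include <string>
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    std::map<std::string, std::string> config = {
        {"GNA_DEVICE_MODE", "GNA_SW_EXACT"}
    };
    // For the GNA device, Core::ImportNetwork forwards to
    // GNAPlugin::ImportNetwork(const std::string &modelFileName).
    auto executable = core.ImportNetwork("exported.gna", "GNA", config);
    return 0;
}
```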
For Inference Engine low-precision 8-bit integer inference, FakeQuantize layers are inserted before most layers during the offline quantization stage, so that the tensors feeding those layers are quantized (a scalar sketch of the operation follows the table). Plugin support for FakeQuantize:

| LAYERS | GPU | CPU | VPU | GNA | FPGA | SHAPEINFER |
|---|---|---|---|---|---|---|
| FakeQuantize | Not Supported | Supported | Not Supported | Not Supported | Not Supported | Supported |
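
For context, FakeQuantize maps each input value onto a fixed number of quantization levels. A minimal scalar sketch of that operation (the helper and its names are illustrative, not OpenVINO code):

```cpp
#include <cmath>

// Scalar form of FakeQuantize: values in [in_low, in_high] are snapped to
// `levels` discrete steps and rescaled into [out_low, out_high]; values
// outside the input range are clamped to the output range bounds.
float fake_quantize(float x, float in_low, float in_high,
                    float out_low, float out_high, int levels) {
    if (x <= in_low)  return out_low;
    if (x >  in_high) return out_high;
    float q = std::round((x - in_low) / (in_high - in_low) * (levels - 1));
    return q / (levels - 1) * (out_high - out_low) + out_low;
}
```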
But GNA doesn't support FakeQuantize layers, so we have no better approach that aligns with our design for now.
For more information, see the comments on this PR.