intel / webml-polyfill

Deprecated: the Web Neural Network Polyfill project has moved to https://github.com/webmachinelearning/webnn-polyfill

[GNA] Fix the workaround of using scale of float32 model to do int8 inference on GNA plugin #1191

Open fujunwei opened 4 years ago

fujunwei commented 4 years ago

See the comments on this PR for more information.

fujunwei commented 4 years ago

CC @mingmingtasd @miaobin.

fujunwei commented 4 years ago

The GNA plugin supports online and offline quantization.

The online approach quantizes the model with the input scale factor when the network is loaded via GNAPlugin::LoadNetwork(ICNNNetwork &network); the snippet below quantizes the model to I16 or I8:

switch (gnaPrecision) {
    case Precision::I16: {
        // Quantize the loaded network to 16-bit integers using the input scale factors.
        ModelQuantizer<QuantI16> q16;
        newNet = q16.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    }
    case Precision::I8: {
        // Quantize the loaded network to 8-bit integers using the input scale factors.
        ModelQuantizer<QuantI8> q8;
        newNet = q8.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    }
    default:
        THROW_GNA_EXCEPTION << "no mans land for GNA precision";
        break;
}
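
A minimal sketch of how an application could drive this online path through the OpenVINO Inference Engine 2.x C++ API is shown below; the model paths and the scale factor value are placeholders, and the GNA_* keys are the GNA plugin configuration options documented by OpenVINO, not something defined in this repo.

#include <inference_engine.hpp>
#include <map>
#include <string>

int main() {
    InferenceEngine::Core core;

    // Read an IR model (paths are placeholders).
    auto network = core.ReadNetwork("model.xml", "model.bin");

    // Request I8 precision and pass the input scale factor so the GNA plugin
    // runs the online quantization shown in the switch above.
    std::map<std::string, std::string> config = {
        {"GNA_DEVICE_MODE", "GNA_AUTO"},
        {"GNA_PRECISION", "I8"},
        {"GNA_SCALE_FACTOR", "2048"}
    };

    // LoadNetwork dispatches to GNAPlugin::LoadNetwork for the "GNA" device.
    auto execNetwork = core.LoadNetwork(network, "GNA", config);
    auto request = execNetwork.CreateInferRequest();
    return 0;
}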

The offline approach saves a GNA-optimized model (non-IR) that is quantized while running speech_sample with the -wg flag, and then loads the quantized model via GNAPlugin::ImportNetwork(const std::string &modelFileName); this is not suitable for our current design.
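
Below is a rough sketch of how that offline path could be consumed through the public Inference Engine API, assuming a GNA model blob was previously exported (for example by speech_sample with -wg); the file name is a placeholder, and InferenceEngine::Core::ImportNetwork is used here rather than calling GNAPlugin::ImportNetwork directly.

#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;

    // Import an already-quantized, GNA-optimized (non-IR) model blob.
    // "model_int8.gna" stands in for the file written with -wg.
    auto execNetwork = core.ImportNetwork("model_int8.gna", "GNA");
    auto request = execNetwork.CreateInferRequest();
    return 0;
}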

For Inference Engine low-precision 8-bit integer inference, FakeQuantize layers are inserted before most layers during the offline quantization stage so that those layers receive quantized tensors:

| Layers | GPU | CPU | VPU | GNA | FPGA | ShapeInfer |
| --- | --- | --- | --- | --- | --- | --- |
| FakeQuantize | Not Supported | Supported | Not Supported | Not Supported | Not Supported | Supported |
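
For reference, the element-wise transform that a FakeQuantize layer applies (following the OpenVINO operation specification) can be sketched roughly as below; the scalar fake_quantize helper and the per-tensor ranges are simplifications for illustration, not the plugin's actual implementation.

#include <algorithm>
#include <cmath>

// Rough sketch of FakeQuantize with per-tensor ranges: clamp x into
// [input_low, input_high], snap it to one of `levels` evenly spaced values,
// then map the result into [output_low, output_high].
float fake_quantize(float x,
                    float input_low, float input_high,
                    float output_low, float output_high,
                    int levels) {
    x = std::min(std::max(x, input_low), input_high);
    float q = std::round((x - input_low) / (input_high - input_low) * (levels - 1));
    return q / (levels - 1) * (output_high - output_low) + output_low;
}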

But GNA doesn't support the FakeQuantize layer, so for now we have no better approach that aligns with our design.