intel / webml-polyfill

Deprecated: the Web Neural Network Polyfill project has moved to https://github.com/webmachinelearning/webnn-polyfill

[GNA] Fix the workaround of using scale of float32 model to do int8 inference on GNA plugin #1191

Open fujunwei opened 4 years ago

fujunwei commented 4 years ago

See the comments on this PR for more information.

fujunwei commented 4 years ago

CC @mingmingtasd @miaobin.

fujunwei commented 4 years ago

The GNA plugin supports online and offline quantization.

The online approach quantizes the model with the input scale factor when the network is loaded via GNAPlugin::LoadNetwork(ICNNNetwork &network); the snippet below quantizes the model to I16 or I8:

switch (gnaPrecision) {
    case Precision::I16: {
        // Quantize the loaded network to 16-bit integers using the input scale factors.
        ModelQuantizer<QuantI16> q16;
        newNet = q16.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    }
    case Precision::I8: {
        // Quantize the loaded network to 8-bit integers using the input scale factors.
        ModelQuantizer<QuantI8> q8;
        newNet = q8.quantize(network, run_passes, inputsDesc->inputScaleFactors);
        break;
    }
    default:
        THROW_GNA_EXCEPTION << "no mans land for GNA precision";
        break;
}
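
A minimal sketch of how an application could drive this online path through the OpenVINO Inference Engine 2.x C++ API is shown below; the model paths and the scale factor value are placeholders, and the GNA_* keys are the GNA plugin configuration options documented by OpenVINO, not something defined in this repo.

#include <inference_engine.hpp>
#include <map>
#include <string>

int main() {
    InferenceEngine::Core core;

    // Read an IR model (paths are placeholders).
    auto network = core.ReadNetwork("model.xml", "model.bin");

    // Request I8 precision and pass the input scale factor so the GNA plugin
    // runs the online quantization shown in the switch above.
    std::map<std::string, std::string> config = {
        {"GNA_DEVICE_MODE", "GNA_AUTO"},
        {"GNA_PRECISION", "I8"},
        {"GNA_SCALE_FACTOR", "2048"}
    };

    // LoadNetwork dispatches to GNAPlugin::LoadNetwork for the "GNA" device.
    auto execNetwork = core.LoadNetwork(network, "GNA", config);
    auto request = execNetwork.CreateInferRequest();
    return 0;
}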

The offline approach saves a GNA-optimized model (non-IR) that is quantized while running speech_sample with the -wg flag, and then loads the quantized model via GNAPlugin::ImportNetwork(const std::string &modelFileName); this is not suitable for our current design.
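
Below is a rough sketch of how that offline path could be consumed through the public Inference Engine API, assuming a GNA model blob was previously exported (for example by speech_sample with -wg); the file name is a placeholder, and InferenceEngine::Core::ImportNetwork is used here rather than calling GNAPlugin::ImportNetwork directly.

#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;

    // Import an already-quantized, GNA-optimized (non-IR) model blob.
    // "model_int8.gna" stands in for the file written with -wg.
    auto execNetwork = core.ImportNetwork("model_int8.gna", "GNA");
    auto request = execNetwork.CreateInferRequest();
    return 0;
}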

For Inference Engine low-precision 8-bit integer inference, FakeQuantize layers are inserted before most layers during the offline quantization stage so that those layers receive quantized tensors:

| Layers | GPU | CPU | VPU | GNA | FPGA | ShapeInfer |
| --- | --- | --- | --- | --- | --- | --- |
| FakeQuantize | Not Supported | Supported | Not Supported | Not Supported | Not Supported | Supported |
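
For reference, the element-wise transform that a FakeQuantize layer applies (following the OpenVINO operation specification) can be sketched roughly as below; the scalar fake_quantize helper and the per-tensor ranges are simplifications for illustration, not the plugin's actual implementation.

#include <algorithm>
#include <cmath>

// Rough sketch of FakeQuantize with per-tensor ranges: clamp x into
// [input_low, input_high], snap it to one of `levels` evenly spaced values,
// then map the result into [output_low, output_high].
float fake_quantize(float x,
                    float input_low, float input_high,
                    float output_low, float output_high,
                    int levels) {
    x = std::min(std::max(x, input_low), input_high);
    float q = std::round((x - input_low) / (input_high - input_low) * (levels - 1));
    return q / (levels - 1) * (output_high - output_low) + output_low;
}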

But GNA doesn't support the FakeQuantize layer, so for now we have no better approach that aligns with our design.