
Pytorch bias quantisation on aarch64 - how best to proceed? #2015

Open christinaburge opened 1 month ago

christinaburge commented 1 month ago

Hi! Following on from issue https://github.com/oneapi-src/oneDNN/issues/1864, I have been doing some further investigation, and it looks like it's relatively straightforward to add quantized-bias support by making these super-hacky changes to

pytorch/third_party/ideep/mkl-dnn/src/cpu/gemm_x8s8s32x_convolution_utils.cpp

    float data, b;

    if (jcp_.with_bias) {
        if (jcp_.bias_data_type == dnnl_s32) {
            // The bias is not properly quantized, so this is a hack for now:
            // load it as f32, quantize it with the combined scale, and add it
            // to the s32 accumulator before dequantizing.
            b = io::load_float_value(dnnl_f32, bias, g * jcp_.oc + oc);
            int32_t quant_b = static_cast<int32_t>(std::round(
                    b / scales[(g * jcp_.oc + oc) * jcp_.scale_idx_mult]));
            data_s32 += quant_b;
            // dequantize data
            data = static_cast<float>(data_s32);
            data *= scales[(g * jcp_.oc + oc) * jcp_.scale_idx_mult];
        } else {
            // Non-s32 bias: dequantize the accumulator first, then add the
            // bias in f32, as the library already does.
            b = io::load_float_value(jcp_.bias_data_type, bias, g * jcp_.oc + oc);
            // dequantize data
            data = static_cast<float>(data_s32);
            data *= scales[(g * jcp_.oc + oc) * jcp_.scale_idx_mult];
            data += b;
        }
    } else {
        // dequantize data
        data = static_cast<float>(data_s32);
        data *= scales[(g * jcp_.oc + oc) * jcp_.scale_idx_mult];
    }
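
For reference, here is a standalone sketch of the per-channel arithmetic the change above performs (a hypothetical helper, not library code): quantize the f32 bias with the combined scale, add it to the s32 accumulator, then dequantize the sum, which up to the rounding of the bias equals acc * scale + bias.

    // Standalone sketch of the arithmetic in the hack above (hypothetical
    // helper, not part of oneDNN). Up to the rounding of the bias, this is
    // equivalent to static_cast<float>(acc_s32) * scale + bias_f32.
    #include <cmath>
    #include <cstdint>

    float dequant_with_s32_bias(int32_t acc_s32, float bias_f32, float scale) {
        const int32_t quant_b
                = static_cast<int32_t>(std::round(bias_f32 / scale));
        return static_cast<float>(acc_s32 + quant_b) * scale;
    }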

Does that sound sensible? Or is there a better way to proceed?

kminemur commented 1 month ago

Hi @christinaburge

May we know why you need this implementation? Any background would help.

To use your desired bias data type in oneDNN, you can change the bias memory and the bias memory descriptor, using cnn_inference_int8.cpp as a reference:

    // auto user_bias_memory = memory({{conv_bias_tz}, dt::f32, tag::x}, eng);
    auto user_bias_memory = memory({{conv_bias_tz}, dt::s8, tag::x}, eng);

    // auto conv_bias_md = memory::desc({conv_bias_tz}, dt::f32, tag::any);
    auto conv_bias_md = memory::desc({conv_bias_tz}, dt::s8, tag::any);

The DNNL_VERBOSE output should then show an s8 bias:

    ~/project/oneDNN/examples$ ONEDNN_VERBOSE=1 ./a.out | grep bia
    onednn_verbose,primitive,exec,cpu,convolution,brg_conv_fwd:avx2_vnni,forward_training,src_u8:a:blocked:acdb::f0 wei_s8:a:blocked:AcdB24a4b::f0 bia_s8::a:blocked:a::f0 dst_u8:a:blocked:acdb::f0,attr-scales:src0:0:f32+dst:0:f32+wei:0:f32 attr-post-ops:eltwise_relu,alg:convolution_direct,mb8_ic256oc384_ih13oh13kh3sh1dh0ph1_iw13ow13kw3sw1dw0pw1,14.845
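
For context, here is a minimal sketch of how those two lines would fit into an int8 convolution setup, assuming the oneDNN v3.x C++ API. This is not the actual cnn_inference_int8.cpp example; the shapes are taken from the verbose line above and the scale masks are chosen for illustration.

    // Minimal sketch (assumptions: oneDNN v3.x C++ API, shapes from the
    // verbose line above). Shows an s8 bias descriptor passed to the
    // convolution primitive descriptor together with runtime scales.
    #include "oneapi/dnnl/dnnl.hpp"

    using namespace dnnl;
    using tag = memory::format_tag;
    using dt = memory::data_type;

    int main() {
        engine eng(engine::kind::cpu, 0);

        // Shapes matching mb8_ic256oc384_ih13oh13kh3_iw13ow13kw3, pad 1.
        memory::dims conv_src_tz = {8, 256, 13, 13};
        memory::dims conv_weights_tz = {384, 256, 3, 3};
        memory::dims conv_bias_tz = {384};
        memory::dims conv_dst_tz = {8, 384, 13, 13};
        memory::dims strides = {1, 1}, padding = {1, 1};

        // Bias declared as s8 instead of the example's f32.
        auto user_bias_memory = memory({{conv_bias_tz}, dt::s8, tag::x}, eng);
        auto conv_bias_md = memory::desc({conv_bias_tz}, dt::s8, tag::any);

        auto conv_src_md = memory::desc({conv_src_tz}, dt::u8, tag::any);
        auto conv_weights_md = memory::desc({conv_weights_tz}, dt::s8, tag::any);
        auto conv_dst_md = memory::desc({conv_dst_tz}, dt::u8, tag::any);

        // Runtime scales for src/weights/dst, as in the verbose line.
        primitive_attr attr;
        attr.set_scales_mask(DNNL_ARG_SRC, 0);
        attr.set_scales_mask(DNNL_ARG_WEIGHTS, 0);
        attr.set_scales_mask(DNNL_ARG_DST, 0);

        auto conv_pd = convolution_forward::primitive_desc(eng,
                prop_kind::forward_inference, algorithm::convolution_direct,
                conv_src_md, conv_weights_md, conv_bias_md, conv_dst_md,
                strides, padding, padding, attr);
        (void)conv_pd;
        return 0;
    }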

dzarukin commented 1 month ago

@christinaburge Though your observation is correct and one could add such support the way you suggested, it would not really help, for two reasons:

1. The library would simply be undoing what the user just did, in particular un-scaling a bias the user just scaled on their end. Performance-wise, there is no point in doing the work twice for nothing.
2. The programming model says the bias is applied after the scales, so this would contradict that definition: a quantized bias would be applied before scaling, while a non-quantized one is applied right after.

To keep the solution clean, the bias should be passed non-quantized. We haven't restricted that so far, but we might consider doing so in the next major version to avoid such situations. Hope it helps.
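
In other words, if the framework has already quantized the bias to s32 with the combined src and weights scales, it should convert it back to f32 before passing it to the library. A hypothetical application-side helper (not oneDNN API) could look like this:

    // Hypothetical application-side helper (not part of oneDNN): recover an
    // f32 bias from an s32 bias that was quantized with the combined
    // src * weights scale, so the library can apply it after dequantization.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<float> bias_s32_to_f32(const std::vector<int32_t> &bias_s32,
            float src_scale, const std::vector<float> &wei_scales) {
        std::vector<float> bias_f32(bias_s32.size());
        for (std::size_t oc = 0; oc < bias_s32.size(); ++oc) {
            // Per-channel weights scale if available, common scale otherwise.
            const float wei_scale
                    = wei_scales.size() == 1 ? wei_scales[0] : wei_scales[oc];
            bias_f32[oc]
                    = static_cast<float>(bias_s32[oc]) * src_scale * wei_scale;
        }
        return bias_f32;
    }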

vpirogov commented 1 month ago

You can find a detailed explanation of the current oneDNN quantization scheme in the "Quantization and scaling" RFC.