Alan-Turing-Ko commented 5 years ago

Hi @majianjia. Thank you for your quick response everytime. I have had accuracy test of my model using your framework. It had got 99.2% using by caffe framework, but in nnom, it dropped to 95%. Is there any way to improve this? I've used ncnn int8 many times, and it's accuracy performance was quite good. What do you think about their quantization method? https://github.com/BUG1989/caffe-int8-convert-tools I want your opinion about this problem. Thanks.

majianjia commented 5 years ago

Hi @Alan-Turing-Ko， Thanks for the info. Their results look very promising. I will check them in detail later.

To help you understand the problem you met: The quantisation method here is the same as the KWS example in the CMSIS-NN. However, we also adapted it for some constraints in CMSIS-NN backend.
The accuracy defect you see in your model might due to them.

Avgpool from CMSIS-NN will cause small values vanishing. Because it cannot shift the output to a smaller range for better resolution. This defect will drop accuracy massively.
In a complex structure, such as Inception and DenseNet, the output format are forced to be the same across all layers in the block. Since the output of them must be the same q-format. The Q is set to the contain the maximum abs of all layers's output.

Temporary solutions are

avoid average pooling.
using batch normalization after each convolutional layer, it will more or less make all layer in a block to the same output range. (The parameters of BN layer will be fused into its previous Conv layer during model conversion, so no additional cost to MCU) Hops it helps, Jianjia

Alan-Turing-Ko commented 5 years ago

Thank you. I don't using avgpool, but use maxpool. Following is part of my model structure.

Input -
ZeroPad -
Conv2D - ReLU
ZeroPad -
MaxPool -
Conv2D - ReLU
... Dense
Softmax
Output

And shift bits are mostly 0 - 3, except input and output layer of shift bits 7.

define INPUT_OUTPUT_SHIFT 7

define CONV2D_CONV1_OUTPUT_SHIFT 3

define RE_LU_CONV1_OUTPUT_SHIFT 3

define ZERO_PADDING2D2_OUTPUT_SHIFT 3

define MAX_POOLING2D_POOL1_OUTPUT_SHIFT 3

define CONV2D_CONV2A_OUTPUT_SHIFT 3

define RE_LU_CONV2A_OUTPUT_SHIFT 3

define ZERO_PADDING2D3_OUTPUT_SHIFT 3

...

define SOFTMAX_PROV_OUTPUT_SHIFT 7

Do I need batchnormalization? Is it right to shift nnom output by 7bits to get real output? And if so, hope your advice to reduce accuracy drop.

majianjia commented 5 years ago

Hi @Alan-Turing-Ko , Your model is simple enough, I don't think the problem is due to the shifts. Using BN is not only for the output shifts but also for better utilisation within 8bit data. If you are using the prediction APIs in NNoM, the rank logic might be different. The logic to rank the most possible prediction is here. https://github.com/majianjia/nnom/blob/master/src/nnom_utils.c#L98

In addition, I just met some accuracy loss while using cropping and zeropadding. I am still investigating the causes. I will let you know if there is a problem in these layers

Alan-Turing-Ko commented 5 years ago

Thanks. Looking forward to your information.

majianjia commented 5 years ago

hi @Alan-Turing-Ko , After my validation, the mentioned layers are fine. I have reviewed my quantisation process comparing to your original post, I think the reason for the accuracy loss is what you have mentioned in the links to the caffe tools. They use KLD for quantisation from TensorRT, but here we use the most simple method similar to what has provided by ARM CMSIS-NN.

I will check if I can implement the KLD in NNoM. Thank you again for the very useful info.

Alan-Turing-Ko commented 5 years ago

I am glad that my info helped you. I think if nnom can achieve less accuracy drop, then it will become standard framework in mcu deep learning.

majianjia commented 5 years ago

Hi @Alan-Turing-Ko I just implemented KLD (as a default method) for activation quantisation. I tested a few times and I saw some improvements in accuracy.
If you want to test the new method, please check the latest dev branch. The new generate_model() has a new argument kld which has been set to true by default.
Please tell me if you have problem

Alan-Turing-Ko commented 5 years ago

Hi @majianjia. I have tested your implementation. It seems work well. I used caffe model of 99.2% precision. And when I used original nnom, I've got 95% accuracy. Now with kld, I've got 97.4% accuracy. Good improvement. And I think there might be more room to improve. Actually, I've got 99% accuracy using ncnn-int8. I've analyze their method, and they use coeffs per channels in convolution layer besides kld thres. Below is their coeff table format.

Convolution1_param_0 1606.0995204687654 0.0 757.8799812975974 1932.4205208073888 0.0 1146.692408641472 309.291375779976 1578.7488114453201 409.34268143910214 0.0 0.0 0.0 370.46756840684645 1760.779801544097 579.1359430340916 1222.808364306494 1337.122325896775 0.0 1507.0069347960834 1094.0476937784795 0.0 440.6704557276128 331.8627082809129 1343.4502784764713 1095.0753402313055 900.2248578047647 1691.1611712726103 1394.6224412478944 1529.0333412510522 906.0349056473398 408.65619549461115 632.4786566926485 Convolution2_param_0 260.296456032807 121.08234071461098 313.5705359499999 71.4519828886844 100.2569287948387 185.3271882307079 517.5214526548515 176.6876327094864 170.74149224489045 97.83382194404801 183.6318667821338 119.26581176893339 137.93880983037445 368.95641813533933 169.8380758330812 97.31521924817703 0.0 149.74516122345094 113.92372340798374 156.8678104241145 0.0 292.12859262213306 0.0 240.9336423831549 412.6875349754148 112.86933138596473 0.0 244.55579728519086 110.25680458721354 147.8738388574994 104.06476704646465 150.7603281026267 ... Convolution1 127.5914900140364 Convolution2 67.83715883390283 ...

Not sure, their method is applicable to nnom. It might increase computation throughput. Thanks for kld implementation.

majianjia commented 5 years ago

Hi @Alan-Turing-Ko , Thanks for helping me testing the new algorithm. I was also testing it last night, I see an accuracy decrease when using KLP implementation. In my own model with cifar-10. wo KLD 74.45%, with kld 74.14%, PC was giving 75.4%. I am still investegating what causes the problem. Currently dense layer is also using KLD for quanz, but I doubt that whether is appropriate. Thanks for the extra info, I will check later. I will also check nncn. I think they have optimized for very low level accelerations therefore to have channelwise Q-format. In MCU we cannot use separate quantization in between channels.

The KLD I implemented to NNoM use a 2 based thresholds. Since CMSIS-NN as well as KPU in K210 only doing shifting for the threshold. So, the resolution here is worse than the original KLD method in TensorRT (the one you mention in your original post). I am not sure whether this can be improve.

majianjia / nnom

Quantizing Accuracy consideration. #52

define INPUT_OUTPUT_SHIFT 7

define CONV2D_CONV1_OUTPUT_SHIFT 3

define RE_LU_CONV1_OUTPUT_SHIFT 3

define ZERO_PADDING2D2_OUTPUT_SHIFT 3

define MAX_POOLING2D_POOL1_OUTPUT_SHIFT 3

define CONV2D_CONV2A_OUTPUT_SHIFT 3

define RE_LU_CONV2A_OUTPUT_SHIFT 3

define ZERO_PADDING2D3_OUTPUT_SHIFT 3

define SOFTMAX_PROV_OUTPUT_SHIFT 7