Cannot run some network with spike

J-Zenk commented 4 years ago

Describe the bug When run onnx models here https://github.com/pranav-prakash/onnxruntime-riscv/releases/tag/v0.01, I got bad syscall #131! I have tried googlenet_quantized.onnx, mobilenet_quantized_optimized.onnx and resnet50_quantized.onnx, only the resnet could run normally.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7
ONNX Runtime installed from (source or binary): source
Python version:Python: 2.7.5 Python3: 3.6.8

To Reproduce

git clone chipyard and esp-tools
follow the instruction in chipyard to install esp-tools in chipyard and follow the instruction in systolic_runner to patch riscv-pk in esp-tools and install spike & pk in esp-tools
change PATH to run the patched spike and pk
git clone this project
follow the instruction in systolic_runner to build onnx and ort_test
download googlenet_quantized.onnx from https://github.com/pranav-prakash/onnxruntime-riscv/releases/tag/v0.01

run spike --extension=gemmini pk ort_test -m googlenet_quantized.onnx -i images/cat.jpg -p caffe2 -x 1 -O 0 Here is the result


Gemmini extension configured with:
dim = 16
bbl loader
Loaded runner program
Using systolic in mode 1
Using Onnxruntime C++ API
terminate called after throwing an instance of 'Ort::Exception'
what():  /home/zenk/onnxruntime-riscv/onnxruntime/core/session/inference_session.cc:245 onnxruntime::InferenceSession::InferenceSession(const onnxruntime::SessionOptions&, const onnxruntime::Environment&, const string&) status.IsOK() was false. Given model could not be parsed while creating inference session. Error message: Protobuf parsing failed.

bad syscall #131!

**Expected behavior**
Get some outputs.

**Additional context**
I also have some other questions. I want to get a network to do face verification. And I found ```https://github.com/onnx/models/tree/master/vision/body_analysis/arcface```. But I tried run the onnx file directly, and it gives me the same bad syscall #131. I also tried to quantize this network by running ```python3 calibrate.py --model_path arcfaceresnet100-8.onnx  --dataset_path ./ --output_model_path arcfaceresnet100-8_quantized.onnx --static=True --data_preprocess=mxnet --mode=int8```. It also failed. 
Here is the result

Traceback (most recent call last): File "calibrate.py", line 379, in main() File "calibrate.py", line 348, in main args.data_preprocess) File "calibrate.py", line 289, in load_test_data preprocess_method) File "calibrate.py", line 261, in load_single_test_data 'Number of input protobufs does not match expected model inputs') ValueError: Number of input protobufs does not match expected model inputs



Why I cannot run some network, and why I get the error when try to quantize a network?  How to fix them? Thanks for helping.

pranav-prakash commented 4 years ago

Hm I've never seen that "Error message: Protobuf parsing failed" before. That syscall seems to correspond to tgkill so I think that's just the runner trying to kill itself after the failed protobuf parse.

I cloned a fresh version of the repo (on centos7) and couldn't reproduce the issue, which is going to make this a bit harder to debug. (Just as a quick sanity-check, can you run md5sum googlenet_quantized.onnx? It should be e1360471e07e0810ee3696eb44e66c57).

The only significant difference I can see at the moment is that your version of pk / spike seems to be a bit newer than what I've been using (in particular mine doesn't print "bbl loader").

Can you try building spike via the following:

Clone https://github.com/ucb-bar/esp-tools (using the default master branch, latest commit should be dcb6012f7)
Apply the syscall patches inside the riscv-pk folder
Inside riscv-isa-sim, checkout commit 506d83f6a8d on the master branch
Build spike and try using that built version of spike to run.

Hopefully this works, if not we can do some more digging (the fact that protobuf parsing fails seems to indicate that it's a pretty fundamental issue unrelated to the gemmini extension).

As for the calibration script, it's written to assume that the --dataset_path directory contains a bunch of folders labeled test_data_set_1, test_data_set_2, etc. each of which contains an input_0.pb, input_1.pb, and so on. I haven't checked what the folder structure of the arcface model looks like, but the error seems to indicate that the folder structure might not match what's described above.

Also do note that the imagenet_runner was hard-coded for using fixed 224x224 pngs as input, so you'll probably have to create your own runner for arcface, especially since it seems that the model can accept variable-sized images (so you'll have to manually set the input tensor dimensions – you can look at the upstream onnxruntime api docs for more info).

Also last point for the mxnet derived models is that I've found them to be very finnicky and sensitive to quantization. You might want to first get the non-quantized (i.e. floating point) model running first (for which you'll likely need to write your own runner), at which point you can start playing around with quantization. The accuracy for ResNet models is also a bit poor at the moment because we're limited to power-of-2 scale factors. I've also never really tried the quantization/calibration scripts with non-imagenet models (although they should presumably work since they were modified from microsoft's upstream implementation), so I'd be interested to hear how it turns out!

pranav-prakash commented 4 years ago

Oh another thing you can try is to run it with qemu. If you download qemu and enable riscv-user space emulation via ./configure --disable-system --target-list=riscv64-linux-user, you should get a qemu-riscv64 binary that you can use instead of spike (just substitute spike --extension=gemmini pk for qemu-riscv64). Of course if you use qemu you'll have to use -x 0 (don't use gemmini instructions, i.e. emulate them via cpu only) but this should help us debug where the issue lies.

J-Zenk commented 4 years ago

Thanks a lot. The md5 value does not match. So I re-download the onnx file with a proxy, and it works. And I messed the structure of the arcface folder so I got such error. Now inside the folder I have a onnx file and 3 folders named test_data_set_0 , test_data_set_1, test_data_set_2, each folder contains a input_0.pb and a output_0.pb. I tried to quantize the network again, it show as below. What causes this error?

Num cases 3, num inputs for each cases 1
2020-07-16 17:11:44.181980728 [E:onnxruntime:, sequential_executor.cc:281 Execute] Non-zero status code returned while running BatchNormalization node. Name:'bn0' Status Message: Invalid input scale: NumDimensions() != 3
Traceback (most recent call last):
  File "calibrate.py", line 379, in <module>
    main()
  File "calibrate.py", line 358, in main
    inputs, calib_mode)
  File "calibrate.py", line 118, in get_intermediate_outputs
    for j in range(num_input_names)}) for i in range(num_inputs)
  File "calibrate.py", line 118, in <listcomp>
    for j in range(num_input_names)}) for i in range(num_inputs)
  File "/usr/local/lib64/python3.6/site-packages/onnxruntime/capi/session.py", line 111, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running BatchNormalization node. Name:'bn0' Status Message: Invalid input scale: NumDimensions() != 3

pranav-prakash commented 4 years ago

Seems like this is relevant; please read through it (and any associated linked issues therein) and try the suggested fix?

https://github.com/onnx/models/issues/242

J-Zenk commented 4 years ago

I have tried the fix, and it gives me another error. It seems that the problem is caused by the network, too. Because I also tried googlenet in onnx model zoo and succeed. So maybe i will build a new network and convert it to onnx format and then quantize it. If I do so, Is there anything I should pay attention to, so that I can avoid some strange problems like what happens during the arcface's conversion? Thank you very much.

pranav-prakash commented 4 years ago

and it gives me another error.

What error did it give? I do recall that mxnet-based models had some batch-normalization version mismatch thing after quantization because they use some attribute that was removed in opsets newer than opset version 7 (and quantized operators require opset 10 or above). If that was the case you could try using this tool to convert it to a newer opset before running the quantization, but as you suggested it's probably a better idea to just export a new model from PyTorch with the latest opset.

Is there anything I should pay attention to, so that I can avoid some strange problems like what happens during the arcface's conversion

I think just making sure to export the latest opset version (anything >= version 10 should be fine) is sufficient. Let me know if you run into any issues though.

J-Zenk commented 4 years ago

What error did it give?

This is the error. It seems that it is not caused by opset verison, because the origin network is opset version 8 and have the same error as the opset version 9 one which is converted by convert_to_opset9.py.

Num cases 3, num inputs for each cases 1
{'conv0': (-2.7700576782226562, 1.8512709140777588), '_mulscalar0': (-1.0126389265060425, -0.9754691123962402), 'stage1_unit1_conv1': (-1.5255255699157715, 1.1363569498062134), 'stage1_unit1_bn1': (-2.677807331085205, 2.351346015930176), 'stage1_unit1_conv2': (-1.0140470266342163, 0.8351755142211914), 'stage1_unit1_relu1': (-0.022769352421164513, 1.9310780763626099), 'stage1_unit1_conv1sc': (-0.3902704417705536, 0.36174634099006653), 'relu0': (-0.5688446164131165, 1.088736653327942), 'stage1_unit2_conv1': (-0.9079145789146423, 1.004768967628479), 'stage1_unit2_bn1': (-0.7862108945846558, 0.780624508857727), 'stage1_unit2_conv2': (-1.0263770818710327, 0.8691070079803467), 'stage1_unit2_relu1': (-0.7567529678344727, 1.049529790878296), 'stage1_unit3_conv1': (-0.5032097697257996, 0.7926806211471558), 'stage1_unit3_bn1': (-1.4762139320373535, 1.992246389389038), 'stage1_unit3_conv2': (-0.4169164001941681, 0.9362981915473938), 'stage1_unit3_relu1': (3.6227927324716802e-09, 2.7148284912109375), 'stage2_unit1_conv1': (-0.580380916595459, 1.6314289569854736), 'stage2_unit1_bn1': (-1.9812438488006592, 4.54368257522583), 'stage2_unit1_conv2': (-1.1352336406707764, 1.2988908290863037), 'stage2_unit1_relu1': (-0.011583560146391392, 5.083901882171631), 'stage2_unit1_conv1sc': (-0.9727423191070557, 1.723375916481018), '_plus2': (-1.670774221420288, 10.384163856506348), 'stage2_unit2_conv1': (-0.2573625147342682, 0.13485988974571228), 'stage2_unit2_bn1': (-0.6998764872550964, 0.27929526567459106), 'stage2_unit2_conv2': (-0.24395932257175446, 0.2444741427898407), 'stage2_unit2_relu1': (3.544041060621339e-09, 1.6052155494689941), 'stage2_unit3_conv1': (-0.33123549818992615, 0.41484981775283813), 'stage2_unit3_bn1': (-0.8922042846679688, 0.544468343257904), 'stage2_unit3_conv2': (-0.5210393071174622, 0.2614585757255554), 'stage2_unit3_relu1': (-0.004366954322904348, 1.0961897373199463), 'stage2_unit4_conv1': (-0.21901430189609528, 0.6279692053794861), 'stage2_unit4_bn1': (-0.9523730278015137, 0.4922447204589844), 'stage2_unit4_conv2': (-0.24297407269477844, 0.27929195761680603), 'stage2_unit4_relu1': (-0.004497249145060778, 0.7889650464057922), 'stage2_unit5_conv1': (-0.397145539522171, 0.7217621207237244), 'stage2_unit5_bn1': (-0.6942195296287537, 0.5839004516601562), 'stage2_unit5_conv2': (-0.21408230066299438, 0.4311734437942505), 'stage2_unit5_relu1': (-0.026107193902134895, 0.5513373613357544), 'stage2_unit6_conv1': (-0.3651759624481201, 0.6158580780029297), 'stage2_unit6_bn1': (-0.5249855518341064, 0.7201350331306458), 'stage2_unit6_conv2': (-0.26098182797431946, 0.27301323413848877), 'stage2_unit6_relu1': (-0.0027470411732792854, 0.5985389947891235), 'stage2_unit7_conv1': (-0.27477845549583435, 0.5691264271736145), 'stage2_unit7_bn1': (-0.6672435998916626, 0.7075446248054504), 'stage2_unit7_conv2': (-0.19847100973129272, 0.2942486107349396), 'stage2_unit7_relu1': (-0.0025474284775555134, 0.44384798407554626), 'stage2_unit8_conv1': (-0.295512318611145, 0.42267584800720215), 'stage2_unit8_bn1': (-0.5995339155197144, 0.6014330387115479), 'stage2_unit8_conv2': (-0.166782408952713, 0.25694501399993896), 'stage2_unit8_relu1': (-0.0017675124108791351, 0.33467692136764526), 'stage2_unit9_conv1': (-0.41656625270843506, 0.37687447667121887), 'stage2_unit9_bn1': (-0.6179551482200623, 0.39605095982551575), 'stage2_unit9_conv2': (-0.2933601438999176, 0.25816017389297485), 'stage2_unit9_relu1': (2.3207581989481696e-07, 0.5152771472930908), 'stage2_unit10_conv1': (-0.4387154281139374, 0.45474547147750854), 'stage2_unit10_bn1': (-0.6467996835708618, 0.5094567537307739), 'stage2_unit10_conv2': (-0.21270819008350372, 0.2974368631839752), 'stage2_unit10_relu1': (2.644956111907959e-07, 0.35188809037208557), 'stage2_unit11_conv1': (-0.35708561539649963, 0.3530902862548828), 'stage2_unit11_bn1': (-0.5718078017234802, 0.4663313329219818), 'stage2_unit11_conv2': (-0.20177653431892395, 0.26275286078453064), 'stage2_unit11_relu1': (9.807602197042797e-08, 0.3139716386795044), 'stage2_unit12_conv1': (-0.2546583414077759, 0.31906643509864807), 'stage2_unit12_bn1': (-0.4647640883922577, 0.44840580224990845), 'stage2_unit12_conv2': (-0.18789871037006378, 0.14361216127872467), 'stage2_unit12_relu1': (9.652539034732399e-08, 0.3494085967540741), 'stage2_unit13_conv1': (-0.41568630933761597, 0.395785927772522), 'stage2_unit13_bn1': (-0.6449421048164368, 0.5393414497375488), 'stage2_unit13_conv2': (-0.3820760250091553, 0.6321218609809875), 'stage2_unit13_relu1': (1.5871721714688647e-08, 0.4499181807041168), 'stage3_unit1_conv1': (-0.5726640820503235, 0.827939510345459), 'stage3_unit1_bn1': (-0.7210996150970459, 0.8041184544563293), 'stage3_unit1_conv2': (-0.4858264923095703, 0.3447681665420532), 'stage3_unit1_relu1': (-0.018802566453814507, 0.411818265914917), 'stage3_unit1_conv1sc': (-1.9470412731170654, 1.312037467956543), '_plus15': (-4.541324615478516, 2.1670632362365723), 'stage3_unit2_conv1': (-0.22424371540546417, 0.14657652378082275), 'stage3_unit2_bn1': (-0.40753498673439026, 0.3977765142917633), 'stage3_unit2_conv2': (-0.17768806219100952, 0.2587220072746277), 'stage3_unit2_relu1': (2.494394664778743e-12, 0.39966338872909546), 'stage3_unit3_conv1': (-0.1841924488544464, 0.17269444465637207), 'stage3_unit3_bn1': (-0.3992823660373688, 0.3318554759025574), 'stage3_unit3_conv2': (-0.14911897480487823, 0.15426433086395264), 'stage3_unit3_relu1': (4.030409428423809e-08, 0.30615344643592834), 'stage3_unit4_conv1': (-0.20091697573661804, 0.2796117663383484), 'stage3_unit4_bn1': (-0.5242156386375427, 0.38267460465431213), 'stage3_unit4_conv2': (-0.15447556972503662, 0.17338211834430695), 'stage3_unit4_relu1': (-0.001559401280246675, 0.395333856344223), 'stage3_unit5_conv1': (-0.3620983958244324, 0.2256709337234497), 'stage3_unit5_bn1': (-0.47639572620391846, 0.5078409910202026), 'stage3_unit5_conv2': (-0.2632167935371399, 0.22298793494701385), 'stage3_unit5_relu1': (-0.004165561404079199, 0.34686315059661865), 'stage3_unit6_conv1': (-0.519700288772583, 0.40025827288627625), 'stage3_unit6_bn1': (-0.6167299747467041, 0.5394785404205322), 'stage3_unit6_conv2': (-0.2962507903575897, 0.286422997713089), 'stage3_unit6_relu1': (-0.003223944455385208, 0.49111926555633545), 'stage3_unit7_conv1': (-0.2245730757713318, 0.29485976696014404), 'stage3_unit7_bn1': (-0.5814918279647827, 0.7422151565551758), 'stage3_unit7_conv2': (-0.1640111804008484, 0.1288129687309265), 'stage3_unit7_relu1': (1.3635625831578957e-12, 0.36254796385765076), 'stage3_unit8_conv1': (-0.3018781542778015, 0.27658092975616455), 'stage3_unit8_bn1': (-0.5469644665718079, 0.5125359296798706), 'stage3_unit8_conv2': (-0.2527916729450226, 0.2789697051048279), 'stage3_unit8_relu1': (3.091239070274199e-11, 0.502379834651947), 'stage3_unit9_conv1': (-0.545325756072998, 0.24950751662254333), 'stage3_unit9_bn1': (-0.6723271012306213, 0.7246034741401672), 'stage3_unit9_conv2': (-0.15685269236564636, 0.11693911254405975), 'stage3_unit9_relu1': (-0.01037586573511362, 0.3197324275970459), 'stage3_unit10_conv1': (-0.5106210112571716, 0.30369535088539124), 'stage3_unit10_bn1': (-0.6339330077171326, 0.6078217625617981), 'stage3_unit10_conv2': (-0.20839078724384308, 0.21890105307102203), 'stage3_unit10_relu1': (-0.0024764223489910364, 0.3341023027896881), 'stage3_unit11_conv1': (-0.435392826795578, 0.3541238307952881), 'stage3_unit11_bn1': (-0.5874464511871338, 0.5702937245368958), 'stage3_unit11_conv2': (-0.301033616065979, 0.32429423928260803), 'stage3_unit11_relu1': (-0.0008741768542677164, 0.33039283752441406), 'stage3_unit12_conv1': (-0.37518972158432007, 0.4138906002044678), 'stage3_unit12_bn1': (-0.5808767080307007, 0.6330219507217407), 'stage3_unit12_conv2': (-0.3561904728412628, 0.43344947695732117), 'stage3_unit12_relu1': (-0.0017991125350818038, 0.4858672618865967), 'stage3_unit13_conv1': (-0.3747340142726898, 0.3277401030063629), 'stage3_unit13_bn1': (-0.5056686997413635, 0.8359493613243103), 'stage3_unit13_conv2': (-0.16904059052467346, 0.1480470448732376), 'stage3_unit13_relu1': (-0.011989030055701733, 0.38342130184173584), 'stage3_unit14_conv1': (-0.2563401460647583, 0.2480783462524414), 'stage3_unit14_bn1': (-0.390505313873291, 0.8186060786247253), 'stage3_unit14_conv2': (-0.10976225882768631, 0.09959360957145691), 'stage3_unit14_relu1': (8.161140385709587e-08, 0.321338951587677), 'stage3_unit15_conv1': (-0.25992459058761597, 0.33974120020866394), 'stage3_unit15_bn1': (-0.401766836643219, 0.5625117421150208), 'stage3_unit15_conv2': (-0.16411973536014557, 0.20697088539600372), 'stage3_unit15_relu1': (-0.033499810844659805, 0.33262568712234497), 'stage3_unit16_conv1': (-0.12597525119781494, 0.16341085731983185), 'stage3_unit16_bn1': (-0.642011821269989, 0.7743183374404907), 'stage3_unit16_conv2': (-0.15693651139736176, 0.11338046938180923), 'stage3_unit16_relu1': (9.637012077234886e-09, 0.3087193965911865), 'stage3_unit17_conv1': (-0.35468026995658875, 0.3821070194244385), 'stage3_unit17_bn1': (-0.6037321090698242, 0.6920698285102844), 'stage3_unit17_conv2': (-0.17673300206661224, 0.26448488235473633), 'stage3_unit17_relu1': (1.4634181866313156e-07, 0.464324027299881), 'stage3_unit18_conv1': (-0.3813590407371521, 0.2828254997730255), 'stage3_unit18_bn1': (-0.5927966833114624, 0.6548709273338318), 'stage3_unit18_conv2': (-0.17128826677799225, 0.1501263678073883), 'stage3_unit18_relu1': (-0.007886271923780441, 0.44737768173217773), 'stage3_unit19_conv1': (-0.22415035963058472, 0.2810840904712677), 'stage3_unit19_bn1': (-0.7576963901519775, 1.069539189338684), 'stage3_unit19_conv2': (-0.2111537754535675, 0.1566932052373886), 'stage3_unit19_relu1': (-0.003581683151423931, 0.5865971446037292), 'stage3_unit20_conv1': (-0.32629501819610596, 0.30639371275901794), 'stage3_unit20_bn1': (-0.7453984022140503, 0.7082696557044983), 'stage3_unit20_conv2': (-0.16008047759532928, 0.11353600025177002), 'stage3_unit20_relu1': (1.0354535788792418e-06, 0.45895808935165405), 'stage3_unit21_conv1': (-0.1636548638343811, 0.21434618532657623), 'stage3_unit21_bn1': (-0.6586788892745972, 0.8185128569602966), 'stage3_unit21_conv2': (-0.19033634662628174, 0.19410300254821777), 'stage3_unit21_relu1': (3.796392367139134e-10, 0.2997061014175415), 'stage3_unit22_conv1': (-0.12159842997789383, 0.1409245729446411), 'stage3_unit22_bn1': (-0.5207512974739075, 0.674064576625824), 'stage3_unit22_conv2': (-0.12496834993362427, 0.20494888722896576), 'stage3_unit22_relu1': (7.566335114006506e-08, 0.25298017263412476), 'stage3_unit23_conv1': (-0.11415666341781616, 0.14841502904891968), 'stage3_unit23_bn1': (-0.5651000142097473, 0.7120834589004517), 'stage3_unit23_conv2': (-0.09524092078208923, 0.0823959931731224), 'stage3_unit23_relu1': (8.248670724242402e-08, 0.20683155953884125), 'stage3_unit24_conv1': (-0.21861375868320465, 0.21948696672916412), 'stage3_unit24_bn1': (-0.4840283691883087, 0.8351663947105408), 'stage3_unit24_conv2': (-0.15638257563114166, 0.09795545041561127), 'stage3_unit24_relu1': (3.829651404885226e-07, 0.3478612005710602), 'stage3_unit25_conv1': (-0.16192227602005005, 0.17361211776733398), 'stage3_unit25_bn1': (-0.5178247690200806, 0.5179949998855591), 'stage3_unit25_conv2': (-0.07853621244430542, 0.09793967008590698), 'stage3_unit25_relu1': (1.5551707122085645e-07, 0.20840789377689362), 'stage3_unit26_conv1': (-0.20955918729305267, 0.1912509649991989), 'stage3_unit26_bn1': (-0.5183373689651489, 0.5781286358833313), 'stage3_unit26_conv2': (-0.2181871086359024, 0.08206260949373245), 'stage3_unit26_relu1': (1.273264871315405e-07, 0.260122686624527), 'stage3_unit27_conv1': (-0.23040422797203064, 0.22308039665222168), 'stage3_unit27_bn1': (-0.47349828481674194, 0.4709331691265106), 'stage3_unit27_conv2': (-0.06822191178798676, 0.06991805881261826), 'stage3_unit27_relu1': (3.3852268188638845e-07, 0.2227790206670761), 'stage3_unit28_conv1': (-0.287468820810318, 0.35272350907325745), 'stage3_unit28_bn1': (-0.6915250420570374, 0.4747917354106903), 'stage3_unit28_conv2': (-0.07850893586874008, 0.2568800151348114), 'stage3_unit28_relu1': (3.09994362623911e-07, 0.36390799283981323), 'stage3_unit29_conv1': (-0.2683931887149811, 0.29413872957229614), 'stage3_unit29_bn1': (-0.609352707862854, 0.6880134344100952), 'stage3_unit29_conv2': (-0.26083219051361084, 0.11767230927944183), 'stage3_unit29_relu1': (-0.019393207505345345, 0.22927677631378174), 'stage3_unit30_conv1': (-0.22444383800029755, 0.24487926065921783), 'stage3_unit30_bn1': (-0.45070981979370117, 0.5351380109786987), 'stage3_unit30_conv2': (-0.09777919948101044, 0.07721813768148422), 'stage3_unit30_relu1': (-0.003372140461578965, 0.26520460844039917), 'stage4_unit1_conv1': (-0.3942401111125946, 0.49705347418785095), 'stage4_unit1_bn1': (-0.7365911602973938, 0.5604126453399658), 'stage4_unit1_conv2': (-0.10353611409664154, 0.09256849437952042), 'stage4_unit1_relu1': (6.8581509360399195e-09, 0.3770582675933838), 'stage4_unit1_conv1sc': (-1.0546667575836182, 1.5163278579711914), '_plus45': (-3.2545576095581055, 4.6348419189453125), 'stage4_unit2_conv1': (-0.13596974313259125, 0.13410010933876038), 'stage4_unit2_bn1': (-0.20413748919963837, 0.25919032096862793), 'stage4_unit2_conv2': (-0.034581054002046585, 0.03934990242123604), 'stage4_unit2_relu1': (-0.0, 0.14119192957878113), 'stage4_unit3_conv1': (0.0, 0.0), 'stage4_unit3_bn1': (-7.121272678836931e-39, 6.233115699163219e-39), 'stage4_unit3_conv2': (0.0, 0.0), 'stage4_unit3_relu1': (-0.0, 1.975722934716239e-39)}
{'conv0': [0, 0.021811477781280758], '_mulscalar0': [0, 0.007973534854378288], 'stage1_unit1_conv1': [0, 0.012012012361541508], 'stage1_unit1_bn1': [0, 0.021085097095159096], 'stage1_unit1_conv2': [0, 0.007984622256962334], 'stage1_unit1_relu1': [0, 0.015205339183957558], 'stage1_unit1_conv1sc': [0, 0.003072995604492548], 'relu0': [0, 0.008572729553763323], 'stage1_unit2_conv1': [0, 0.007911566674239992], 'stage1_unit2_bn1': [0, 0.00619063696523351], 'stage1_unit2_conv2': [0, 0.008081709306071124], 'stage1_unit2_relu1': [0, 0.008264014101403904], 'stage1_unit3_conv1': [0, 0.006241579694072093], 'stage1_unit3_bn1': [0, 0.015686979444008174], 'stage1_unit3_conv2': [0, 0.007372426705097589], 'stage1_unit3_relu1': [0, 0.021376602292999508], 'stage2_unit1_conv1': [0, 0.012845897299098217], 'stage2_unit1_bn1': [0, 0.03577702815138449], 'stage2_unit1_conv2': [0, 0.010227486843199242], 'stage2_unit1_relu1': [0, 0.040030723481666385], 'stage2_unit1_conv1sc': [0, 0.013569889106149749], '_plus2': [0, 0.08176506973627046], 'stage2_unit2_conv1': [0, 0.002026476493970616], 'stage2_unit2_bn1': [0, 0.005510838482323594], 'stage2_unit2_conv2': [0, 0.0019249932503137062], 'stage2_unit2_relu1': [0, 0.01263949251550389], 'stage2_unit3_conv1': [0, 0.0032665339980538434], 'stage2_unit3_bn1': [0, 0.007025230587936762], 'stage2_unit3_conv2': [0, 0.004102671709586316], 'stage2_unit3_relu1': [0, 0.008631415254487766], 'stage2_unit4_conv1': [0, 0.004944639412436899], 'stage2_unit4_bn1': [0, 0.007499000218909557], 'stage2_unit4_conv2': [0, 0.0021991492725732757], 'stage2_unit4_relu1': [0, 0.006212323200045608], 'stage2_unit5_conv1': [0, 0.005683166304911215], 'stage2_unit5_bn1': [0, 0.005466295508887824], 'stage2_unit5_conv2': [0, 0.0033950664865689015], 'stage2_unit5_relu1': [0, 0.0043412390656358615], 'stage2_unit6_conv1': [0, 0.004849276204747478], 'stage2_unit6_bn1': [0, 0.005670354591579888], 'stage2_unit6_conv2': [0, 0.002149710505027471], 'stage2_unit6_relu1': [0, 0.0047129054707805], 'stage2_unit7_conv1': [0, 0.004481310450185941], 'stage2_unit7_bn1': [0, 0.005571217518153153], 'stage2_unit7_conv2': [0, 0.002316918194763304], 'stage2_unit7_relu1': [0, 0.0034948660163428838], 'stage2_unit8_conv1': [0, 0.003328156283521277], 'stage2_unit8_bn1': [0, 0.004735693218201164], 'stage2_unit8_conv2': [0, 0.0020231890866136927], 'stage2_unit8_relu1': [0, 0.0026352513493515376], 'stage2_unit9_conv1': [0, 0.0032800492339246853], 'stage2_unit9_bn1': [0, 0.004865788568661908], 'stage2_unit9_conv2': [0, 0.002309922392912737], 'stage2_unit9_relu1': [0, 0.0040573003723865415], 'stage2_unit10_conv1': [0, 0.00358067300375991], 'stage2_unit10_bn1': [0, 0.005092910894258755], 'stage2_unit10_conv2': [0, 0.00234202254475571], 'stage2_unit10_relu1': [0, 0.002770772365134532], 'stage2_unit11_conv1': [0, 0.0028116977590275562], 'stage2_unit11_bn1': [0, 0.004502423635617954], 'stage2_unit11_conv2': [0, 0.0020689201636577216], 'stage2_unit11_relu1': [0, 0.0024722176273976725], 'stage2_unit12_conv1': [0, 0.0025123341346350242], 'stage2_unit12_bn1': [0, 0.0036595597511201393], 'stage2_unit12_conv2': [0, 0.001479517404488691], 'stage2_unit12_relu1': [0, 0.002751248793339166], 'stage2_unit13_conv1': [0, 0.00327312054596548], 'stage2_unit13_bn1': [0, 0.005078284289893203], 'stage2_unit13_conv2': [0, 0.004977337488039272], 'stage2_unit13_relu1': [0, 0.003542662840189896], 'stage3_unit1_conv1': [0, 0.0065192087428776295], 'stage3_unit1_bn1': [0, 0.006331641373671885], 'stage3_unit1_conv2': [0, 0.0038254054512564593], 'stage3_unit1_relu1': [0, 0.0032426635111410785], 'stage3_unit1_conv1sc': [0, 0.015331033646591066], '_plus15': [0, 0.03575846153920091], 'stage3_unit2_conv1': [0, 0.0017656985464997179], 'stage3_unit2_bn1': [0, 0.0032089369034203957], 'stage3_unit2_conv2': [0, 0.0020371811596427377], 'stage3_unit2_relu1': [0, 0.0031469558167645313], 'stage3_unit3_conv1': [0, 0.001450334242948397], 'stage3_unit3_bn1': [0, 0.0031439556380895178], 'stage3_unit3_conv2': [0, 0.0012146797705823043], 'stage3_unit3_relu1': [0, 0.0024106570585506167], 'stage3_unit4_conv1': [0, 0.002201667451483058], 'stage3_unit4_bn1': [0, 0.004127682193996399], 'stage3_unit4_conv2': [0, 0.0013652135302701335], 'stage3_unit4_relu1': [0, 0.0031128650105844333], 'stage3_unit5_conv1': [0, 0.0028511684710585228], 'stage3_unit5_bn1': [0, 0.003998747960788997], 'stage3_unit5_conv2': [0, 0.0020725731774577944], 'stage3_unit5_relu1': [0, 0.0027312059102095956], 'stage3_unit6_conv1': [0, 0.004092128258051835], 'stage3_unit6_bn1': [0, 0.004856141533438615], 'stage3_unit6_conv2': [0, 0.002332683388642439], 'stage3_unit6_relu1': [0, 0.0038670808311522475], 'stage3_unit7_conv1': [0, 0.0023217304485050713], 'stage3_unit7_bn1': [0, 0.005844213831143116], 'stage3_unit7_conv2': [0, 0.0012914266173295148], 'stage3_unit7_relu1': [0, 0.0028547083768318956], 'stage3_unit8_conv1': [0, 0.0023769933407700905], 'stage3_unit8_bn1': [0, 0.0043068068234000615], 'stage3_unit8_conv2': [0, 0.002196611851219117], 'stage3_unit8_relu1': [0, 0.00395574672954289], 'stage3_unit9_conv1': [0, 0.0042939035911259684], 'stage3_unit9_bn1': [0, 0.00570553916645801], 'stage3_unit9_conv2': [0, 0.001235060569808239], 'stage3_unit9_relu1': [0, 0.0025175781700554796], 'stage3_unit10_conv1': [0, 0.004020637883914737], 'stage3_unit10_bn1': [0, 0.0049915984859616735], 'stage3_unit10_conv2': [0, 0.0017236303391419058], 'stage3_unit10_relu1': [0, 0.0026307267936195914], 'stage3_unit11_conv1': [0, 0.003428289974768331], 'stage3_unit11_bn1': [0, 0.004625562607772707], 'stage3_unit11_conv2': [0, 0.0025534979471071497], 'stage3_unit11_relu1': [0, 0.0026015184057040478], 'stage3_unit12_conv1': [0, 0.0032589811039721874], 'stage3_unit12_bn1': [0, 0.0049844248088326045], 'stage3_unit12_conv2': [0, 0.003412988007537962], 'stage3_unit12_relu1': [0, 0.0038257264715480053], 'stage3_unit13_conv1': [0, 0.0029506615297062192], 'stage3_unit13_bn1': [0, 0.006582278435624491], 'stage3_unit13_conv2': [0, 0.0013310282718478226], 'stage3_unit13_relu1': [0, 0.0030190653688325656], 'stage3_unit14_conv1': [0, 0.0020184263469666007], 'stage3_unit14_bn1': [0, 0.006445717154525396], 'stage3_unit14_conv2': [0, 0.0008642697545487111], 'stage3_unit14_relu1': [0, 0.0025302279652572993], 'stage3_unit15_conv1': [0, 0.0026751275606981414], 'stage3_unit15_bn1': [0, 0.004429226315866305], 'stage3_unit15_conv2': [0, 0.0016296920109921554], 'stage3_unit15_relu1': [0, 0.0026190998986011416], 'stage3_unit16_conv1': [0, 0.0012866996639356839], 'stage3_unit16_bn1': [0, 0.00609699478299599], 'stage3_unit16_conv2': [0, 0.001235720562183951], 'stage3_unit16_relu1': [0, 0.0024308613904817835], 'stage3_unit17_conv1': [0, 0.0030087166883814055], 'stage3_unit17_bn1': [0, 0.005449368728427436], 'stage3_unit17_conv2': [0, 0.0020825581287774514], 'stage3_unit17_relu1': [0, 0.003656094703148669], 'stage3_unit18_conv1': [0, 0.0030028270924185203], 'stage3_unit18_bn1': [0, 0.005156463994754581], 'stage3_unit18_conv2': [0, 0.001348726510062931], 'stage3_unit18_relu1': [0, 0.00352265891127699], 'stage3_unit19_conv1': [0, 0.0022132605548918715], 'stage3_unit19_bn1': [0, 0.008421568419989638], 'stage3_unit19_conv2': [0, 0.0016626281531776969], 'stage3_unit19_relu1': [0, 0.004618875154360073], 'stage3_unit20_conv1': [0, 0.002569252111780362], 'stage3_unit20_bn1': [0, 0.005869278757590947], 'stage3_unit20_conv2': [0, 0.0012604762015380258], 'stage3_unit20_relu1': [0, 0.0036138432232413705], 'stage3_unit21_conv1': [0, 0.0016877652387919388], 'stage3_unit21_bn1': [0, 0.006444983125671627], 'stage3_unit21_conv2': [0, 0.0015283700988048643], 'stage3_unit21_relu1': [0, 0.0023598905623428467], 'stage3_unit22_conv1': [0, 0.0011096423066507175], 'stage3_unit22_bn1': [0, 0.005307595091541921], 'stage3_unit22_conv2': [0, 0.0016137707655824076], 'stage3_unit22_relu1': [0, 0.001991969863260825], 'stage3_unit23_conv1': [0, 0.0011686222759757455], 'stage3_unit23_bn1': [0, 0.005606956369294895], 'stage3_unit23_conv2': [0, 0.0007499285100951908], 'stage3_unit23_relu1': [0, 0.00162859495699875], 'stage3_unit24_conv1': [0, 0.0017282438325131033], 'stage3_unit24_bn1': [0, 0.006576113344177487], 'stage3_unit24_conv2': [0, 0.0012313588632373359], 'stage3_unit24_relu1': [0, 0.002739064571425671], 'stage3_unit25_conv1': [0, 0.0013670245493490865], 'stage3_unit25_bn1': [0, 0.004078700786500465], 'stage3_unit25_conv2': [0, 0.0007711785046134408], 'stage3_unit25_relu1': [0, 0.0016410070376133355], 'stage3_unit26_conv1': [0, 0.0016500723408901785], 'stage3_unit26_bn1': [0, 0.004552193983333317], 'stage3_unit26_conv2': [0, 0.0017180087294165543], 'stage3_unit26_relu1': [0, 0.002048210130901787], 'stage3_unit27_conv1': [0, 0.0018142065194648083], 'stage3_unit27_bn1': [0, 0.0037283329513129286], 'stage3_unit27_conv2': [0, 0.0005505358961623484], 'stage3_unit27_relu1': [0, 0.0017541655170635914], 'stage3_unit28_conv1': [0, 0.0027773504651437595], 'stage3_unit28_bn1': [0, 0.005445079071315255], 'stage3_unit28_conv2': [0, 0.002022677284526074], 'stage3_unit28_relu1': [0, 0.0028654172664552223], 'stage3_unit29_conv1': [0, 0.0023160529887582375], 'stage3_unit29_bn1': [0, 0.005417428617402325], 'stage3_unit29_conv2': [0, 0.002053796775697723], 'stage3_unit29_relu1': [0, 0.001805328947352612], 'stage3_unit30_conv1': [0, 0.0019281831547969907], 'stage3_unit30_bn1': [0, 0.004213685125816526], 'stage3_unit30_conv2': [0, 0.0007699149565433892], 'stage3_unit30_relu1': [0, 0.0020882252633102295], 'stage4_unit1_conv1': [0, 0.0039138068833689055], 'stage4_unit1_bn1': [0, 0.0057999303960424705], 'stage4_unit1_conv2': [0, 0.0008152449928869413], 'stage4_unit1_relu1': [0, 0.002968962736955778], 'stage4_unit1_conv1sc': [0, 0.011939589432844027], '_plus45': [0, 0.03649481825941191], 'stage4_unit2_conv1': [0, 0.0010706278986818208], 'stage4_unit2_bn1': [0, 0.0020408686690443142], 'stage4_unit2_conv2': [0, 0.00030984175134831526], 'stage4_unit2_relu1': [0, 0.0011117474769982766], 'stage4_unit3_conv1': [0, 1], 'stage4_unit3_bn1': [0, 5.607301321918843e-41], 'stage4_unit3_conv2': [0, 1], 'stage4_unit3_relu1': [0, 1.555687350170267e-41]}
Warning: The original model opset version is 9, which does not support quantized operators.
            The opset version of quantized model will be set to 10. Use onnx model checker to verify model after quantization.
Traceback (most recent call last):
  File "calibrate.py", line 379, in <module>
    main()
  File "calibrate.py", line 372, in main
    symmetric_weight=args.mode == 'int8')
  File "/home/zenk/onnxruntime-riscv/systolic_runner/quantization/quantize.py", line 1418, in quantize
    quantizer.quantize_model()
  File "/home/zenk/onnxruntime-riscv/systolic_runner/quantization/quantize.py", line 312, in quantize_model
    new_list += self._quantize_convolution(node, new_list)
  File "/home/zenk/onnxruntime-riscv/systolic_runner/quantization/quantize.py", line 1296, in _quantize_convolution
    return self._quantize_convolution_qlinear_ops(node, new_nodes_list)
  File "/home/zenk/onnxruntime-riscv/systolic_runner/quantization/quantize.py", line 1188, in _quantize_convolution_qlinear_ops
    self._get_quantization_params(node.output[0])
  File "/home/zenk/onnxruntime-riscv/systolic_runner/quantization/quantize.py", line 644, in _get_quantization_params
    scale_values = [params[1].item()]
AttributeError: 'int' object has no attribute 'item'

pranav-prakash commented 4 years ago

Hm that's very weird. Looks like it's a bug in the quantization script then. I can reproduce this on my end so I'll take a look and see if I can figure out what's going on.

If you'd also like to try debugging this, maybe you could surround

      zero_point_values = [params[0].item()]

with a try-except and set a pdb breakpoint in the except. That way you can print out the parameters and take a look. It seems weird that params[1] exists, but the type isn't what's expected.

pranav-prakash commented 4 years ago

Ok fixed it! It's a 1 line change :)

https://github.com/pranav-prakash/onnxruntime-riscv/commit/e12f338c469ef09be1c7b8e950aed5202c6fed22

The issue was that the calibration script was returning a non-numpy type in the edge-case where rmin == rmax as it was calculating the scale for the quantization parameters. After this fix it successfully saves the quantized model.

Please keep me posted on your results. I'd be interested to see how well the quantized model performs in terms of accuracy. There's a good chance the accuracy might be off at first due to several reasons

Incorrect preprocessing step during quantization: I think it's possible that the sample data in input_0.pb has already been preprocessed, so you might want to try --data_preprocess=None when running quantization script
Incorrect preprocessing step during inference: They mention some preprocessing needed before feeding in the image to mxnet, but it's not clear to me if mxnet itself does further preprocessing of subtracting out the channel mean and dividing by standard deviation as most imagenet models seem to do. You might have to play around with this (you could verify by running the floating point model through a runner you create and seeing if results make sense).
The quantized model not responding well to rounding scales to power-of-2 factors: If you get to a point where you can run the floating point model through the script successfully but the quantized model behaves weirdly, it's possible that you're running into this issue. What you can try in this case is quantizing to uint8 instead of int8 which uses the native uint8 support in onnxruntime and doesn't suffer from the power-of-2 scale issue. That will quickly tell you if the issue is with the quantization step or with the inference step.

J-Zenk commented 4 years ago

Sorry for the late reply. The network now can be quantized. But actually I have not run the network successfully. There are some errors and seems caused by the nerual network. The first error is when running the network with the unmodified runner. And this is the message.

Gemmini extension configured with:
    dim = 16
bbl loader
Loaded runner program
Using systolic in mode 1
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=data, type=1, num_dims=4: [1, 3, 112, 112, ]
Number of outputs = 1
Output 0 : name=fc1, type=1, num_dims=2: [1, 512, ]
Loading image
Image dimensions: 224 224 3
First few image values 1.187174 1.426920 1.255673
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 12544, 27)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 3136, 64)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 12544, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 3136, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 3136, 576)
Called into systolic matmul!

......

Using accelerated matmul with dimensions (512, 49, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (512, 49, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (512, 49, 4608)
Called into systolic matmul!
1970-01-01 08:00:01.975564019 [E:onnxruntime:, sequential_executor.cc:277 Execute] Non-zero status code returned while running QLinearConv node. Name:'stage4_unit3_conv1_quant' Status Message: Divisor passed to systolic matmul must be power of 2
terminate called after throwing an instance of 'Ort::Exception'
  what():  Non-zero status code returned while running QLinearConv node. Name:'stage4_unit3_conv1_quant' Status Message: Divisor passed to systolic matmul must be power of 2
bad syscall #131!

And I tried to use the arcface_validation that they provide to measure the accuracy. But the quantized network cannot be loaded by the vaildation program. I tried to measure the origin network, too. And it still failed. It gives the information like this: Cannot broadcast gamma to data. gamma: [1,64,1,1], data: [1,64,112,112]. I find a issue about this and it seems that there is no solution yet. I also noticed the mlperf benchmark that mentioned by the readme document. And it seems do not support this network. I do not know if the runner's fixed input size can cause the first error(I still use the 224*224 picture to run). If not, this network may not suitable to continue. Thanks a lot for your helping.

pranav-prakash commented 4 years ago

The first error is when running the network with the unmodified runner Status Message: Divisor passed to systolic matmul must be power of 2

That's an error that shouldn't happen since we make sure to round to the nearest power. Can you add a print within this function https://github.com/pranav-prakash/onnxruntime-riscv/blob/4c7a0ad5c94fb90bbcc5c876c73e940d3c67d37d/onnxruntime/core/providers/systolic/helper/helper.h to see what the input and output are?

Edit: Ok I can reproduce this as well. I'll investigate and see what's happening.

Also this is unrelated to the error you got, but it seems that the network takes as input a 112x112 image that is the result of preprocessing as described here. You should be able to feed your 224x224 image through the preprocess script and then change the runner script to load 112x112 images. I'm not sure why the arcface documentation says "There are no constraints on the size of the image" when the model seems to expect a 112x112 image and the training data also used 112x112 images.

As for the mlperf benchmark, I have not tried that (that comes from microsoft), and it will likely not work with the gemmini backend anyway.

Finally, with regard to

I tried to measure the origin network, too. And it still failed.

Maybe it's better to export a clean model from pytorch? It seems the original model might be broken as described in the issue you linked. Alternatively maybe try upgrading the opset version using convert_to_opset9.py and see if it helps?

pranav-prakash commented 4 years ago

Ok found the issue. It's again an issue with the calibration/quantization script and in the same line as before. In this case rmin and rmax were both on the order of 1E-41 so the scale was essentially 0 and it overflowed an int when dividing by it. Fix is to not compare the floats directly but instead use np.isclose which has a default tolerance of 1E-8 which should be fine. Thanks for discovering these bugs!

Fixed in commit https://github.com/pranav-prakash/onnxruntime-riscv/commit/6475932331f1a0f223221eb40c79959b0f58741f (and a further correction to that fix in https://github.com/pranav-prakash/onnxruntime-riscv/commit/c0577996c8ac73075eea6f250e67bbf86631848e)

After this I can successfully run the model using the runner script (I have not tried post-processing the output so I don't know if it's accurate though).

I should probably also update the assertions in the cpp file since ORT_ENFORCE(X_scale_value != 0, "X_scale_value cannot be 0"); suffers from the same issue, but this shouldn't matter for now.

J-Zenk commented 4 years ago

Good, now I can run the quantized network. But I cannot run the orign network to get a comparison because of the Cannot broadcast gamma to data. gamma: [1,64,1,1], data: [1,64,112,112] problem. So I will try to build a new network and quantize it. If there are any problems later, I will open a new issue. Thank you very much for the days help.

ucb-bar / onnxruntime-riscv

Cannot run some network with spike #10