google-coral / pycoral

Python API for ML inferencing and transfer-learning on Coral devices
https://coral.ai
Apache License 2.0

Inconsistent behaviour on Dev Board (HIB Error) #65

Open alexw9988 opened 2 years ago

alexw9988 commented 2 years ago

Description

Hi all!

I'm using the Google Coral Dev Board and am experiencing very inconsistent results when trying to run simple benchmarking tests on quantized and converted segmentation models.

I am using a custom retrained MobileNetV2 (256x3072xRGB input) with a segmentation head. Sometimes it runs successfully, but other times it throws this error on invocation:

E driver/mmio_driver.cc:254] HIB Error. hib_error_status = 0000000000000008, hib_first_error_status = 0000000000000008
E driver/mmio_driver.cc:254] HIB Error. hib_error_status = 0000000000000008, hib_first_error_status = 0000000000000008

The issue is that I have no way to consistently reproduce the error. I can run my testing script several times in a row: roughly one in ten tries succeeds, and the rest fail with an error containing the above lines.

I attached three verbose log outputs from failed runs and one from a successful run. Note that the three fail logs have vastly different line counts. On failure, program execution just hangs forever, and I had to kill the terminal whenever it hit the errors.

success.log fail3.log fail2.log fail.log

Can anyone tell me what's going on? Also, I'm using the latest runtime version (14) and used the latest compiler version when creating the edgetpu model:

On Dev Board:

>>> print(pycoral.utils.edgetpu.get_runtime_version())
BuildLabel(COMPILER=6.3.0 20170516,DATE=redacted,TIME=redacted), RuntimeVersion(14)
$ apt list --installed | grep coral
libedgetpu1-std/coral-edgetpu-stable,now 16.0 arm64 [installed]
python3-pycoral/coral-edgetpu-stable,now 2.0.0 arm64 [installed]
python3-tflite-runtime/coral-edgetpu-stable,now 2.5.0.post1 arm64 [installed]

On Colab Notebook where compilation takes place:

$ edgetpu_compiler --version
Edge TPU Compiler version 16.0.384591198
Issue Type: Bug, Support
Operating System: Mendel Linux
Coral Device: Dev Board
Other Devices: No response
Programming Language: Python 3.7
Relevant Log Output: No response
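For reference, a minimal sketch of the kind of benchmarking loop described above, written against the pycoral API (this is not the attached testing script; the model path, random input, and iteration count are placeholder assumptions):

```python
# Hypothetical benchmark sketch, not the attached testing script.
import time

import numpy as np
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter

MODEL_PATH = 'model_uint8_edgetpu.tflite'  # placeholder path

interpreter = make_interpreter(MODEL_PATH)
interpreter.allocate_tensors()

# Fill the input tensor with random uint8 data matching the model's input size.
width, height = common.input_size(interpreter)
common.set_input(
    interpreter,
    np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8))

for i in range(10):
    start = time.perf_counter()
    interpreter.invoke()  # the HIB error / hang is reported to occur here
    print(f'run {i}: {(time.perf_counter() - start) * 1e3:.1f} ms')
```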
alexw9988 commented 2 years ago

Also, here are the versions for gasket and apex on the Dev Board in case that's interesting:

```bash
$ uname -a
Linux coral-board 4.14.98-imx #1 SMP PREEMPT Tue Nov 2 02:55:21 UTC 2021 aarch64 GNU/Linux
```

```bash
$ modinfo gasket
filename:       /lib/modules/4.14.98-imx/kernel/drivers/staging/gasket/gasket.ko
author:         Rob Springer
license:        GPL v2
version:        1.1.2
description:    Google Gasket driver framework
srcversion:     15745E4A74EF5DC52161CC8
depends:
staging:        Y
intree:         Y
name:           gasket
vermagic:       4.14.98-imx SMP preempt mod_unload aarch64
```

```bash
$ modinfo apex
filename:       /lib/modules/4.14.98-imx/kernel/drivers/staging/gasket/apex.ko
author:         John Joseph
license:        GPL v2
version:        1.1
description:    Google Apex driver
srcversion:     A480A03FFEEB8547EBFE5A0
alias:          pci:v00001AC1d0000089Asv*sd*bc*sc*i*
depends:        gasket
staging:        Y
intree:         Y
name:           apex
vermagic:       4.14.98-imx SMP preempt mod_unload aarch64
parm:           allow_power_save:int
parm:           allow_sw_clock_gating:int
parm:           allow_hw_clock_gating:int
parm:           bypass_top_level:int
parm:           trip_point0_temp:int
parm:           trip_point1_temp:int
parm:           trip_point2_temp:int
parm:           hw_temp_warn1:int
parm:           hw_temp_warn2:int
parm:           hw_temp_warn1_en:bool
parm:           hw_temp_warn2_en:bool
parm:           temp_poll_interval:int
```
hjonnala commented 2 years ago

Hi @alexw9988, can you please share the model compilation logs (`edgetpu_compiler -s model.tflite`)?

alexw9988 commented 2 years ago

@hjonnala Here you go:

Edge TPU Compiler version 16.0.384591198
Started a compilation timeout timer of 180 seconds.

Model compiled successfully in 91877 ms.

Input model: /content/gdrive/MyDrive/models/new/protbuf_white_onelabel_v2_full_size/model_uint8.tflite
Input size: 3.57MiB
Output model: /content/gdrive/MyDrive/models/new/protbuf_white_onelabel_v2_full_size/edgetpuTEST/model_uint8_edgetpu.tflite
Output size: 26.09MiB
On-chip memory used for caching model parameters: 3.99MiB
On-chip memory remaining for caching model parameters: 2.33MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 90
Operation log: /content/gdrive/MyDrive/models/new/protbuf_white_onelabel_v2_full_size/edgetpuTEST/model_uint8_edgetpu.log

Operator                       Count      Status

ADD                            10         Mapped to Edge TPU
LOGISTIC                       1          Mapped to Edge TPU
CONCATENATION                  5          Mapped to Edge TPU
PAD                            4          Mapped to Edge TPU
DEPTHWISE_CONV_2D              17         Mapped to Edge TPU
TRANSPOSE_CONV                 5          Mapped to Edge TPU
CONV_2D                        46         Mapped to Edge TPU
QUANTIZE                       2          Mapped to Edge TPU
Compilation child process completed within timeout period.
Compilation succeeded!
hjonnala commented 2 years ago

@alexw9988 Most probably it's due to the model input size (256x3072x3). Can you try a reduced input size? Please check this link for example input sizes: https://coral.ai/models/all/

If you have no concerns with sharing the model, please share the input CPU tflite model. Thanks.
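As a side note, a quick way to confirm a CPU tflite model's input shape before recompiling is to read its tensor details with tflite_runtime (a sketch; the file name is a placeholder):

```python
# Inspect input/output tensor shapes of a (CPU) .tflite model.
# 'model_uint8.tflite' is a placeholder file name.
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path='model_uint8.tflite')
for detail in interpreter.get_input_details():
    print('input :', detail['name'], detail['shape'], detail['dtype'])
for detail in interpreter.get_output_details():
    print('output:', detail['name'], detail['shape'], detail['dtype'])
```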

alexw9988 commented 2 years ago

@hjonnala It did in fact work with smaller input sizes (e.g. 256x256); however, in theory there should not be any issues with larger input sizes, right?

Here's the model: model_white_256x3072_v2.tflite.zip

Thanks a lot!

hjonnala commented 2 years ago

I am able to run benchmark tests on the Dev Board. Can you please share the test scripts that you are working with?

Edit: Now I am able to see HIB errors occasionally.

https://github.com/hjonnala/snippets/blob/main/devboard/single_model_benchmark

```
mendel@orange-horse:~$ ./single_model_benchmark -model model_white_256x3072_v2.tflite
2021-12-10 17:06:01
Running ./single_model_benchmark
Run on (4 X 1500 MHz CPU s)
Load Average: 0.10, 0.04, 0.01
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model          459 ms         17.6 ms           13 model_white_256x3072_v2.tflite
```

```
mendel@orange-horse:~$ ./single_model_benchmark -model model_white_256x3072_v2.tflite
2021-12-10 17:07:29
Running ./single_model_benchmark
Run on (4 X 1500 MHz CPU s)
Load Average: 0.02, 0.03, 0.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model          458 ms         17.4 ms           13 model_white_256x3072_v2.tflite
```

```
mendel@orange-horse:~$ ./single_model_benchmark -model model_white_256x3072_v2.tflite
2021-12-10 17:07:37
Running ./single_model_benchmark
Run on (4 X 1500 MHz CPU s)
Load Average: 0.02, 0.02, 0.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model          457 ms         17.0 ms           13 model_white_256x3072_v2.tflite
```
alexw9988 commented 2 years ago

@hjonnala Glad I'm not the only one ;) Do you have any idea what's going on?

hjonnala commented 2 years ago

Hi @alexw9988, according to the error status value 0x8, it's an internal chip AXI bus write response error, which suggests that there is some physical hardware problem. It might be that the chip is defective, or that it's having clock and/or power issues.

I think it's due to clock issues, since it only happens occasionally. Please make sure you are providing a 5V/3A power supply to the Dev Board.
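Not advice from this thread, but as an illustrative stopgap while the hardware cause is investigated, each invocation can be run in a worker process with a timeout so that an occasional hang does not freeze the whole test run (a sketch only; the model path and timeout value are assumptions):

```python
# Illustrative workaround sketch: kill a hung invocation after a timeout
# instead of letting it block the terminal.
import multiprocessing as mp

MODEL_PATH = 'model_uint8_edgetpu.tflite'  # placeholder path


def _invoke_once(model_path):
    # Imports happen inside the child so each run gets a fresh interpreter.
    import numpy as np
    from pycoral.adapters import common
    from pycoral.utils.edgetpu import make_interpreter

    interpreter = make_interpreter(model_path)
    interpreter.allocate_tensors()
    width, height = common.input_size(interpreter)
    common.set_input(
        interpreter,
        np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8))
    interpreter.invoke()


if __name__ == '__main__':
    for i in range(10):
        proc = mp.Process(target=_invoke_once, args=(MODEL_PATH,))
        proc.start()
        proc.join(timeout=30)  # give up after 30 s if the invocation hangs
        if proc.is_alive():
            proc.terminate()
            proc.join()
            print(f'run {i}: timed out (possible HIB error / hang)')
        else:
            print(f'run {i}: exit code {proc.exitcode}')
```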

alexw9988 commented 2 years ago

@hjonnala I'm using a proper 5V/3A power supply and the board is just a few weeks old - so I suppose there's not really much I can do ...

Would it help to try and reduce the CPU clock speed a bit?

liv-kuka commented 2 years ago

Hi @alexw9988, @hjonnala, I have the same issue: a HIB error with status 0x8. Did you find any solution?

hjonnala commented 2 years ago

@liv-kuka can you please share the model here and let us know which Coral product you are running the model on? Thanks!

liv-kuka commented 2 years ago

I'm using the Coral Dev Board with a yolov5s model. The model was exported using export.py from the YOLO project and then compiled with edgetpu_compiler. I'm trying to run inference with detect.py from the YOLO project (I also tried detect.py from the google-coral/examples-camera repo but got an error). If I run the model with an input size of 480x480 or 640x640 I get the HIB error, but with 416x416 and lower, inference runs fine. Note: I have been using 2 Dev Boards, and on one of them the HIB error always appears earlier.

640x640 model: yolov5s-int8_edgetpu.zip

hjonnala commented 2 years ago

HIB error with status 0x8

This error on the Dev Board is mostly related to stress from extensive I/O operations. Please reduce the model input size or change the model itself.