google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

HIB error and negative reported temperatures whenever I try to use new PCIe edge TPU #741

Closed kevinmilner closed 1 year ago

kevinmilner commented 1 year ago

Description

Hello, I have been trying without success to get my new mini PCIe edge TPU to work. I am debugging with just the simple classify_image.py example.

My computer (an HP Z230 workstation) doesn't have any mPCIe slots, so I am using the Ableconn PEX-MP117 adapter; many people on Amazon have reported success using this adapter with the Coral.

This system boots TrueNAS SCALE Bluefin, which is a Debian-based system. I installed the drivers and such directly in the host OS, so I'm running on bare metal.

Here's the output when I try to run the example:

kevin@truenas:~/coral/pycoral$ python3 examples/classify_image.py \
--model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite \
--labels test_data/inat_bird_labels.txt \
--input test_data/parrot.jpg
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
E driver/mmio_driver.cc:254] HIB Error. hib_error_status = ffffffffffffffff, hib_first_error_status = ffffffffffffffff

It then hangs indefinitely. I tried the fix here for similar errors, but gasket.dma_bit_mask=32 had no effect (and I'm on x86-64, not arm64).
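A note on the error values: a status register that reads as all ones usually suggests the MMIO read itself failed (on PCIe, reads from a device that has stopped responding commonly come back as all 1s) rather than the device reporting specific error bits. The following is a minimal Python sketch, not part of libedgetpu, that extracts the two values from the log line and checks for that pattern. The function name, the regex, and the all-ones interpretation are assumptions made for illustration.

```python
import re

# Regex for the line logged by driver/mmio_driver.cc (format copied from the output above).
HIB_RE = re.compile(
    r"hib_error_status = ([0-9a-fA-F]+), hib_first_error_status = ([0-9a-fA-F]+)"
)

ALL_ONES_64 = 0xFFFFFFFFFFFFFFFF


def parse_hib_error(line):
    """Return (status, first_status, looks_like_failed_read), or None if no match."""
    m = HIB_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    first = int(m.group(2), 16)
    # Assumption: all ones in both registers means the read never reached the device.
    failed_read = status == ALL_ONES_64 and first == ALL_ONES_64
    return status, first, failed_read


line = ("E driver/mmio_driver.cc:254] HIB Error. hib_error_status = "
        "ffffffffffffffff, hib_first_error_status = ffffffffffffffff")
print(parse_hib_error(line))
```

If the third element is true, the individual error bits are probably not meaningful, and the PCIe link, slot, or adapter is a more likely place to look than the model or the Python code.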

Some interesting clues are that, upon a fresh reboot and before trying to run anything, I see the following:

kevin@truenas:~$ sudo lspci -vvv | grep MSI-X
    Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
kevin@truenas:~$ ll /dev/apex_0 
crw-rw---- 1 root apex 120, 0 Apr  5 14:51 /dev/apex_0
kevin@truenas:~$ cat /sys/class/apex/apex_0/temp
43550

...but, after I attempt to run the example, MSI-X becomes disabled and the temperature goes negative:

kevin@truenas:~$ sudo lspci -vvv | grep MSI-X
    Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
kevin@truenas:~$ cat /sys/class/apex/apex_0/temp
-89700
kevin@truenas:~$ cat /sys/class/apex/apex_0/status
ALIVE
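The two temperature readings can be compared with a short Python sketch. It assumes the sysfs value is in millidegrees Celsius (so 43550 would be 43.55 °C), which matches the plausible pre-failure reading but is not confirmed anywhere in this thread. The sanity bounds are hypothetical and only serve to flag a reading like -89700 as bogus.

```python
from pathlib import Path

TEMP_PATH = Path("/sys/class/apex/apex_0/temp")  # path taken from the output above


def parse_millicelsius(raw):
    """Convert a sysfs reading such as '43550' to degrees Celsius (assumed unit: m°C)."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c, low=0.0, high=110.0):
    """Hypothetical sanity bounds for a chip inside a running workstation."""
    return low <= temp_c <= high


def read_temp_c(path=TEMP_PATH):
    """Read the live value; only works on a host where the apex driver is loaded."""
    return parse_millicelsius(path.read_text())


# The two readings reported above, before and after the failed inference.
for raw in ("43550\n", "-89700\n"):
    temp_c = parse_millicelsius(raw)
    print(f"{temp_c:.2f} C, plausible: {is_plausible(temp_c)}")
```

An implausible value after the failure would be consistent with the driver decoding a garbage register read, rather than the chip actually being that cold.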

I see the following in dmesg:

[  226.102104] x86/PAT: python3:49371 map pfn RAM range req uncached-minus for [mem 0xb6b4c000-0xb6b4ffff], got write-back
[  230.408555] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  235.532478] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  240.648492] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  245.768476] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  250.892476] apex 0000:02:00.0: Apex performance not throttled due to temperature
... same message repeats

Any ideas? Thanks!

### Issue Type

Support

### Operating System

Linux

### Coral Device

Mini PCIe

### Other Devices

_No response_

### Programming Language

Python 3.9

### Relevant Log Output

```shell
kevin@truenas:~$ ls -l /dev/apex_0
crw-rw---- 1 root apex 120, 0 Apr 5 14:51 /dev/apex_0
kevin@truenas:~$ groups
kevin plugdev builtin_users apex
kevin@truenas:~$ lspci | grep TPU
02:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
kevin@truenas:~$ sudo modinfo apex
filename: /lib/modules/5.15.79+truenas/updates/dkms/apex.ko
author: John Joseph
license: GPL v2
version: 1.2
description: Google Apex driver
srcversion: 700E8BBBE9CC23C6EC17712
alias: pci:v00001AC1d0000089Asv*sd*bc*sc*i*
depends: gasket
retpoline: Y
name: apex
vermagic: 5.15.79+truenas SMP mod_unload modversions
parm: allow_power_save:int
parm: allow_sw_clock_gating:int
parm: allow_hw_clock_gating:int
parm: bypass_top_level:int
parm: trip_point0_temp:int
parm: trip_point1_temp:int
parm: trip_point2_temp:int
parm: hw_temp_warn1:int
parm: hw_temp_warn2:int
parm: hw_temp_warn1_en:bool
parm: hw_temp_warn2_en:bool
parm: temp_poll_interval:int
kevin@truenas:~$ sudo modinfo gasket
filename: /lib/modules/5.15.79+truenas/updates/dkms/gasket.ko
author: Rob Springer
license: GPL v2
version: 1.1.4
description: Google Gasket driver framework
srcversion: 2CA68DA0268ABC8C7117109
depends:
retpoline: Y
name: gasket
vermagic: 5.15.79+truenas SMP mod_unload modversions
parm: dma_bit_mask:int
kevin@truenas:~$ uname -r
5.15.79+truenas
kevin@truenas:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
Stepping: 3
CPU MHz: 800.000
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 6984.42
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 8 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm arat pln pts md_clear flush_l1d
```

hjonnala commented 1 year ago

> ...but, after I attempt to run the example, MSI-X becomes disabled and the temperature goes negative:

Hi, I haven't seen this type of behaviour so far. Unfortunately, I don't have any suggestions other than trying another Coral PCIe device or another host machine. Thanks!

kevinmilner commented 1 year ago

OK, I was able to test it on another machine and can confirm that the issue is with the original machine and not with the Coral or the PCIe adapter. Here is the working output on my other machine:

kevin@steel:~/coral$ sudo lspci -vvv | grep MSI-X
    Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
    Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=2 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
kevin@steel:~/coral$ cat /sys/class/apex/apex_0/temp
44550
kevin@steel:~/coral$ ./run_test.sh 
/home/kevin/coral/test/examples/classify_image.py:79: DeprecationWarning: ANTIALIAS is deprecated and will be removed in Pillow 10 (2023-07-01). Use LANCZOS or Resampling.LANCZOS instead.
  image = Image.open(args.input).convert('RGB').resize(size, Image.ANTIALIAS)
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
11.8ms
2.6ms
2.7ms
2.7ms
2.7ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.75781
kevin@steel:~/coral$ sudo lspci -vvv | grep MSI-X
    Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
    Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=2 Masked-
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
    Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
kevin@steel:~/coral$ cat /sys/class/apex/apex_0/temp
44800
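The before/after MSI-X check can be scripted instead of compared by eye. Below is a minimal Python sketch that parses the `Capabilities: [..] MSI-X:` lines from `lspci -vvv` output and reports entries that went from `Enable+` to `Enable-`. The function names are made up for illustration. Because the grep output drops the device address, entries are matched by capability offset and vector count, which is only a heuristic; capturing `lspci -vvv -s 02:00.0` (the Coral's address on the failing host) would avoid that ambiguity.

```python
import re

# Matches lines such as "Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-".
MSIX_RE = re.compile(r"Capabilities: \[([0-9a-f]+)\] MSI-X: Enable([+-]) Count=(\d+)")


def parse_msix(lspci_text):
    """Return a list of (capability_offset, enabled, vector_count) tuples."""
    return [
        (m.group(1), m.group(2) == "+", int(m.group(3)))
        for m in MSIX_RE.finditer(lspci_text)
    ]


def msix_disabled(before, after):
    """Entries enabled in `before` that show up as disabled in `after`."""
    was_on = {(off, cnt) for off, on, cnt in parse_msix(before) if on}
    return [
        (off, cnt)
        for off, on, cnt in parse_msix(after)
        if not on and (off, cnt) in was_on
    ]


# Output from the failing host, before and after running the example.
before = "    Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-\n"
after = "    Capabilities: [d0] MSI-X: Enable- Count=128 Masked-\n"
print(msix_disabled(before, after))
```

On the working machine the same comparison comes back empty, since every entry stays at `Enable+` after the run.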

So I guess I'm out of luck using it on that server? Or do you happen to have any other ideas?

hjonnala commented 1 year ago

> So I guess I'm out of luck using it on that server? Or do you happen to have any other ideas?

Unfortunately, I don't have any other ideas. Glad that you were able to get it working on another host machine.
