google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Unraid PCIe issues #647

Open markmghali opened 2 years ago

markmghali commented 2 years ago

Description

Hello,

I have been trying to get your PCIe adapter to work for a few months now with no luck. I am using Unraid with the Frigate v0.10 Docker container. I can see both TPUs as apex_0 and apex_1. The symptom is that Frigate will run for a bit, then I get a PCIe error in my Unraid syslog. It will then shut down one of the TPUs and the temperature reading goes negative. I have posted my issue in the Frigate GitHub and the Unraid forums with no luck. I have reposted my Unraid post below. Please let me know what else I can troubleshoot. I love all the work you have done for the community and am hoping to get this to work properly.

I am having a similar issue to @AdvancedMobileRepairs, using the Dual TPU in the Magic-Blue-smoke PCIe adapter. Prior to this I was using a single TPU with a different adapter that was working fine. I have been monitoring the Coral temperatures and they have not gone above 48 degrees. I have this error in my syslog:

image

Does anyone have any insight into this? I already asked in the Frigate GitHub and we troubleshot up to a point, but then they told me to ask in the Unraid forum.

Thank you

EDIT:

Per this thread:

https://forums.unraid.net/topic/103901-solved-aer-pcie-bus-errors/

I disabled ASPM on PCIe in my BIOS, restarted the server, and am running Frigate to see how long it works before the Coral shuts down.

And it failed again! That did not fix the issue. Very weird.

image

Temperature does not seem to be the issue.

image

Any insight?
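
For anyone wanting to log the temperature alongside the failures, here is a minimal Python sketch. It assumes the Coral PCIe driver exposes the reading in millidegrees Celsius at `/sys/class/apex/apex_N/temp`. The helper names and the `warn_at` threshold are hypothetical. Treating a negative reading as a sign that the TPU has stopped responding is only an inference from the symptom described above, not documented driver behavior.

```python
from pathlib import Path
from typing import Optional


def parse_millidegrees(raw: str) -> float:
    """Convert a raw sysfs reading such as '48000\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def classify_temp(celsius: float, warn_at: float = 85.0) -> str:
    """Label a reading. warn_at is a hypothetical threshold, not a Coral spec value."""
    if celsius < 0:
        # Assumption: a negative value means the device is no longer answering.
        return "implausible"
    if celsius >= warn_at:
        return "hot"
    return "ok"


def read_apex_temp(index: int, root: str = "/sys/class/apex") -> Optional[float]:
    """Read apex_<index>/temp, returning None if the node is missing or unreadable."""
    path = Path(root) / f"apex_{index}" / "temp"
    try:
        return parse_millidegrees(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for i in (0, 1):
        temp = read_apex_temp(i)
        status = "unavailable" if temp is None else classify_temp(temp)
        print(f"apex_{i}: {temp} C ({status})")
```

Running this on a timer and comparing timestamps with the syslog errors would show whether the negative reading appears before or after the PCIe error.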

### Issue Type
Build/Install, Performance

### Operating System
Linux

### Coral Device
M.2 Accelerator with dual Edge TPU

### Other Devices
_No response_

### Programming Language
_No response_

### Relevant Log Output
_No response_

hjonnala commented 2 years ago

Hi @markmghali, can you please share how you turned off pcie_aspm? Have you added pcie_aspm=off to /boot/extlinux/extlinux.conf?

$ cat /boot/extlinux/extlinux.conf
TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
INITRD /boot/initrd
APPEND ${cbootargs} quiet pcie_aspm=off

markmghali commented 2 years ago

I disabled it via the BIOS on my motherboard.

image

image

Let me look at this file and let you know.

markmghali commented 2 years ago

I am on Unraid, and it does not seem to have an extlinux folder. It seems the other user added it to their syslinux config? I am not sure how to check that.

image
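
One way to check whether the parameter actually reached the kernel, regardless of which bootloader config file it was added to, is to look at `/proc/cmdline` on the running system. The sketch below is a minimal Python example; the helper names `kernel_param_value` and `aspm_disabled` are made up for illustration. On Unraid the boot config is usually reported to be `/boot/syslinux/syslinux.cfg`, but that path is an assumption here and is not confirmed in this thread.

```python
from typing import Optional


def kernel_param_value(cmdline: str, name: str) -> Optional[str]:
    """Return the value of a kernel command-line parameter.

    Returns '' for a bare flag, None if the parameter is absent.
    If the parameter is repeated, the last occurrence wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def aspm_disabled(cmdline: str) -> bool:
    """True if pcie_aspm=off is present on the given command line."""
    return kernel_param_value(cmdline, "pcie_aspm") == "off"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print("pcie_aspm=off active:", aspm_disabled(fh.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

If this prints `False` after a reboot, the edit to the boot config did not take effect, and the BIOS setting alone is what the kernel sees.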

markmghali commented 3 months ago

Yes, I have added pcie_aspm=off in the OS and in the BIOS as well. I am still having this issue, but only with this Magic-Blue-smoke PCIe adapter. Prior to this I was using a single TPU with a different adapter that was working fine. It is just this adapter for some reason. This is the adapter: https://www.makerfabs.com/dual-edge-tpu-adapter.html

root@RAID:~# dmesg | grep apex
[ 58.428393] apex 0000:83:00.0: enabling device (0100 -> 0102)
[ 58.431805] apex 0000:84:00.0: enabling device (0100 -> 0102)
[ 63.448106] apex 0000:84:00.0: Apex performance not throttled due to temperature
[ 63.448121] apex 0000:83:00.0: Apex performance not throttled due to temperature
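
To compare what each of the two TPUs logged, the `dmesg | grep apex` output can be grouped by PCI address. This is a rough Python sketch that assumes one kernel message per line in the `[ seconds] apex <address>: <message>` shape shown above. The names `ApexEvent`, `parse_apex_lines` and `group_by_device` are illustrative and not part of any Coral tooling.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ApexEvent(NamedTuple):
    seconds: float      # kernel timestamp in seconds since boot
    pci_address: str    # e.g. "0000:83:00.0"
    message: str


# Matches lines like: [ 58.428393] apex 0000:83:00.0: enabling device (0100 -> 0102)
_APEX_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+apex\s+([0-9a-fA-F:.]+):\s+(.*)$")


def parse_apex_lines(dmesg_text: str) -> List[ApexEvent]:
    """Extract apex driver events, ignoring lines that do not match."""
    events = []
    for line in dmesg_text.splitlines():
        match = _APEX_LINE.match(line.strip())
        if match:
            events.append(
                ApexEvent(float(match.group(1)), match.group(2), match.group(3))
            )
    return events


def group_by_device(events: List[ApexEvent]) -> Dict[str, List[ApexEvent]]:
    """Group events by PCI address, preserving their original order."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.pci_address].append(event)
    return dict(grouped)


SAMPLE = """\
[ 58.428393] apex 0000:83:00.0: enabling device (0100 -> 0102)
[ 58.431805] apex 0000:84:00.0: enabling device (0100 -> 0102)
[ 63.448106] apex 0000:84:00.0: Apex performance not throttled due to temperature
[ 63.448121] apex 0000:83:00.0: Apex performance not throttled due to temperature
"""

if __name__ == "__main__":
    for address, device_events in group_by_device(parse_apex_lines(SAMPLE)).items():
        print(address)
        for event in device_events:
            print(f"  {event.seconds:10.6f}  {event.message}")
```

Feeding it the full syslog from a failed run, rather than only the boot-time lines, would show which of the two addresses stops logging first.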