magic-blue-smoke / Dual-Edge-TPU-Adapter

Dual Edge TPU Adapter to use it on a system with single PCIe port on m.2 A/B/E/M slot

TPU (with PCIe adapter) not functioning/Throwing pcieport/Apex errors #43

Closed lschapker closed 8 months ago

lschapker commented 9 months ago

This is a "self-built" AMD Epyc server:
Motherboard: ASRock Rack ROMED8-2T (on latest BIOS version)
CPU: Epyc 7302
OS: Proxmox 8
TPU adapter installed in PCIe slot 7 (I have also tried slot 6)

I am trying to follow https://github.com/Bytelake/Coral-in-LXC for the install.

I just received the "Dual Edge TPU Adapter - PCIe x1 Low Profile" and installed my Dual TPU in it (the TPU is also new; I have no way to test it without this adapter).

Upon booting I receive:

[    5.197351] pcieport 0000:80:01.1: Data Link Layer Link Active not set in 1000 msec
[    5.197355] pcieport 0000:80:01.1: AER: subordinate device reset failed
[    5.197367] pcieport 0000:80:01.1: AER: device recovery failed
[    5.197370] pcieport 0000:80:01.1: DPC: containment event, status:0x1f01 source:0x0000
[    5.197371] pcieport 0000:80:01.1: DPC: unmasked uncorrectable error detected
[    5.197378] pcieport 0000:80:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[    5.197465] pcieport 0000:80:01.1:   device [1022:1483] error status/mask=00090000/04000000
[    5.197542] pcieport 0000:80:01.1:    [16] UnxCmplt              
[    5.197611] pcieport 0000:80:01.1:    [19] ECRC                   (First)
[    5.197682] pcieport 0000:80:01.1: AER:   TLP Header: 4a008001 84000004 80002100 00000000
[    5.197762] pci 0000:83:00.0: AER: can't recover (no error_detected callback)
[    5.197764] pci 0000:84:00.0: AER: can't recover (no error_detected callback)

[    8.042192] apex 0000:83:00.0: Unable to change power state from D3cold to D0, device inaccessible

[ 1433.786816] apex 0000:83:00.0: Apex performance not throttled due to temperature
[ 1436.346787] apex 0000:84:00.0: Apex performance not throttled due to temperature
[ 1438.906750] apex 0000:83:00.0: Apex performance not throttled due to temperature
[ 1441.466714] apex 0000:84:00.0: Apex performance not throttled due to temperature
[ 1444.026694] apex 0000:83:00.0: Apex performance not throttled due to temperature
[ 1446.586653] apex 0000:84:00.0: Apex performance not throttled due to temperature
[ 1449.146624] apex 0000:83:00.0: Apex performance not throttled due to temperature

[ 2092.722612] apex 0000:83:00.0: RAM did not enable within timeout (12000 ms)
[ 2092.722651] apex 0000:83:00.0: Error in device open cb: -110

After booting, "lspci" sees the 2 TPU cores, and the files "/dev/apex_0" and "/dev/apex_1" exist.

When I move the adapter to slot 6, "0000:80:01.1" above changes to "0000:c0:01.1".

I'm kind of new to Linux at this level and not sure how to go about debugging this issue. I've done a bunch of Google searching and am not finding a whole lot. Thank you!
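For anyone reading the AER line in the log above, the "error status/mask" pair is a pair of hexadecimal bitfields. The following is a minimal Python sketch, not part of any driver or tool, that lists which bit positions are set. The function name `set_bits` is made up for illustration; the two input values are copied from the log. It only reports bit numbers and does not try to name them beyond what the kernel already printed ([16] and [19]).

```python
def set_bits(hex_value: str) -> list[int]:
    """Return the positions of the bits set in a 32-bit hex string."""
    value = int(hex_value, 16)
    return [bit for bit in range(32) if (value >> bit) & 1]


# Values copied from: "error status/mask=00090000/04000000"
status_bits = set_bits("00090000")
mask_bits = set_bits("04000000")

print("status bits:", status_bits)
print("mask bits:", mask_bits)
```

The status bits line up with the two entries the kernel decoded on the following log lines, [16] UnxCmplt and [19] ECRC.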

lschapker commented 9 months ago

root@proxmox-pr:~# lspci |grep -i tpu
83:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
84:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
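If it helps anyone scripting a check, the PCI addresses can be pulled out of lspci text like the above with a short Python sketch. This is only an illustration: the helper name `coral_addresses` is hypothetical, and the pattern assumes the device description matches the string shown in this output exactly.

```python
import re

# Assumption: the device line reads exactly as in the lspci output above.
CORAL_LINE = re.compile(
    r"\b([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+System peripheral: "
    r"Global Unichip Corp\. Coral Edge TPU",
    re.IGNORECASE,
)


def coral_addresses(lspci_text: str) -> list[str]:
    """Return the bus:device.function address of each Coral Edge TPU line."""
    return CORAL_LINE.findall(lspci_text)


sample = """83:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
84:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU"""

print(coral_addresses(sample))
```

Two addresses in the result would correspond to the two cores of one Dual Edge TPU.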

lschapker commented 9 months ago

I have also attempted to install the adapter/TPU in my "very old" desktop computer (running Debian 12). Sadly, it will not POST, giving 1 long and 3 or 4 short beeps. Changing PCIe slots leads to the same result: no POST. Remove the card and it POSTs fine.

lschapker commented 8 months ago

Update: I purchased a different adapter (a B/M key to A/E key PCIe adapter) to determine whether I had a TPU issue or an adapter issue. With a "double adapter" chain (i.e. main board PCIe x16 slot to B/M key PCIe, B/M key PCIe to A/E key PCIe, then the TPU), my base OS (Proxmox 8) could not see the TPU, so I presumed the TPU was defective. After replacing the TPU and going back to the original PCIe to E key adapter, the OS can now see both TPUs, and my LXC instance can see and use at least one of the cores. I have more debugging to do.

So, in summary, I had a defective TPU (thank goodness for Amazon's return policy!).