magic-blue-smoke / Dual-Edge-TPU-Adapter

Dual Edge TPU Adapter to use it on a system with single PCIe port on m.2 A/B/E/M slot

Dual TPU detected, but not functioning #31

Closed by Drizzt321 1 year ago

Drizzt321 commented 1 year ago

Hardware: ASRock Rack X470D4U Ryzen 1700X

So I got the low-profile Dual-Edge TPU adapter, along with a Dual TPU off of eBay. Got them installed into my home NAS (slot PCIE6) to run Frigate, via a VM. Got PCIe passthrough set up and working, passing both TPUs through to the VM, which is running the latest Debian. Installed the drivers etc. as per the setup instructions, and I see both devices in /dev/ as I'd expect.

# ls -l /dev/apex_*
crw-rw---- 1 root apex 120, 0 Jan  1 13:07 /dev/apex_0
crw-rw---- 1 root apex 120, 1 Jan  1 13:07 /dev/apex_1
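The same check can be scripted. Below is a minimal sketch using only the Python standard library; the function name `find_apex_nodes` and its `dev_dir` parameter are hypothetical, introduced here for illustration. It reports whether each `apex_*` entry is a character device and whether the current user can read and write it. Note that, as the rest of this thread shows, the nodes being present only means the driver bound to the devices, not that inference will work.

```python
import glob
import os
import stat


def find_apex_nodes(dev_dir="/dev"):
    """List apex_* entries under dev_dir with basic sanity checks.

    dev_dir is a parameter only so the function can be pointed at a
    scratch directory; on a real system it would stay at /dev.
    """
    nodes = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "apex_*"))):
        mode = os.stat(path).st_mode
        nodes.append({
            "path": path,
            # Real Edge TPU nodes are character devices (the leading "c" in ls -l).
            "is_char_device": stat.S_ISCHR(mode),
            # The nodes above are root:apex with mode 0660, so a user outside
            # the apex group would fail this check.
            "read_write": os.access(path, os.R_OK | os.W_OK),
        })
    return nodes


# Example: for node in find_apex_nodes(): print(node)
```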

Starting up Frigate, it sees both TPUs, but the detector threads keep crashing. I decided to run the sample detection model; however, it just sits there and never produces a result:

# python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.

When I look in dmesg, I notice what look like PCI routing or interrupt issues, which is very strange:

[   11.325719] apex 0000:00:0a.0: can't derive routing for PCI INT A
[   11.325721] apex 0000:00:0a.0: PCI INT A: no GSI
[   11.330283] apex 0000:00:0b.0: can't derive routing for PCI INT A
[   11.330285] apex 0000:00:0b.0: PCI INT A: no GSI

I also see some gasket messages above those apex messages:

[   11.270448] gasket: loading out-of-tree module taints kernel.
[   11.270504] gasket: module verification failed: signature and/or required key missing - tainting kernel

Host shows these PCIe devices:

2b:00.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
2c:03.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
2c:07.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
2d:00.0 Non-VGA unclassified device: Global Unichip Corp. Coral Edge TPU
2e:00.0 Non-VGA unclassified device: Global Unichip Corp. Coral Edge TPU

VM shows these PCIe devices:

00:0a.0 System peripheral: Global Unichip Corp. Coral Edge TPU
00:0b.0 System peripheral: Global Unichip Corp. Coral Edge TPU

Anyone have any ideas? Bad card and I should ask for a refund? Should I get a basic PCIe holder like this one to try and get just 1 of them working to verify the Dual-Edge TPU card is fine?

magic-blue-smoke commented 1 year ago

Hi @Drizzt321, from the diagnostics you're showing I can't identify any "standard" issues and can only recommend generic actions:

All adapters are tested prior to shipment. However, if at any point you feel you've tried all available options, feel free to contact me using the form at the bottom of the page here for a replacement or refund.

Drizzt321 commented 1 year ago

@magic-blue-smoke How do I detect MSI-X being enabled?
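One common way to check this on Linux is to look at the capability list printed by `lspci -vv` for the device, where an MSI-X capability appears as `MSI-X: Enable+` (enabled) or `MSI-X: Enable-` (disabled). The sketch below is an illustration rather than anything posted in this thread: the helper names `msix_enabled` and `query_device` are hypothetical, and it assumes `lspci` (from pciutils) is installed and run as root, since capabilities are hidden from unprivileged users.

```python
import re
import subprocess

# lspci -vv prints a capability line such as:
#   Capabilities: [d0] MSI-X: Enable+ Count=128 Masked-
# (the offset and count here are illustrative, not taken from this thread).
MSIX_RE = re.compile(r"MSI-X: Enable([+-])")


def msix_enabled(lspci_vv_text):
    """Return True/False for the MSI-X enable flag, or None if the
    text lists no MSI-X capability at all."""
    match = MSIX_RE.search(lspci_vv_text)
    if match is None:
        return None
    return match.group(1) == "+"


def query_device(address):
    """Run lspci for one PCI address (e.g. "00:0a.0") and check MSI-X.

    Needs root; without it lspci reports the capabilities as
    access denied and this returns None.
    """
    result = subprocess.run(
        ["lspci", "-vv", "-s", address],
        capture_output=True, text=True, check=True,
    )
    return msix_enabled(result.stdout)


# Example, inside the VM from this thread:
#   for address in ("00:0a.0", "00:0b.0"):
#       print(address, query_device(address))
```

The parsing is kept separate from the `lspci` call so it can also be run against saved output, for example when comparing the host and the VM.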

Drizzt321 commented 1 year ago

@magic-blue-smoke OK, after upgrading the host system from FreeBSD 12.2 to FreeBSD 13.1, things are now working. I still see:

[    5.077526] apex 0000:00:0a.0: can't derive routing for PCI INT A
[    5.077527] apex 0000:00:0a.0: PCI INT A: no GSI
[    5.082195] apex 0000:00:0b.0: can't derive routing for PCI INT A
[    5.082196] apex 0000:00:0b.0: PCI INT A: no GSI

But the sample Coral test worked just fine.

~/coral/pycoral$ python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
11.8ms
2.7ms
2.9ms
2.9ms
2.9ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.75781

So it looks like it was some kind of issue or bug in the host system's PCIe passthrough.