google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Installation failing on Raspberry Pi CM4 for PCI-E driver #280

Open timonsku opened 3 years ago

timonsku commented 3 years ago

Following the installation guide for the M.2, I get several compilation errors when it's trying to install gasket. Here is the log of the make process: gasket-make.log

It seems it's mostly the same few errors:

invalid use of undefined type ‘struct msix_entry’
implicit declaration of function ‘writeq_relaxed’; did you mean ‘writel_relaxed’
implicit declaration of function ‘readq_relaxed’; did you mean ‘readw_relaxed’
implicit declaration of function ‘pci_disable_msix’; did you mean ‘pci_disable_sriov’

This is using gcc version 8.3.0 on the latest Raspbian with kernel 5.4.51-v7l+. I'm unsure whether this is a compiler, kernel header, or code issue.

Namburger commented 3 years ago

Hello @timonsku, we have investigated the CM4 previously and unfortunately determined that it won't work with our PCIe modules, as the CPU doesn't support MSI-X, which our PCIe modules require.

timonsku commented 3 years ago

Hey Namburger, the pi engineers have worked on this and have added support for MSI-X in the latest kernel. See this forum discussion: https://www.raspberrypi.org/forums/viewtopic.php?p=1772216&sid=fa34ae6597591c1f80cb68c8138c6a67#p1772216

Namburger commented 3 years ago

As I mentioned, we have explored this path and there is still a little ongoing effort, but I don't believe it is something we can promise. @mbrooksx might be able to give you more info on this.

timonsku commented 3 years ago

Oh I see. If it doesn't turn out to be a true hw limitation I would be very interested in seeing this getting supported. I currently have hardware in development that would see good use of the M.2 modules.

usbguru commented 3 years ago

@timonsku Unfortunately this ARM hardware does not support MSI-X. The raspberry pi discussion you referenced raised my hopes that limited performance with emulated interrupts might work. Although it still does not work, the on-going work is encouraging, and might lead to performance nearly as good as if the original MSI-X hardware interrupts were on the ARM silicon. Stay tuned!

mbrooksx commented 3 years ago

@timonsku: Yes, I'm actively working with the people in the Pi forum discussion. While MSI-X isn't technically supported by the BCM2711, as you saw from that patch, if software indicates that it works, the PCIe hardware is actually able to map some MSI-X interrupts correctly.

We've validated further than you have (including MSI-X). Your errors occur because you're building for the 32-bit kernel but the driver expects 64-bit reads/writes (which is why writeq/readq don't exist). My plan is to customize the driver for the Pi (including 32-bit workarounds) and likely submit it to the Pi kernel rather than trying to update our DKMS package. I will keep you informed of the status.

timonsku commented 3 years ago

Awesome that is great to hear :)

Valdiolus commented 3 years ago

Great to hear that somebody is working on this issue! Already received my RPI CM4 + IO Board + PCIe Coral acc. Any news? Maybe I can help?

markus-k commented 3 years ago

Has anyone had a go at this? I've done a bit of debugging and hacking myself and got the kernel module to load and libedgetpu to start an inference (although it never finishes, some event is missing, and there is an HIB error?).

There are some changes needed in both the kernel module and the user-space drivers, so far primarily replacing 64bit memory accesses with two 32bit ones. My progress is here for the module which I have updated to the latest version from the dkms package and here for libedgetpu, but these changes are of course nowhere near merge-quality.
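To illustrate the "two 32-bit accesses instead of one 64-bit access" idea, here is a minimal Python model. It is only a sketch, not the actual gasket or libedgetpu code: the `FakeBar` class and the helper names are hypothetical, and it assumes the low word sits at the lower offset (little-endian layout), which is consistent with the `w0`/`w1` fields in the log below but is not confirmed by the thread.

```python
import struct

MASK32 = 0xFFFFFFFF


class FakeBar:
    """Hypothetical stand-in for a mapped BAR that only allows 32-bit accesses."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def read32(self, offset):
        return struct.unpack_from("<I", self.mem, offset)[0]

    def write32(self, offset, value):
        struct.pack_into("<I", self.mem, offset, value & MASK32)


def read64_split(bar, offset):
    """Read a 64-bit register as two 32-bit reads (low word first, by assumption)."""
    w0 = bar.read32(offset)      # lower 32 bits
    w1 = bar.read32(offset + 4)  # upper 32 bits
    return (w1 << 32) | w0


def write64_split(bar, offset, value):
    """Write a 64-bit register as two 32-bit writes (low word first, by assumption)."""
    bar.write32(offset, value & MASK32)
    bar.write32(offset + 4, (value >> 32) & MASK32)


bar = FakeBar(0x100)
write64_split(bar, 0x10, 0x8000000000400000)
print(hex(read64_split(bar, 0x10)))
```

On real hardware the order of the two accesses, and whether the device latches a register on the first or second word, may matter; the thread does not say, so this model ignores it.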

This is what libedgetpu logs:

I :273] Starting in normal mode
I :83] Opening /dev/apex_0. read_only=0
I :97] mmap_offset=0x0000000000040000, mmap_size=4096
I :108] Got map addr at 0x0xb6fde000
I :97] mmap_offset=0x0000000000044000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdd000
I :97] mmap_offset=0x0000000000048000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdc000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x00000000000487a8, value = 0x0000000000000000
I :229] Read: offset = 0x0000000000048578, value: = 0x0000000000000010, w0=0x00000010, w1=0x00000000
I :136] MmuMapper#Map() : 00000000b6627000 -> 0000000001000000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001000000
I :169] Queue base : 0xb6627000 -> 0x0000000001000000 [4096 bytes]
I :136] MmuMapper#Map() : 00000000b6628000 -> 0000000001001000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001001000
I :179] Queue status block : 0xb6628000 -> 0x0000000001001000 [16 bytes]
I :191] Write: offset = 0x0000000000048590, value = 0x0000000001000000
I :191] Write: offset = 0x0000000000048598, value = 0x0000000001001000
I :191] Write: offset = 0x00000000000485a0, value = 0x0000000000000100
I :191] Write: offset = 0x0000000000048568, value = 0x0000000000000005
I :229] Read: offset = 0x0000000000048570, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x00000000000486d0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x0000000000044018, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044158, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044198, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000441d8, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044218, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000048788, value = 0x000000000000007f
I :229] Read: offset = 0x0000000000048788, value: = 0x000000000000007f, w0=0x0000007f, w1=0x00000000
I :191] Write: offset = 0x00000000000400c0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040150, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040110, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040250, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040298, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000402e0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040328, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040190, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000401d0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040210, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486e8, value = 0x0000000000000000
I :45] Set event fd : event_id:0 -> event_fd:7,
I :45] Set event fd : event_id:4 -> event_fd:11,
I :62] event_fd=7. Monitor thread begin.
I :45] Set event fd : event_id:5 -> event_fd:12,
I :45] Set event fd : event_id:6 -> event_fd:13,
I :62] event_fd=12. Monitor thread begin.
I :62] event_fd=11. Monitor thread begin.
I :45] Set event fd : event_id:7 -> event_fd:14,
I :62] event_fd=13. Monitor thread begin.
I :45] Set event fd : event_id:8 -> event_fd:15,
I :62] event_fd=14. Monitor thread begin.
I :45] Set event fd : event_id:9 -> event_fd:16,
I :45] Set event fd : event_id:10 -> event_fd:17,
I :62] event_fd=15. Monitor thread begin.
I :45] Set event fd : event_id:11 -> event_fd:18,
I :62] event_fd=16. Monitor thread begin.
I :62] event_fd=17. Monitor thread begin.
I :45] Set event fd : event_id:12 -> event_fd:19,
I :62] event_fd=18. Monitor thread begin.
I :191] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :191] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :172] Opening device at /dev/apex_0
I :62] event_fd=19. Monitor thread begin.
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :136] MmuMapper#Map() : 00000000ad93d000 -> 8000000000000000 (953 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000000000
I :252] Mapped params : Buffer(ptr=0xad93d000) -> 0x8000000000000000, 3900864 bytes.
I :252] Mapped params : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :387] Request [0]: Need to do parameter-caching.
I :80] [0] Request constructed.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :368] MapDataBuffers() done.
I :187] Linking Parameter: 0x8000000000000000
I :136] MmuMapper#Map() : 0000000001266000 -> 8000000000400000 (3 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000400000
I :223] Mapped "instructions" : Buffer(ptr=0x1266000) -> 0x8000000000400000, 9680 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [0] SetState old=0, new=1.
I :393] [0] NotifyRequestSubmitted()
I :481] [0] SetState old=1, new=2.
I :83] Request[0]: Submitted
I :401] [0] NotifyRequestActive()
I :481] [0] SetState old=2, new=3.
I :133] Request[0]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000001
I :80] [1] Request constructed.
I :113] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :188] Adding output "prediction" with 965 bytes.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :136] MmuMapper#Map() : 0000000001226000 -> 8000000000440000 (38 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000440000
I :223] Mapped "map/TensorArrayStack/TensorArrayGatherV3" : Buffer(ptr=0x1226440) -> 0x8000000000440440, 150528 bytes. Direction=1
I :136] MmuMapper#Map() : 0000000001276000 -> 8000000000404000 (1 pages) flags=00000004.
I :55] MapMemory() page-aligned : device_address = 0x8000000000404000
I :223] Mapped "prediction" : Buffer(ptr=0x1276000) -> 0x8000000000404000, 968 bytes. Direction=2
I :368] MapDataBuffers() done.
I :93] Linking map/TensorArrayStack/TensorArrayGatherV3[0]: 0x8000000000440440
I :93] Linking prediction[0]: 0x8000000000404000
I :136] MmuMapper#Map() : 00000000012b9000 -> 8000000000420000 (32 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000420000
I :223] Mapped "instructions" : Buffer(ptr=0x12b9000) -> 0x8000000000420000, 129536 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [1] SetState old=0, new=1.
I :393] [1] NotifyRequestSubmitted()
I :481] [1] SetState old=1, new=2.
I :83] Request[1]: Submitted
I :401] [1] NotifyRequestActive()
I :481] [1] SetState old=2, new=3.
I :133] Request[1]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000002
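The page counts in the MmuMapper#Map() lines above are consistent with 4096-byte pages. A quick sketch of that arithmetic follows; the 4096-byte page size is an assumption (suggested by mmap_size=4096, but not stated outright), and `pages_spanned` is a hypothetical helper, not something from libedgetpu.

```python
PAGE_SIZE = 4096  # assumed page size


def pages_spanned(ptr, num_bytes, page_size=PAGE_SIZE):
    """Number of pages touched by a buffer of num_bytes starting at ptr."""
    if num_bytes == 0:
        return 0
    first_page = ptr // page_size
    last_page = (ptr + num_bytes - 1) // page_size
    return last_page - first_page + 1


# (host pointer, size in bytes) pairs copied from the log above
print(pages_spanned(0xAD93D000, 3900864))  # parameters: log says 953 pages
print(pages_spanned(0x1266000, 9680))      # first "instructions" buffer: 3 pages
print(pages_spanned(0x1226440, 150528))    # input tensor, not page-aligned: 38 pages
print(pages_spanned(0x12B9000, 129536))    # second "instructions" buffer: 32 pages
```

Note that the input tensor starts 0x440 bytes into its first page, which is why it needs 38 pages rather than the 37 its size alone would suggest.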

Also the only interrupt firing seems to be the fatal error one:

cat /sys/class/apex/apex_0/interrupt_counts
0x00: 0
0x01: 0
0x02: 0
0x03: 0
0x04: 0
0x05: 0
0x06: 0
0x07: 0
0x08: 0
0x09: 0
0x0a: 0
0x0b: 0
0x0c: 2
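For anyone repeating this check, a small sketch that picks out the non-zero counters from that output. The "index: count" line format is assumed from the output shown above, and `nonzero_interrupts` is a hypothetical helper name.

```python
def nonzero_interrupts(text):
    """Parse 'index: count' lines and return {index: count} for non-zero counts."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        index_str, count_str = line.split(":", 1)
        count = int(count_str.strip())
        if count:
            counts[int(index_str.strip(), 16)] = count
    return counts


# Shortened copy of the output above; on a device the text would come from
# /sys/class/apex/apex_0/interrupt_counts instead.
sample = """0x00: 0
0x0b: 0
0x0c: 2"""
print(nonzero_interrupts(sample))
```

With the sample above only index 0x0c (12) is reported, with a count of 2.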

Namburger commented 3 years ago

@markus-k whoa, thanks for sharing that! @mbrooksx for awareness.

hiwudery commented 3 years ago

@markus-k thank you for sharing. I added othbootargs=gasket.dma_bit_mask=32 to avoid the HIB error, but after running the sample program I still get the following errors. Do you have any ideas? (Raspbian OS is 32-bit; all the code was downloaded from markus-k's repo.) Thank you -Jack

[Screenshots: messageImage_1612070152087, messageImage_1612070000068]

markus-k commented 3 years ago

@hiwudery That's weird. Your upper and lower 32bits are cloned when reading from the device (see the line with I :229), which my patch should fix. Maybe the compiler optimized the two reads into one ldrd? But since that still performs two 32bit accesses, I don't really understand why that happens.
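A quick way to scan a log for this symptom is sketched below. It assumes the "Read:" line format from the log earlier in this thread and that w1 is the upper half of the value; `check_read_line` is a hypothetical helper, and the second sample line is made up to show the "cloned" case (the original screenshots are not reproduced here).

```python
import re

READ_LINE = re.compile(
    r"Read: offset = (0x[0-9a-fA-F]+), value: = (0x[0-9a-fA-F]+), "
    r"w0=(0x[0-9a-fA-F]+), w1=(0x[0-9a-fA-F]+)"
)


def check_read_line(line):
    """Return (consistent, cloned) for one 'Read:' log line, or None if it does not match."""
    m = READ_LINE.search(line)
    if m is None:
        return None
    _offset, value, w0, w1 = (int(g, 16) for g in m.groups())
    # The 64-bit value should be the two words glued together (w1 assumed upper).
    consistent = value == ((w1 << 32) | w0)
    # "Cloned" here means both halves are identical and non-zero, which is suspicious.
    cloned = w0 == w1 and w0 != 0
    return consistent, cloned


# Line copied from the log earlier in the thread.
good = "I :229] Read: offset = 0x0000000000048578, value: = 0x0000000000000010, w0=0x00000010, w1=0x00000000"
# Made-up line illustrating duplicated halves.
bad = "I :229] Read: offset = 0x0000000000048578, value: = 0x0000001000000010, w0=0x00000010, w1=0x00000010"

print(check_read_line(good))
print(check_read_line(bad))
```

Identical halves are not always an error (a register could legitimately hold such a value), so this is only a hint about where to look.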

I just tried setting dma_bit_mask but still get HIB Errors, in addition to out of memory errors when mapping buffers. Also from dmesg:

[  971.201472] apex 0000:01:00.0: gasket_perform_mapping i 0
[  971.201480] apex 0000:01:00.0: gasket_page_table_map done: ha b657c000 daddr 1000000 num 1, flags 0 ret 0
[  971.201552] apex 0000:01:00.0: gasket_perform_mapping i 0
[  971.201558] apex 0000:01:00.0: gasket_page_table_map done: ha b657d000 daddr 1001000 num 1, flags 0 ret 0
[  971.271839] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[  971.271854] apex 0000:01:00.0: no memory for extended addr subtable
[  971.271861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[  971.271868] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 2 ret -12
[  971.271907] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[  971.271915] apex 0000:01:00.0: no memory for extended addr subtable
[  971.271921] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[  971.271928] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 0 ret -12

I'm also not sure if dma_bit_mask is right here. The comment says it's used for PCIe controllers which can't do 64-bit addressing, but the Raspberry Pi's PCIe controller can do 64-bit addressing, just with only 32-bit wide accesses (as noted by PhilE here).

mbrooksx commented 3 years ago

Yes, what you've done is essentially everything I've done for debug. The only additional change you alluded to is correct - the compiler is too smart for libedgetpu and expects a competent system that would be able to have 64-bit wide accesses. I fixed this by using volatile variables to skip caching. My repos of progress are: https://github.com/mbrooksx/libedgetpu (Userspace) https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)

Note that I added an additional print - the host-side page address for the failed DMA transaction (it reports 0x100004000000000 - which is outside of the Pi RAM). The hope is that dma_bit_mask and command line swiotlb=65536 would create shadow registers in the 32-bit space but the Pi PCIe restrictions are very challenging. It is likely the coherent memory (setup in libedgetpu) is corrupted and thus the shared memory between the two is passing invalid information.

The other option that may be easier is the 32-bit kernel. It has issues with allocating enough BAR memory, but with some device tree tweaks this could likely be fixed. This paired with the 32-bit "aware" user-space may be an easier path. I've asked the Pi team to investigate this as well.

geerlingguy commented 3 years ago

@mbrooksx - And for the benefit of anyone who hasn't touched BAR space allocations, here's a guide I wrote on it a few months back testing graphics cards on the CM4: https://gist.github.com/geerlingguy/9d78ea34cab8e18d71ee5954417429df

The latest 5.10.y kernels for Pi OS already increased the default allocation to 1 GB I think (maybe even 4 or 8 GB? I don't remember if I followed up and checked on those commits).

markus-k commented 3 years ago

Alright, at least I haven't been looking in the completely wrong place. I've done most of my debugging on a 32-bit kernel so far. The default BAR space seems to be 1GB, I'm not sure if that's enough, but I'm not seeing any BAR allocation errors.

In case this helps anyone, some more debug logs. I've added your additional debug print, on a 32-bit kernel without any additional parameters:

[   77.630936] apex 0000:01:00.0: Fault VA: 0x0
[   77.630952] apex 0000:01:00.0: Fault VA: 0x0
[   77.635926] apex 0000:01:00.0: Fault VA: 0x0
[   77.635940] apex 0000:01:00.0: Fault VA: 0x0
[   77.635953] apex 0000:01:00.0: Fault VA: 0x0
[   77.635966] apex 0000:01:00.0: Fault VA: 0x0
[   77.635978] apex 0000:01:00.0: Fault VA: 0x0
[   77.635990] apex 0000:01:00.0: Fault VA: 0x0
[   77.636002] apex 0000:01:00.0: Fault VA: 0x0
[   77.636014] apex 0000:01:00.0: Fault VA: 0x0
[   83.141193] apex 0000:01:00.0: Fault VA: 0x1001000
[   83.141216] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   83.141237] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[   83.141259] apex 0000:01:00.0: Fault VA: 0x1001000
[   83.141277] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   83.141296] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[   83.141320] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[   83.141345] apex 0000:01:00.0: Fault VA: 0xffffffff
[   83.141362] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[   83.141381] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[   83.141402] apex 0000:01:00.0: Fault VA: 0x0
[   83.150222] apex 0000:01:00.0: Fault VA: 0x0
[   83.150243] apex 0000:01:00.0: Fault VA: 0x0
[   83.150263] apex 0000:01:00.0: Fault VA: 0x0
[   83.150284] apex 0000:01:00.0: Fault VA: 0x0
[   83.150309] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff

I've also tried using gasket.dma_bit_mask=32 swiotlb=65536 on a 32-bit kernel:

[   41.372303] apex 0000:01:00.0: Fault VA: 0x0
[   41.372321] apex 0000:01:00.0: Fault VA: 0x0
[   41.378062] apex 0000:01:00.0: Fault VA: 0x0
[   41.378079] apex 0000:01:00.0: Fault VA: 0x0
[   41.378094] apex 0000:01:00.0: Fault VA: 0x0
[   41.378109] apex 0000:01:00.0: Fault VA: 0x0
[   41.378124] apex 0000:01:00.0: Fault VA: 0x0
[   41.378139] apex 0000:01:00.0: Fault VA: 0x0
[   41.378153] apex 0000:01:00.0: Fault VA: 0x0
[   41.378168] apex 0000:01:00.0: Fault VA: 0x0
[   41.628343] ------------[ cut here ]------------
[   41.628367] WARNING: CPU: 3 PID: 707 at kernel/dma/swiotlb.c:683 swiotlb_map+0x38c/0x43c
[   41.628374] apex 0000:01:00.0: swiotlb addr 0x0000000415400000+4096 overflow (mask ffffffff, bus limit 47fffffff).
[   41.628379] Modules linked in: sha256_generic cfg80211 rfkill 8021q garp stp llc binfmt_misc v3d raspberrypi_hwmon vc4 gpu_sched dwc2 cec roles drm_kms_helper drm bcm2835_isp(C) i2c_bcm2835 bcm2835_codec(C) bcm2835_v4l2(C) drm_panel_orientation_quirks v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc apex(C) snd_soc_core vc_sm_cma(C) gasket(C) snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops backlight rpivid_mem uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[   41.628599] CPU: 3 PID: 707 Comm: python3 Tainted: G         C        5.10.6-v7l+ #6
[   41.628602] Hardware name: BCM2711
[   41.628605] Backtrace:
[   41.628617] [<c0b84b94>] (dump_backtrace) from [<c0b84f24>] (show_stack+0x20/0x24)
[   41.628621]  r7:ffffffff r6:00000000 r5:60000013 r4:c12e6c98
[   41.628626] [<c0b84f04>] (show_stack) from [<c0b892bc>] (dump_stack+0xcc/0xf8)
[   41.628632] [<c0b891f0>] (dump_stack) from [<c02216d4>] (__warn+0xfc/0x114)
[   41.628637]  r10:00001000 r9:00000009 r8:c02a5a50 r7:000002ab r6:00000009 r5:c02a5a50
[   41.628640]  r4:c0e3cd00 r3:c1205094
[   41.628645] [<c02215d8>] (__warn) from [<c0b856c8>] (warn_slowpath_fmt+0xa4/0xd8)
[   41.628648]  r7:000002ab r6:c0e3cd00 r5:c1205048 r4:c0e3ccbc
[   41.628654] [<c0b85628>] (warn_slowpath_fmt) from [<c02a5a50>] (swiotlb_map+0x38c/0x43c)
[   41.628658]  r9:c1b8b070 r8:c1205048 r7:00000000 r6:ffffffff r5:00000000 r4:ffffffff
[   41.628664] [<c02a56c4>] (swiotlb_map) from [<c02a0668>] (dma_map_page_attrs+0x254/0x394)
[   41.628668]  r10:00000001 r9:00001000 r8:c1b8b1e0 r7:00000000 r6:ffffffff r5:c1205048
[   41.628671]  r4:c1b8b070
[   41.628690] [<c02a0414>] (dma_map_page_attrs) from [<bf115184>] (gasket_map_extended_pages+0x100/0x45c [gasket])
[   41.628694]  r10:00000000 r9:c4112000 r8:c32ab700 r7:f09dc000 r6:00000200 r5:000003b9
[   41.628697]  r4:f085d018
[   41.628717] [<bf115084>] (gasket_map_extended_pages [gasket]) from [<bf115900>] (gasket_page_table_map+0xa8/0x100 [gasket])
[   41.628721]  r10:c32ab740 r9:ad63c000 r8:00000000 r7:80000000 r6:c2f97c00 r5:c32ab700
[   41.628724]  r4:000003b9
[   41.628741] [<bf115858>] (gasket_page_table_map [gasket]) from [<bf112a9c>] (gasket_map_buffers_common+0x90/0xa8 [gasket])
[   41.628745]  r10:00000005 r9:00000001 r8:c30e1180 r7:4028dc0c r6:c2f97c00 r5:c2f97c00
[   41.628748]  r4:c32a5d90
[   41.628767] [<bf112a0c>] (gasket_map_buffers_common [gasket]) from [<bf112cac>] (gasket_handle_ioctl+0x1f8/0x8e0 [gasket])
[   41.628770]  r5:beb40fa0 r4:c1205048
[   41.628788] [<bf112ab4>] (gasket_handle_ioctl [gasket]) from [<bf1106f8>] (gasket_ioctl+0x9c/0x118 [gasket])
[   41.628792]  r9:beb40fa0 r8:c2f97c00 r7:bf09a1b0 r6:4028dc0c r5:c30e1180 r4:c1205048
[   41.628805] [<bf11065c>] (gasket_ioctl [gasket]) from [<c0451180>] (sys_ioctl+0x1d4/0x8ec)
[   41.628809]  r9:c32a4000 r8:00000000 r7:c30e1180 r6:c30e1181 r5:c1205048 r4:4028dc0c
[   41.628815] [<c0450fac>] (sys_ioctl) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
[   41.628818] Exception stack(0xc32a5fa8 to 0xc32a5ff0)
[   41.628822] 5fa0:                   beb40f9c 00000000 00000005 4028dc0c beb40fa0 00000005
[   41.628826] 5fc0: beb40f9c 00000000 b454da7c 00000036 00000001 01f0349c 00000000 b48a4bbc
[   41.628829] 5fe0: b454db58 beb40f74 b443ba3f b6cd551c
[   41.628833]  r10:00000036 r9:c32a4000 r8:c0200204 r7:00000036 r6:b454da7c r5:00000000
[   41.628836]  r4:beb40f9c
[   41.628840] ---[ end trace a2d67e6b70f87dd2 ]---
[   41.628855] apex 0000:01:00.0: no memory for extended addr subtable
[   41.628861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[   41.628911] apex 0000:01:00.0: no memory for extended addr subtable
[   41.628917] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[   41.646322] apex 0000:01:00.0: Fault VA: 0x1001000
[   41.646330] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   41.646338] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[   41.646347] apex 0000:01:00.0: Fault VA: 0x1001000
[   41.646352] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[   41.646359] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[   41.646372] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[   41.646384] apex 0000:01:00.0: Fault VA: 0xffffffff
[   41.646389] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[   41.646396] apex 0000:01:00.0: Computed Failing Bus Addr: 0xdeadbeef
[   41.646405] apex 0000:01:00.0: Fault VA: 0x0
[   41.648266] apex 0000:01:00.0: Fault VA: 0x0
[   41.648275] apex 0000:01:00.0: Fault VA: 0x0
[   41.648283] apex 0000:01:00.0: Fault VA: 0x0
[   41.648292] apex 0000:01:00.0: Fault VA: 0x0
[   41.648305] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
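One way to read the "swiotlb addr ... overflow (mask ffffffff, bus limit 47fffffff)" warning above: the bounce-buffer address still does not fit under the 32-bit mask requested via dma_bit_mask=32. The sketch below is a simplified model of that kind of range check, written for illustration only; it is not the kernel's code, and `fits_dma_mask` is a hypothetical helper.

```python
def fits_dma_mask(bus_addr, size, mask_bits, bus_limit=None):
    """Simplified model: does [bus_addr, bus_addr + size) fit under the mask and bus limit?"""
    limit = (1 << mask_bits) - 1
    if bus_limit is not None:
        limit = min(limit, bus_limit)
    return bus_addr + size - 1 <= limit


# Values taken from the warning: address 0x415400000, 4096 bytes,
# mask 0xffffffff (32 bits), bus limit 0x47fffffff.
print(fits_dma_mask(0x415400000, 4096, 32, bus_limit=0x47FFFFFFF))  # False: above 4 GiB
print(fits_dma_mask(0x415400000, 4096, 64, bus_limit=0x47FFFFFFF))  # True: only the bus limit applies
```

Under this reading, the bounce buffer itself was placed above the 4 GiB boundary, so a 32-bit mask cannot be satisfied regardless of the swiotlb size. Whether that placement is configurable on the Pi is not covered in this thread.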

In this case mapping the buffer fails in libedgetpu:

I :192] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :62] event_fd=19. Monitor thread begin.
I :192] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :172] Opening device at /dev/apex_0
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :118] Failed to map buffer with flags, error -1
Traceback (most recent call last):
  File "classify_image.py", line 126, in <module>
    main()
  File "classify_image.py", line 115, in main
    interpreter.invoke()
  File "/home/pi/venv/lib/python3.7/site-packages/tflite_runtime/interpreter.py", line 540, in invoke
    self._interpreter.Invoke()
RuntimeError: Failed to execute request. Could not map pages : 5 (Cannot allocate memory)Node number 1 (EdgeTpuDelegateForCustomOp) failed to invoke.

I :226] Releasing Edge TPU device at /dev/apex_0
I :178] Closing Edge TPU device at /dev/apex_0
hiwudery commented 3 years ago

@markus-k in gasket_page_table.c, the page table uses a 64-bit format, not a 32-bit format. I think gasket_page_table also needs to be modified for the 32-bit kernel.

geerlingguy commented 3 years ago

I also wanted to note something here that may be of interest—I noticed earlier someone mentioned writeq being present on 64-bit OSes. I'll soon be testing the Coral TPU (M.2 A+E key version) on a Pi so haven't yet had first-hand experience, but with a different driver I was taking a look at, it seems that one problem may be that writeq is not supported on Pi OS / the Pi's PCI-E bus like it may be on some other 64-bit systems.

Edit: New bug reported relating to that driver issue is here: https://github.com/raspberrypi/linux/issues/4158

geerlingguy commented 3 years ago

On 64-bit Pi OS (with latest kernel compiled at 5.10.14-v8+), I get the following kernel panic after running through the default steps in the setup guide:

[Image: IMG_3633]

(Cross-linking to https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/44#issuecomment-780912830)

markus-k commented 3 years ago

You should probably read the rest of this issue, there hasn't been any development since my last comment to my knowledge. The default gasket module won't work at all, my fixed one at least loads and can read temperature, but something is still wrong with the DMA, so it won't work either. Then there's probably still a few other things broken in the user space driver as well.

I don't have the time to dig into this right now, and my knowledge with kernel dev is limited anyway. So best we can do is hope someone with deep understanding of how the DMA and TPU works can find some time and look into it.

timonsku commented 3 years ago

@mbrooksx sounded like Google was working on it? Maybe he could update us. I still have very big interest in this for my product but don't have the resources or know-how to dig into this.

markus-k commented 3 years ago

If someone at Google is working on it, or is going to, it would be nice to get a very rough ETA (weeks, months) on when we can expect to know whether or not the TPU will ever work over PCIe on a CM4. I'll be creating a new revision of my product's PCB in a few weeks, and if there's very little chance the PCIe TPU will work anytime soon, I'll have to switch both to USB.

timonsku commented 3 years ago

Yea similar situation for me.

mbrooksx commented 3 years ago

I unfortunately don't have an estimated date. The CM4 PCIe hardware is antiquated, and there are endless hacks required to try to have it operate competently (note that the TPU is a PCIe bus master, and I don't see any evidence of a bus master ever being tested with the CM4). We haven't been receiving the support needed from the Pi team, so for now I'm continuing to try things to understand the issues with communication (at this point it seems to be an issue with the shared memory). It may be operational within the next few weeks (in which case I would post the hacked-up version for your evaluation while we decide the best way to release this without polluting the main Coral codebase). I will keep this thread up to date.

Depending on the board configuration, USB may be a better choice.

mbrooksx commented 3 years ago

My latest theory isn't encouraging (note that this would be really easy to solve in a non-COVID world, where we would just plug this into a PCIe bus analyzer and see what data the CM4 is malforming):

When you run a model through the compiler it assigns virtual memory locations for the various operations, scratch memory, weights, etc. There are two mappings these addresses use to map to physical pages, which the driver calls simple and extended. The issue is that the way to differentiate simple and extended is bit 63 (the most significant bit) of the 64-bit virtual address. So when the shared coherent memory between the CPU and TPU has been established, the TPU reads in this region to get the address of information it needs (in this case it's the location of the instruction queue). But because of the CM4's crippled PCIe bus, it is reading only 32 bits of the virtual address, which means it interprets every read as a simple read.

The problem then is that it will attempt to mmap this to the system and will get the wrong data (since the correct mapping was via the extended approach). Because the TPU is doing these reads (including the check of the extended-mapping bit) in hardware, we have no way to change which bit indicates extended mapping. If this is indeed the primary source of failure, it would require a hacked-up version of the compiler that assigns everything into simple mapping, which would cripple the maximum allowed size of the model, parameters, etc.
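
To make the failure mode described above concrete, here is a minimal Python sketch. It assumes that "the 63rd bit" means bit index 63 (the top bit of a 64-bit value) and models the suspected CM4 behavior as simply dropping the upper 32 bits. The helper names and the example address are hypothetical and are not taken from the driver.

```python
# Assumption: the simple/extended flag is bit index 63 of the 64-bit device address.
EXTENDED_FLAG_BIT = 63
MASK_32 = (1 << 32) - 1


def is_extended(device_va: int) -> bool:
    """Return True if the address selects the extended mapping."""
    return bool((device_va >> EXTENDED_FLAG_BIT) & 1)


def truncate_to_32_bits(device_va: int) -> int:
    """Model a transfer path that only carries the low 32 bits of the value."""
    return device_va & MASK_32


# Hypothetical extended address: flag bit set, arbitrary low bits.
extended_va = (1 << EXTENDED_FLAG_BIT) | 0x00BE96C8
seen_by_device = truncate_to_32_bits(extended_va)

print(hex(extended_va), is_extended(extended_va))        # 0x8000000000be96c8 True
print(hex(seen_by_device), is_extended(seen_by_device))  # 0xbe96c8 False
```

Once the upper half is lost, the flag is gone and the same address is indistinguishable from a simple one, which is the hypothesis stated above rather than a confirmed diagnosis.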

I'll explore that option if we can verify this is indeed the cause.

kampff commented 3 years ago

(Thank you everyone for working on this issue!) I have a new setup (Custom CM4 carrier with M.2 PCIe-EdgeTPU) and would love to help get this integration working. Are the following repos still the latest progress in userspace/kernel?

mbrooksx commented 3 years ago

Yes, what you've done is essentially everything I've done for debug. The only additional change you alluded to is correct - the compiler is too smart for libedgetpu and expects a competent system that would be able to perform 64-bit wide accesses. I fixed this by using volatile variables to skip caching. My repos of progress are: https://github.com/mbrooksx/libedgetpu (Userspace) https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)
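
As a rough illustration of what splitting a 64-bit register access into two 32-bit accesses looks like, here is a Python sketch. It is not the code from the repos above: the function names are hypothetical, a `bytearray` stands in for the mmap'd register window, and a little-endian layout with the lower half at the lower offset is assumed. In C or C++ the halves would be accessed through volatile pointers so the compiler cannot merge them back into one 64-bit access; Python does no such merging, so the sketch only shows the split and recombine arithmetic.

```python
import struct


def read64_as_two_32(regs, offset: int) -> int:
    """Read a 64-bit register as two little-endian 32-bit accesses."""
    (lower,) = struct.unpack_from("<I", regs, offset)
    (upper,) = struct.unpack_from("<I", regs, offset + 4)
    return (upper << 32) | lower


def write64_as_two_32(regs, offset: int, value: int) -> None:
    """Write a 64-bit register as two little-endian 32-bit accesses."""
    struct.pack_into("<I", regs, offset, value & 0xFFFFFFFF)
    struct.pack_into("<I", regs, offset + 4, (value >> 32) & 0xFFFFFFFF)


regs = bytearray(0x100)  # stand-in for a mapped register region
write64_as_two_32(regs, 0x90, 0x0000000001000000)
readback = read64_as_two_32(regs, 0x90)
print(hex(readback))  # 0x1000000
```

Note that two 32-bit accesses are not atomic as a pair, so a device that latches a register on a single 64-bit write may behave differently from this model.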

julled commented 3 years ago

It would be so sad if it would never be possible to use the Coral Boards via PCIE on the CM4. The combo is the perfect high performance - low power - compact formfactor - multi camera - mainline kernel supported - embedded inference platform. Please please find a way to make it useable.

mbrooksx commented 3 years ago

I completely agree about the potential of the combination. At this point, it looks like an irreparable hardware issue with the antiquated CM4 PCIe module. I have forced all the allocations into simple mapping (see above for more info about this) so that all the virtual addresses are 32-bit, as well as previously setting all reads/writes to 32-bit. However, the device itself (in hardware) makes reads/writes in the coherent cache - all of these reads/writes are 64 bits wide.

For now, the plan is to wait until the office is open so we can use a PCIe analyzer and confirm this hypothesis. But there don't appear to be any additional changes that we can make in software - the device expecting a host to be able to perform 64-bit reads/writes is built into the hardware.

USB is still the recommendation for the CM4. USB 2.0 is possible out of the box, and USB 3.0 may be possible, although extra design considerations are required (more info here: https://coral.ai/products/accelerator-module/).

kampff commented 3 years ago

Choosing to believe this is still possible... here are my current DMESG and libedgetpu logs: (Kernel: 5.10.23-v8+ (aarch64) with gasket/apex modules and libedgetpu from mbrooksx's repos, custom Buildroot Rootfs)

DMESG

[ 1876.006541] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 1876.012884] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 1876.024280] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[ 1876.031596] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.042358] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.048153] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.053923] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.059681] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.065456] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.071141] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.076769] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.082370] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.089568] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f89c74000, dev_addr 0x1000000, num_pages 1
[ 1876.100752] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f89c75000, dev_addr 0x1001000, num_pages 1
[ 1876.160486] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f5f969000, dev_addr 0x0, num_pages 1603
[ 1876.171885] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9c3000, dev_addr 0x1004000, num_pages 3
[ 1876.185214] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88350000, dev_addr 0x1080000, num_pages 66
[ 1876.196648] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9c7000, dev_addr 0x1002000, num_pages 2
[ 1876.208103] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88272000, dev_addr 0x1040000, num_pages 44
[ 1876.219712] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9ca000, dev_addr 0x1008000, num_pages 2
[ 1876.230804] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88231000, dev_addr 0x1100000, num_pages 63

(here the test program hangs until ctrl-c)

[ 1904.820076] apex 0000:01:00.0: Fault VA: 0xbe96c8
[ 1904.826533] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x5, Simple: 0xbe9
[ 1904.837859] apex 0000:01:00.0: Computed Failing Bus Addr: 0x100004000000000
[ 1904.846581] apex 0000:01:00.0: Fault VA: 0xbe96c8
[ 1904.853128] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x5, Simple: 0xbe9
[ 1904.864475] apex 0000:01:00.0: Computed Failing Bus Addr: 0x100004000000000
[ 1904.873204] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[ 1904.880539] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 1904.887108] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 1904.898652] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[ 1904.906057] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.921784] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.927701] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.933515] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.939298] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.945065] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
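
The `Extended_level0` and `Simple` values in the fault lines above appear to be plain bit fields of the fault VA. The Python sketch below reproduces them for the two 32-bit samples in this log. The shifts and the mask are inferred only from those two samples and are assumptions, not values read from the driver source; no mask is applied to the level-0 index because the samples cannot confirm one.

```python
# Inferred from the log lines, not from the driver source:
SIMPLE_PAGE_SHIFT = 12      # 0xbe96c8 >> 12 == 0xbe9
SIMPLE_INDEX_MASK = 0x1FFF  # 0xffffffff is logged with Simple: 0x1fff
EXTENDED_LVL0_SHIFT = 21    # 0xbe96c8 >> 21 == 0x5


def decode_fault_va(va: int) -> tuple:
    """Return (extended_level0, simple) indices for a 32-bit fault VA."""
    simple = (va >> SIMPLE_PAGE_SHIFT) & SIMPLE_INDEX_MASK
    extended_level0 = va >> EXTENDED_LVL0_SHIFT
    return extended_level0, simple


for va in (0xBE96C8, 0xFFFFFFFF):
    lvl0, simple = decode_fault_va(va)
    print(f"Fault VA: {va:#x} -> Extended_level0: {lvl0:#x}, Simple: {simple:#x}")
# Fault VA: 0xbe96c8 -> Extended_level0: 0x5, Simple: 0xbe9
# Fault VA: 0xffffffff -> Extended_level0: 0x7ff, Simple: 0x1fff
```

Both results match the corresponding log lines, which suggests 4 KiB pages for the simple mapping, but two samples are not enough to establish the full address layout.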

libedgetpu (verbosity=10)

I :944] EnumerateDevices: vendor:0x1a6e, product:0x89a                                                                                                                                                            
I :944] EnumerateDevices: vendor:0x18d1, product:0x9302                                                                                                                                                           
Test_EdgeTPU[412]: (main:70): Num EdgeTPU Devices: 1                                                                                                                                                              
I :453] No matching device is already opened for shared ownership.                                                                                                                                                
I :944] EnumerateDevices: vendor:0x1a6e, product:0x89a                                                                                                                                                            
I :944] EnumerateDevices: vendor:0x18d1, product:0x9302                                                                                                                                                           
I :104] USB always DFU: False (default)                                                                                                                                                                           
I :126] USB bulk-in queue capacity: default                                                                                                                                                                       
I :65] Performance expectation: Max (default)                                                                                                                                                                     
I :273] Hello Adam!                                                                                                                                                                                               
I :274] Starting in FUCK YEAH mode                                                                                                                                                                                
I :83] Opening /dev/apex_0. read_only=0                                                                                                                                                                           
I :97] mmap_offset=0x0000000000040000, mmap_size=4096                                                                                                                                                             
I :108] Got map addr at 0x0x7f904db000                                                                                                                                                                            
I :97] mmap_offset=0x0000000000044000, mmap_size=4096                                                                                                                                                             
I :108] Got map addr at 0x0x7f89c79000                                                                                                                                                                            
I :97] mmap_offset=0x0000000000048000, mmap_size=4096                                                                                                                                                             
I :108] Got map addr at 0x0x7f89c78000                                                                                                                                                                            
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000000, value:0x0000000000000000                                     
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000000 upper: = 0x0000000000000000 value: = 0x0000000000000000 mmap: 0x7f89c786f0                                                        
I :282] Page Fault Address: 0x0000000000000000                                                                                                                                                                    
I :195] Write 32 Hacks: offset = 0x00000000000487a8, value = 0x0000000000000000 mmap=0x7f89c787a8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000487a8, value: = 0x0000000000000000                                                                                                                                 
I :240] Offset: 0x0000000000048578, mmap_reg: 0x7f89c78578, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010                                     
I :269] Read 32 Hacks: offset = 0x0000000000048578, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78578                                                        
I :282] Page Fault Address: 0x0000000000000000                                                                                                                                                                    
I :136] MmuMapper#Map() : 0000007f89c74000 -> 0000000001000000 (1 pages) flags=00000000.                                                                                                                          
I :55] MapMemory() page-aligned : device_address = 0x0000000001000000                                                                                                                                             
I :169] Queue base : 0x7f89c74000 -> 0x0000000001000000 [4096 bytes]                                                                                                                                              
I :136] MmuMapper#Map() : 0000007f89c75000 -> 0000000001001000 (1 pages) flags=00000000.                                                                                                                          
I :55] MapMemory() page-aligned : device_address = 0x0000000001001000                                                                                                                                             
I :179] Queue status block : 0x7f89c75000 -> 0x0000000001001000 [16 bytes]                                                                                                                                        
I :195] Write 32 Hacks: offset = 0x0000000000048590, value = 0x0000000001000000 mmap=0x7f89c78590                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000048590, value: = 0x0000000001000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000048598, value = 0x0000000001001000 mmap=0x7f89c78598                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000048598, value: = 0x0000000001001000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000485a0, value = 0x0000000000000100 mmap=0x7f89c785a0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000485a0, value: = 0x0000000000000100                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000048568, value = 0x0000000000000005 mmap=0x7f89c78568                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000048568, value: = 0x0000000000000005                                                                                                                                 
I :240] Offset: 0x0000000000048570, mmap_reg: 0x7f89c78570, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000001, value:0x0000000000000001                                     
I :269] Read 32 Hacks: offset = 0x0000000000048570, lower: = 0x0000000000000001 upper: = 0x0000000000000000 value: = 0x0000000000000001 mmap: 0x7f89c78570                                                        
I :282] Page Fault Address: 0x0000000000000000                                                                                                                                                                    
I :240] Offset: 0x00000000000486d0, mmap_reg: 0x7f89c786d0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000000, value:0x0000000000000000                                     
I :269] Read 32 Hacks: offset = 0x00000000000486d0, lower: = 0x0000000000000000 upper: = 0x0000000000000000 value: = 0x0000000000000000 mmap: 0x7f89c786d0                                                        
I :282] Page Fault Address: 0x0000000000000000                                                                                                                                                                    
I :195] Write 32 Hacks: offset = 0x0000000000044018, value = 0x0000000000000001 mmap=0x7f89c79018                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000044018, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000044158, value = 0x0000000000000001 mmap=0x7f89c79158                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000044158, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000044198, value = 0x0000000000000001 mmap=0x7f89c79198                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000044198, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000441d8, value = 0x0000000000000001 mmap=0x7f89c791d8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000441d8, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000044218, value = 0x0000000000000001 mmap=0x7f89c79218                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000044218, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000048788, value = 0x000000000000007f mmap=0x7f89c78788                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000048788, value: = 0x000000000000007f                                                                                                                                 
I :240] Offset: 0x0000000000048788, mmap_reg: 0x7f89c78788, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x000000000000007f, value:0x000000000000007f                                     
I :269] Read 32 Hacks: offset = 0x0000000000048788, lower: = 0x000000000000007f upper: = 0x0000000000000000 value: = 0x000000000000007f mmap: 0x7f89c78788                                                        
I :282] Page Fault Address: 0x0000000000000000                                                                                                                                                                    
I :195] Write 32 Hacks: offset = 0x00000000000400c0, value = 0x0000000000000001 mmap=0x7f904db0c0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000400c0, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040150, value = 0x0000000000000001 mmap=0x7f904db150                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040150, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040110, value = 0x0000000000000001 mmap=0x7f904db110                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040110, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040250, value = 0x0000000000000001 mmap=0x7f904db250                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040250, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040298, value = 0x0000000000000001 mmap=0x7f904db298                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040298, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000402e0, value = 0x0000000000000001 mmap=0x7f904db2e0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000402e0, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040328, value = 0x0000000000000001 mmap=0x7f904db328                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040328, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040190, value = 0x0000000000000001 mmap=0x7f904db190                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040190, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000401d0, value = 0x0000000000000001 mmap=0x7f904db1d0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000401d0, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x0000000000040210, value = 0x0000000000000001 mmap=0x7f904db210                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x0000000000040210, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000486e8, value = 0x0000000000000000 mmap=0x7f89c786e8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000486e8, value: = 0x0000000000000000                                                                                                                                 
I :45] Set event fd : event_id:0 -> event_fd:8,                                                                                                                                                                   
I :45] Set event fd : event_id:4 -> event_fd:12,                                                                                                                                                                  
I :62] event_fd=8. Monitor thread begin.                                                                                                                                                                          
I :45] Set event fd : event_id:5 -> event_fd:13,                                                                                                                                                                  
I :62] event_fd=12. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:6 -> event_fd:14,                                                                                                                                                                  
I :62] event_fd=13. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:7 -> event_fd:15,                                                                                                                                                                  
I :62] event_fd=14. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:8 -> event_fd:16,                                                                                                                                                                  
I :62] event_fd=15. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:9 -> event_fd:17,                                                                                                                                                                  
I :62] event_fd=16. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:10 -> event_fd:18,                                                                                                                                                                 
I :62] event_fd=17. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:11 -> event_fd:19,                                                                                                                                                                 
I :62] event_fd=18. Monitor thread begin.                                                                                                                                                                         
I :45] Set event fd : event_id:12 -> event_fd:20,                                                                                                                                                                 
I :62] event_fd=19. Monitor thread begin.                                                                                                                                                                         
I :195] Write 32 Hacks: offset = 0x00000000000486a0, value = 0x000000000000000f mmap=0x7f89c786a0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000486a0, value: = 0x000000000000000f                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000485c0, value = 0x0000000000000001 mmap=0x7f89c785c0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000485c0, value: = 0x0000000000000001                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000001 mmap=0x7f89c786c0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000001                                                                                                                                 
I :62] event_fd=20. Monitor thread begin.                                                                                                                                                                         
I :172] Opening device at /dev/apex_0                                                                                                                                                                             
Test_EdgeTPU[412]: (main:75): EdgeTPU - path:type (0=PCIe, 1=USB): /dev/apex_0:0                                                                                                                                  
Test_EdgeTPU[412]: (main:80): Loading Model: /home/kampff/Voight-Kampff/objects_edgetpu.tflite                                                                                                                    
Test_EdgeTPU[412]: (main:82): Model Created
Test_EdgeTPU[412]: (main:89): Options configured: maybe                                                                                                                                                           
Test_EdgeTPU[412]: (main:94): Interpreter Created                                                                                                                                                                 
Test_EdgeTPU[412]: (main:98): Tensors Allocated                                                                                                                                                                   
Test_EdgeTPU[412]: (main:120): NPU inputs: 1 vs 1                                                                                                                                                                 
Test_EdgeTPU[412]: (main:127):  - Input 0 (normalized_input_image_tensor): Dimensionsw: 4                                                                                                                         
Test_EdgeTPU[412]: (main:132):    - Dimension 0: (size: 1)                                                                                                                                                        
Test_EdgeTPU[412]: (main:132):    - Dimension 1: (size: 300)                                                                                                                                                      
Test_EdgeTPU[412]: (main:132):    - Dimension 2: (size: 300)                                                                                                                                                      
Test_EdgeTPU[412]: (main:132):    - Dimension 3: (size: 3)                                                                                                                                                        
Test_EdgeTPU[412]: (main:138): NPU outputs: 4 vs 4                                                                                                                                                                
Test_EdgeTPU[412]: (main:145):  - Ouput 0 (TFLite_Detection_PostProcess): Dimensions: 3                                                                                                                           
Test_EdgeTPU[412]: (main:150):    - Dimension 0: 1)                                                                                                                                                               
Test_EdgeTPU[412]: (main:150):    - Dimension 1: 20)                                                                                                                                                              
Test_EdgeTPU[412]: (main:150):    - Dimension 2: 4)                                                                                                                                                               
Test_EdgeTPU[412]: (main:145):  - Ouput 1 (TFLite_Detection_PostProcess:1): Dimensions: 2                                                                                                                         
Test_EdgeTPU[412]: (main:150):    - Dimension 0: 1)                                                                                                                                                               
Test_EdgeTPU[412]: (main:150):    - Dimension 1: 20)                                                                                                                                                              
Test_EdgeTPU[412]: (main:145):  - Ouput 2 (TFLite_Detection_PostProcess:2): Dimensions: 2                                                                                                                         
Test_EdgeTPU[412]: (main:150):    - Dimension 0: 1)                                                                                                                                                               
Test_EdgeTPU[412]: (main:150):    - Dimension 1: 20)                                                                                                                                                              
Test_EdgeTPU[412]: (main:145):  - Ouput 3 (TFLite_Detection_PostProcess:3): Dimensions: 1                                                                                                                         
Test_EdgeTPU[412]: (main:150):    - Dimension 0: 1)                                                                                                                                                               
Test_EdgeTPU[412]: (main:167): Test Image Loaded                                                                                                                                                                  
Test_EdgeTPU[412]: (main:185): Labels Loaded                                                                                                                                                                      
Test_EdgeTPU[412]: (main:209): Inputs Configured                                                                                                                                                                  
I :47] Adding input "normalized_input_image_tensor" with 270000 bytes.                                                                                                                                            
I :58] Adding output "Squeeze" with 7668 bytes.                                                                                                                                                                   
I :58] Adding output "convert_scores" with 174447 bytes.                                                                                                                                                          
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.                                                                                                                                    
I :310] Request [0]: Submitting P0 request immediately.                                                                                                                                                           
I :373] Request [0]: Need to map parameters.                                                                                                                                                                      
I :136] MmuMapper#Map() : 0000007f5f969000 -> 0000000000000000 (1603 pages) flags=00000002.                                                                                                                       
I :55] MapMemory() page-aligned : device_address = 0x0000000000000000                                                                                                                                             
I :252] Mapped params : Buffer(ptr=0x7f5f969000) -> 0x0000000000000000, 6564224 bytes.                                                                                                                            
I :252] Mapped params : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.                                                                                                                                         
I :387] Request [0]: Need to do parameter-caching.                                                                                                                                                                
I :80] [0] Request constructed.                                                                                                                                                                                   
I :46] InstructionBuffers created.                                                                                                                                                                                
I :653] Created new instruction buffers.                                                                                                                                                                          
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.                                                                                                                                         
I :368] MapDataBuffers() done.                                                                                                                                                                                    
I :187] Linking Parameter: 0x0000000000000000                                                                                                                                                                     
I :136] MmuMapper#Map() : 000000000d9c3000 -> 0000000001004000 (3 pages) flags=00000002.                                                                                                                          
I :55] MapMemory() page-aligned : device_address = 0x0000000001004000                                                                                                                                             
I :223] Mapped "instructions" : Buffer(ptr=0xd9c3000) -> 0x0000000001004000, 11472 bytes. Direction=1                                                                                                             
I :384] MapInstructionBuffers() done.                                                                                                                                                                             
I :481] [0] SetState old=0, new=1.                                                                                                                                                                                
I :393] [0] NotifyRequestSubmitted()                                                                                                                                                                              
I :481] [0] SetState old=1, new=2.                                                                                                                                                                                
I :83] Request[0]: Submitted                                                                                                                                                                                      
I :401] [0] NotifyRequestActive()                                                                                                                                                                                 
I :481] [0] SetState old=2, new=3.                                                                                                                                                                                
I :133] Request[0]: Scheduling DMA[0]                                                                                                                                                                             
I :393] Adding an element to the host queue.                                                                                                                                                                      
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000001 mmap=0x7f89c785a8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000001                                                                                                                                 
I :75] event_fd=20. Monitor thread got num_events=1.                                                                                                                                                              
I :80] [1] Request constructed.                                                                                                                                                                                   
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000000 mmap=0x7f89c786c0                                                                                                                 
I :113] Adding input "normalized_input_image_tensor" with 270000 bytes.                                                                                                                                           
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000000                                                                                                                                 
I :188] Adding output "Squeeze" with 7668 bytes.                                                                                                                                                                  
I :195] Write 32 Hacks: offset = 0x00000000000486c8, value = 0x0000000000000000 mmap=0x7f89c786c8                                                                                                                 
I :188] Adding output "convert_scores" with 174447 bytes.                                                                                                                                                         
I :206] ReRead 32 Hacks: offset = 0x00000000000486c8, value: = 0x0000000000000001                                                                                                                                 
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000211, value:0x0000000000000211                                     
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000211 upper: = 0x0000000000000000 value: = 0x0000000000000211 mmap: 0x7f89c786f0                                                        
I :282] Page Fault Address: 0x0000000000be96c8                                                                                                                                                                    
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010                                     
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700                                                        
I :282] Page Fault Address: 0x0000000000be96c8                                                                                                                                                                    
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010                                     
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700                                                        
I :282] Page Fault Address: 0x0000000000be96c8                                                                                                                                                                    
E :254] HIB Error. hib_error_status = 0000000000000211, hib_first_error_status = 0000000000000010                                                                                                                 
I :75] event_fd=20. Monitor thread got num_events=1.                                                                                                                                                              
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000000 mmap=0x7f89c786c0                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000000                                                                                                                                 
I :195] Write 32 Hacks: offset = 0x00000000000486c8, value = 0x0000000000000000 mmap=0x7f89c786c8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000486c8, value: = 0x0000000000000000                                                                                                                                 
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000211, value:0x0000000000000211                                     
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000211 upper: = 0x0000000000000000 value: = 0x0000000000000211 mmap: 0x7f89c786f0                                                        
I :282] Page Fault Address: 0x0000000000be96c8                                                                                                                                                                    
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010                                     
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700                                                        
I :282] Page Fault Address: 0x0000000000be96c8                                                                                                                                                                    
E :254] HIB Error. hib_error_status = 0000000000000211, hib_first_error_status = 0000000000000010                                                                                                                 
I :46] InstructionBuffers created.                                                                                                                                                                                
I :653] Created new instruction buffers.                                                                                                                                                                          
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.                                                                                                                                         
I :136] MmuMapper#Map() : 0000007f88350000 -> 0000000001080000 (66 pages) flags=00000002.                                                                                                                         
I :55] MapMemory() page-aligned : device_address = 0x0000000001080000                                                                                                                                             
I :223] Mapped "normalized_input_image_tensor" : Buffer(ptr=0x7f88350040) -> 0x0000000001080040, 270000 bytes. Direction=1                                                                                        
I :136] MmuMapper#Map() : 000000000d9c7000 -> 0000000001002000 (2 pages) flags=00000004.                                                                                                                          
I :55] MapMemory() page-aligned : device_address = 0x0000000001002000                                                                                                                                             
I :136] MmuMapper#Map() : 0000007f88272000 -> 0000000001040000 (44 pages) flags=00000004.                                                                                                                         
I :55] MapMemory() page-aligned : device_address = 0x0000000001040000                                                                                                                                             
I :223] Mapped "convert_scores" : Buffer(ptr=0x7f88272000) -> 0x0000000001040000, 176368 bytes. Direction=2                                                                                                       
I :223] Mapped "Squeeze" : Buffer(ptr=0xd9c7000) -> 0x0000000001002000, 7672 bytes. Direction=2                                                                                                                   
I :368] MapDataBuffers() done.                                                                                                                                                                                    
I :93] Linking normalized_input_image_tensor[0]: 0x0000000001080040                                                                                                                                               
I :93] Linking Squeeze[0]: 0x0000000001002000                                                                                                                                                                     
I :93] Linking convert_scores[0]: 0x0000000001040000                                                                                                                                                              
I :136] MmuMapper#Map() : 000000000d9ca000 -> 0000000001008000 (2 pages) flags=00000002.                                                                                                                          
I :55] MapMemory() page-aligned : device_address = 0x0000000001008000                                                                                                                                             
I :136] MmuMapper#Map() : 0000007f88231000 -> 0000000001100000 (63 pages) flags=00000002.                                                                                                                         
I :55] MapMemory() page-aligned : device_address = 0x0000000001100000                                                                                                                                             
I :223] Mapped "instructions" : Buffer(ptr=0x7f88231000) -> 0x0000000001100000, 256992 bytes. Direction=1                                                                                                         
I :223] Mapped "instructions" : Buffer(ptr=0xd9ca000) -> 0x0000000001008000, 7632 bytes. Direction=1                                                                                                              
I :384] MapInstructionBuffers() done.                                                                                                                                                                             
I :481] [1] SetState old=0, new=1.                                                                                                                                                                                
I :393] [1] NotifyRequestSubmitted()                                                                                                                                                                              
I :481] [1] SetState old=1, new=2.                                                                                                                                                                                
I :83] Request[1]: Submitted                                                                                                                                                                                      
I :401] [1] NotifyRequestActive()                                                                                                                                                                                 
I :481] [1] SetState old=2, new=3.                                                                                                                                                                                
I :133] Request[1]: Scheduling DMA[0]                                                                                                                                                                             
I :393] Adding an element to the host queue.                                                                                                                                                                      
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000002 mmap=0x7f89c785a8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000002                                                                                                                                 
I :133] Request[1]: Scheduling DMA[1]                                                                                                                                                                             
I :393] Adding an element to the host queue.                                                                                                                                                                      
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000003 mmap=0x7f89c785a8                                                                                                                 
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000003

The program hangs until killed with Ctrl-C...
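The page counts in the `MmuMapper#Map()` lines above can be cross-checked against the buffer sizes in the matching `Mapped ...` lines. This is a minimal sketch, assuming a 4096-byte page size (the log does not state the page size) and using a hypothetical helper name `pages_spanned`; the pointer, size, and page-count values are copied from the log.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the log


def pages_spanned(addr, size, page_size=PAGE_SIZE):
    """Return how many pages the byte range [addr, addr + size) touches."""
    if size == 0:
        return 0
    first_page = addr // page_size
    last_page = (addr + size - 1) // page_size
    return last_page - first_page + 1


# (buffer pointer, size in bytes, page count reported by MmuMapper#Map())
LOG_ENTRIES = [
    (0x7F5F969000, 6564224, 1603),  # params
    (0x0D9C3000, 11472, 3),         # "instructions" (request 0)
    (0x7F88350040, 270000, 66),     # "normalized_input_image_tensor"
    (0x7F88272000, 176368, 44),     # "convert_scores"
    (0x0D9C7000, 7672, 2),          # "Squeeze"
    (0x7F88231000, 256992, 63),     # "instructions" (request 1)
    (0x0D9CA000, 7632, 2),          # "instructions" (request 1)
]

for addr, size, reported in LOG_ENTRIES:
    computed = pages_spanned(addr, size)
    print(f"{addr:#014x} {size:>8} bytes: computed={computed} reported={reported}")
```

Under that page-size assumption the computed counts agree with the reported ones, including the input tensor, whose buffer starts 0x40 bytes into its first page.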
mbrooksx commented 3 years ago

These logs look like what I see as well. The HIB error there (hib_error_status = 0000000000000211) still indicates read failures.
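As a small aid for reading the two status values, the sketch below lists which bit positions are set in each. It is plain bit arithmetic on the values from the log; the thread does not give the names or meanings of the individual HIB error bits, so none are assumed here, and `set_bits` is a hypothetical helper name.

```python
def set_bits(value):
    """Return the positions (0 = least significant) of the bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


hib_error_status = 0x211        # from the "HIB Error" log line
hib_first_error_status = 0x010  # from the same log line

print(set_bits(hib_error_status))        # [0, 4, 9]
print(set_bits(hib_first_error_status))  # [4]
```

The bit reported in `hib_first_error_status` (bit 4) is also set in `hib_error_status`, alongside bits 0 and 9.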

I recently became aware of a new-ish DT Overlay from the Pi team for 32 bit DMA (I found it in this thread for bringing up a USB controller) - pcie-32bit-dma.dtbo. Alas adding it has no effect (and I verified it does cleanly apply).

julled commented 3 years ago

I think this new overlay originated from this issue over here: https://github.com/raspberrypi/linux/issues/4197#issuecomment-794014591

Maybe you can find some ideas on the problem in there ?

comdet commented 3 years ago

Although I cannot help, I come here every day hoping to see these two work together ^^

mbrooksx commented 3 years ago

@julled - Yeah this new overlay is unrelated (and frankly I can't believe that overlay is actually useful). Thanks for finding the source.

estabanalamedia commented 3 years ago

At Google I/O 2021, the Coral team announced companies they were working with to develop TPU projects; Gumstix was one. Gumstix has a Pixhawk development board which uses Coral and the CM4. I asked the Coral team if that unit works (since it uses the PCIe interface to talk to the TPU). Their response was:

"Unfortunately, we haven't been able to run the TPU on a 32-bit system (estaban - am assuming this means any?) . Please refer to this issue: https://github.com/google-coral/edgetpu/issues/280 (estaban - this posting). The CM4 has a 32-bit bus, and despite changing both the driver and userspace (see bug for links to GitHub repos with those changes) - the device still is hardcoded to issue 64-bit operations. We expect that the 32-bit host simply omits the upper word, leading to invalid read/writes (as reflected in the HIB error)."

The bottom of the email has the following bug report fields: Status - In-progress Priority - Medium Status Detail - Assigned

Not solved, but maybe not dropped... and maybe no 32-bit processor can use it. That seems important for an Edge TPU.

R/Estaban
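The failure mode described in the quoted Coral reply (a 64-bit operation whose upper word is dropped by a 32-bit host) can be illustrated with plain integer arithmetic. This is only a sketch of the idea, not the driver's or libedgetpu's actual code: the helper names are hypothetical, and the example address with a non-zero upper word is made up for illustration. The `0x211` value matches the `lower`/`Upper`/`value` fields printed by the "Read 32 Hacks" log lines.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit value into (lower, upper) 32-bit words."""
    return value & MASK32, (value >> 32) & MASK32


def join64(lower, upper):
    """Rebuild a 64-bit value from two 32-bit words."""
    return ((upper & MASK32) << 32) | (lower & MASK32)


def drop_upper_word(value):
    """Model a host that silently discards the upper 32 bits."""
    return value & MASK32


# Values from the log: upper word 0, lower word 0x211.
print(hex(join64(0x211, 0x0)))  # 0x211

# Hypothetical 64-bit address with a non-zero upper word.
addr = 0x0000000100001000
lower, upper = split64(addr)
print(hex(lower), hex(upper))      # 0x1000 0x1
print(hex(drop_upper_word(addr)))  # 0x1000, no longer the intended address
```

Splitting and rejoining is lossless only when both words are transferred; if the upper word is omitted, any value above 4 GiB lands on a different address.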

mbrooksx commented 3 years ago

Yes, we worked with GumStix to enable a USB3.0 + CM4 solution. They are using a PCIe to USB3 bridge to accomplish this (since the CM4 only pins out USB2.0). It fully passes Coral CTS and achieves the expected performance for USB3. They have this in both the IP Camera form factor and the Pixhawk.

For USB designs it is possible to use their Upverter platform to create a known-good USB3.0 baseboard for the CM4 or reach out to Coral Sales to discuss the special design considerations for USB3.

Will keep this issue open, however, for PCIe discussion.

mbrooksx commented 3 years ago

Yes, we worked with GumStix to enable a USB3.0 + CM4 solution. They are using a PCIe to USB3 bridge to accomplish this (since the CM4 only pins out USB2.0). It fully passes Coral CTS and achieves the expected performance for USB3. They have this in both the IP Camera form factor and the Pixhawk. For USB designs it is possible to use their Upverter platform to create a known-good USB3.0 baseboard for the CM4 or reach out to Coral Sales to discuss the special design considerations for USB3. Will keep this issue open, however, for PCIe discussion.

The Accelerator Module is USB2, but GumStix is using a USB3 version that requires contacting sales to purchase. Are there any performance benefits to getting the USB3 version instead of USB2? Is the USB3 version going to be generally available? Thanks.

The accelerator module supports PCIe, USB3, and USB2. USB3 requires working with us on extra design considerations to ensure that the design will work properly. As we validate more designs, we may make this information generally available, but we want to ensure it can work across many designs (instead of setting people up for failure).

As for performance, it's a significant difference. I would recommend referring to the CTS outputs. The bottom of the CTS outputs contains benchmarks; specifically, I'd compare the Dev Board (A53 + PCIe) and Dev Board Mini (A35 + USB2). While the Dev Board is PCIe, it's a fairer comparison than x86 + USB (also in the CTS outputs) because CPU time is much faster on the beefier x86 platform.

petergerten commented 3 years ago

Reading this thread, it is still not clear to me whether there is a hardware limitation that prevents us from having CM4 + Edge TPU via PCIe in the future. Is this a complicated software issue or potentially a hardware dead end?

It seems that Gumstix for example has at least one board with the Edge TPU via PCI-E interface: https://www.gumstix.com/cm4-uprev-ai.html

kklem0 commented 3 years ago

Reading this thread, it is still not clear to me whether there is a hardware limitation that prevents us from having CM4 + Edge TPU via PCIe in the future. Is this a complicated software issue or potentially a hardware dead end?

It seems that Gumstix for example has at least one board with the Edge TPU via PCI-E interface: https://www.gumstix.com/cm4-uprev-ai.html

According to Gumstix and Google, their solution is CM4 -> PCIe -> USB3 -> Coral TPU, so you still get the fast performance instead of CM4 -> USB2 -> Coral TPU. Gumstix is using ASM1142 USBH1 to Coral Accelerator Module.

As long as this issue is still open, there's no working solution yet for CM4 -> PCIe -> Coral TPU.

petergerten commented 3 years ago

So the U1 component here would be the USB3 bridge I guess? https://www.gumstix.com/media/catalog/product/cache/74c1057f7991b4edb2bc7bdaa94de933/P/K/PKG900000001464_overview.png It would indeed be very interesting to see the USB3 design.

kklem0 commented 3 years ago

So the U1 component here would be the USB3 bridge I guess? https://www.gumstix.com/media/catalog/product/cache/74c1057f7991b4edb2bc7bdaa94de933/P/K/PKG900000001464_overview.png It would indeed be very interesting to see the USB3 design.

I own this device. U1 is TCA6416A, not for USB.

In the one in your link it seems like the USB is connected directly to CM4 with USB2. I also own an updated version of this product that has a SMSC USB2422 and it's only for USB2.

But the other products like the Gumstix Raspberry Pi CM4 PoE Smart Camera is using ASM1142 which is PCIe to USB3 Host Controller, so for now I've put ASM1142 into my project and hope to get more information from Google/Gumstix soon.

Valdiolus commented 3 years ago

Hm, the ASM1142 is an interesting solution. I will buy one and try it with the TPU Accelerator Module. I am interested in the inference speed: on the RPi 4B it is 1/2 of the speed of real USB3. I am still looking for a PCIe solution; I have a PCIe-to-miniPCIe PCB plus a miniPCIe TPU module and I can test it.

estabanalamedia commented 3 years ago

Had submitted the PCIe driver issue to Coral/Google (Issue 1011084300), resulting in the earlier posting. As of yesterday (6/2/21), they sent me an email with the following statement "Your issue has been completed". Presumably that could mean the driver has been repaired or the PCIe-USB3 solution has been codified. The Coral site contains some driver troubleshooting information (don't know if it was there before). https://coral.ai/docs/m2/get-started/

browntownington commented 3 years ago

Presumably that could mean the driver has been repaired or the PCIe-USB3 solution has been codified.

From what I can see, neither the stable nor the unstable driver has been updated since 5th and 6th Feb. https://packages.cloud.google.com/apt/dists/coral-edgetpu-stable/main/binary-arm64/Release https://packages.cloud.google.com/apt/dists/coral-edgetpu-unstable/main/binary-arm64/Release

The Coral site contains some driver troubleshooting information (don't know if it was there before).

Also, I checked the Wayback Machine and it looks like the troubleshooting doco has been there since the beginning, or at least since Jan 2020: https://web.archive.org/web/20200206115041if_/https://coral.ai/docs/m2/get-started/#troubleshooting

mathislm commented 3 years ago

Hello, I'm thinking about getting a Coral with my raspberry PI4 (not a Compute Module), the easy way would be to get the usb stick but I live in France and it is not delivered there. Would I get all the problems you guys are getting if I were to plug a mini PCIe Coral on my rasp through something like mini PCIe -> USB -> RPI4 ?

kklem0 commented 3 years ago

Hello, I'm thinking about getting a Coral with my raspberry PI4 (not a Compute Module), the easy way would be to get the usb stick but I live in France and it is not delivered there. Would I get all the problems you guys are getting if I were to plug a mini PCIe Coral on my rasp through something like mini PCIe -> USB -> RPI4 ?

This has nothing to do with this issue. PCIe is used on the RPi4 for USB3 and needs customization to make use of it. I think your best chance is still to get the USB Accelerator, which uses USB3.

timayy commented 3 years ago

Damn. So if I'm reading this correctly, there's currently no way to use any of the PCIe or M.2 Coral Accelerators with a Compute Module 4 (PCIe TPU -> CM4)?

Is it possible to use a PCIe-to-USB3 with a Compute Module 4 and then use the Coral USB Accelerator? Like: USB TPU -> PCIe-to-USB3 -> CM4.

Also, @browntownington those links seem to indicate that the drivers were updated recently! Could be promising?

mbrooksx commented 3 years ago

There hasn't been an update to the gasket driver (which I've now moved to https://github.com/google/gasket-driver) or libedgetpu to enable 32-bit operation required by the CM4. The only way we're aware of to communicate between the CM4 and the TPU is via USB3.

This can be accomplished by starting with a known-good design from Gumstix and customizing in Upverter (based on this board) or by reaching out to Coral Sales to discuss how to build your own USB3 design.

As for the performance of USB3 + CM4, here is the models_benchmark output on the Gumstix PoE camera (this can be compared to the tested CTS runs). You'll see it significantly outperforms the USB2 design (Dev Board Mini) but is slightly slower than the x86 USB3 setup (due to the more powerful CPU) or the Dev Board (due to the slight latency added by the PCIe-USB bridge on the CM4 design).

-----------------------------------------------------
models_benchmark
-----------------------------------------------------
2021-05-13 22:49:44
Running /home/pi/coral/cts/models_benchmark
Run on (4 X 1500 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------
Benchmark                                                                    Time           CPU Iterations
-----------------------------------------------------------------------------------------------------------
BM_MobileNetV1<coral::kEdgeTpu>                                        2474282 ns     123110 ns       6085
BM_MobileNetV1<coral::kCpu>                                          264763991 ns  264650975 ns          3
BM_MobileNetV1_25<coral::kEdgeTpu>                                      992947 ns     133297 ns       7661
BM_MobileNetV1_25<coral::kCpu>                                        15568433 ns   15554960 ns         35
BM_MobileNetV1_50<coral::kEdgeTpu>                                     1334842 ns     125070 ns       6861
BM_MobileNetV1_50<coral::kCpu>                                        57824373 ns   57769407 ns          9
BM_MobileNetV1_75<coral::kEdgeTpu>                                     1688143 ns     123297 ns       6376
BM_MobileNetV1_75<coral::kCpu>                                       139678319 ns  139581462 ns          6
BM_MobileNetV1_L2Norm<coral::kEdgeTpu>                                 3427892 ns    1134517 ns        491
BM_MobileNetV1_L2Norm<coral::kCpu>                                   268103838 ns  267834574 ns          2
BM_MobileNetV2<coral::kEdgeTpu>                                        2713575 ns     117379 ns       1000
BM_MobileNetV2<coral::kCpu>                                          259893815 ns  259716474 ns          3
BM_MobileNetV2INatPlant<coral::kEdgeTpu>                               2957550 ns     140127 ns       1000
BM_MobileNetV2INatPlant<coral::kCpu>                                 272773147 ns  272692758 ns          2
BM_MobileNetV2INatInsect<coral::kEdgeTpu>                              2756209 ns     128306 ns       1000
BM_MobileNetV2INatInsect<coral::kCpu>                                288822651 ns  288533369 ns          3
BM_MobileNetV2INatBird<coral::kEdgeTpu>                                2697566 ns     111712 ns       1000
BM_MobileNetV2INatBird<coral::kCpu>                                  253786802 ns  253703727 ns          3
BM_SsdMobileNetV1<coral::kEdgeTpu>                                   178930700 ns  164907000 ns          4
BM_SsdMobileNetV1<coral::kCpu>                                       817034721 ns  816729350 ns          1
BM_SsdMobileNetV2<coral::kEdgeTpu>                                    19009941 ns    8747667 ns         89
BM_SsdMobileNetV2<coral::kCpu>                                       600252151 ns  599998258 ns          1
BM_FaceSsd<coral::kEdgeTpu>                                           21361053 ns   15720419 ns         56
BM_FaceSsd<coral::kCpu>                                              630894423 ns  630591164 ns          1
BM_InceptionV1<coral::kEdgeTpu>                                        3458645 ns     138411 ns       1000
BM_InceptionV1<coral::kCpu>                                          538686872 ns  538445563 ns          2
BM_InceptionV2<coral::kEdgeTpu>                                       16778940 ns     180806 ns       1000
BM_InceptionV2<coral::kCpu>                                          783592939 ns  783126423 ns          1
BM_InceptionV3<coral::kEdgeTpu>                                       49608231 ns     190925 ns        100
BM_InceptionV3<coral::kCpu>                                         2072519064 ns 2072098252 ns          1
BM_InceptionV4<coral::kEdgeTpu>                                       96992586 ns     200406 ns        100
BM_InceptionV4<coral::kCpu>                                         4660374403 ns 4658627612 ns          1
BM_EfficientNetEdgeTpuSmall<coral::kEdgeTpu>                           5179719 ns     127570 ns       1000
BM_EfficientNetEdgeTpuSmall<coral::kCpu>                             978528500 ns  978162145 ns          1
BM_EfficientNetEdgeTpuMedium<coral::kEdgeTpu>                          9829099 ns     179685 ns       1000
BM_EfficientNetEdgeTpuMedium<coral::kCpu>                           1640263796 ns 1639636846 ns          1
BM_EfficientNetEdgeTpuLarge<coral::kEdgeTpu>                          27017634 ns     184506 ns        100
BM_EfficientNetEdgeTpuLarge<coral::kCpu>                            4013653278 ns 4012972856 ns          1
BM_Deeplab513Mv2Dm1_WithArgMax<coral::kEdgeTpu>                      326646686 ns  303988036 ns          2
BM_Deeplab513Mv2Dm1_WithArgMax<coral::kCpu>                         2587352037 ns 2586685081 ns          1
BM_Deeplab513Mv2Dm05_WithArgMax<coral::kEdgeTpu>                     331540823 ns  314183018 ns          2
BM_Deeplab513Mv2Dm05_WithArgMax<coral::kCpu>                        1283381701 ns 1282998809 ns          1
BM_KerasPostTrainingQuantizedUnetMv2128<coral::kEdgeTpu>              10664440 ns    3564409 ns        211
BM_KerasPostTrainingQuantizedUnetMv2128<coral::kCpu>                 306444764 ns  306451518 ns          2
BM_KerasPostTrainingQuantizedUnetMv2256<coral::kEdgeTpu>             125483203 ns     407458 ns        100
BM_KerasPostTrainingQuantizedUnetMv2256<coral::kCpu>                1430027962 ns 1429929847 ns          1
BM_SsdMobileNetV1FineTunedPet<coral::kEdgeTpu>                        71192741 ns   60151344 ns         10
BM_SsdMobileNetV1FineTunedPet<coral::kCpu>                           645666361 ns  645403571 ns          1
BM_PostTrainingQuantizedTf2KerasMobileNetV1<coral::kEdgeTpu>           2590886 ns     136522 ns       1000
BM_PostTrainingQuantizedTf2KerasMobileNetV1<coral::kCpu>             292732080 ns  292613949 ns          3
BM_PostTrainingQuantizedTf2KerasMobileNetV2<coral::kEdgeTpu>           2839777 ns     134647 ns       1000
BM_PostTrainingQuantizedTf2KerasMobileNetV2<coral::kCpu>             218506495 ns  218423487 ns          3
BM_PostTrainingQuantizedTf2KerasMobileNetV3EdgeTpu<coral::kEdgeTpu>    2997397 ns     122935 ns       1000
BM_PostTrainingQuantizedTf2KerasMobileNetV3EdgeTpu<coral::kCpu>      424009442 ns  423864110 ns          2
BM_SsdLiteMobileDet<coral::kEdgeTpu>                                  19936002 ns    8912055 ns         66
BM_SsdLiteMobileDet<coral::kCpu>                                     834558010 ns  834245273 ns          1
BM_SsdMobileNetV1_NoNms<coral::kEdgeTpu>                              17669110 ns    7159439 ns        102
BM_SsdMobileNetV1_NoNms<coral::kCpu>                                 629323244 ns  602055865 ns          1
BM_SsdMobileNetV2_NoNms<coral::kEdgeTpu>                              24258748 ns    9243972 ns         80
BM_SsdMobileNetV2_NoNms<coral::kCpu>                                 668772697 ns  643773166 ns          1
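
To make the Edge TPU vs. CPU comparison in the log above easier to read, here is a minimal Python sketch that parses rows in this format and computes the wall-clock speedup per model. The function names (`parse_wall_times`, `speedups`) are made up for illustration and are not part of any Coral tooling; the sample rows are copied verbatim from the log above, and the regular expression assumes every row follows that same layout.

```python
import re

# A few rows copied verbatim from the models_benchmark log above.
SAMPLE = """\
BM_MobileNetV1<coral::kEdgeTpu>                                        2474282 ns     123110 ns       6085
BM_MobileNetV1<coral::kCpu>                                          264763991 ns  264650975 ns          3
BM_SsdMobileNetV2<coral::kEdgeTpu>                                    19009941 ns    8747667 ns         89
BM_SsdMobileNetV2<coral::kCpu>                                       600252151 ns  599998258 ns          1
BM_InceptionV4<coral::kEdgeTpu>                                       96992586 ns     200406 ns        100
BM_InceptionV4<coral::kCpu>                                         4660374403 ns 4658627612 ns          1
"""

# Columns: name<backend>, wall time (ns), CPU time (ns), iterations.
ROW = re.compile(
    r"^BM_(\w+)<coral::k(EdgeTpu|Cpu)>\s+(\d+) ns\s+(\d+) ns\s+(\d+)$"
)


def parse_wall_times(log_text):
    """Return {model: {"EdgeTpu": ns, "Cpu": ns}} from the wall-clock "Time" column."""
    times = {}
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:  # headers, separators and warnings are skipped
            model, backend, wall_ns = match.group(1), match.group(2), int(match.group(3))
            times.setdefault(model, {})[backend] = wall_ns
    return times


def speedups(times):
    """Return {model: cpu_ns / edgetpu_ns} for models that have both rows."""
    return {
        model: t["Cpu"] / t["EdgeTpu"]
        for model, t in times.items()
        if "Cpu" in t and "EdgeTpu" in t
    }


if __name__ == "__main__":
    for model, ratio in speedups(parse_wall_times(SAMPLE)).items():
        print(f"{model}: {ratio:.1f}x faster on the Edge TPU (wall clock)")
```

The sketch uses the "Time" column rather than "CPU" because, for the Edge TPU rows, most of the elapsed time is spent waiting on the accelerator and the USB link, so host CPU time alone would understate the latency. On the sample rows this works out to roughly 107x for MobileNetV1, 32x for SsdMobileNetV2 and 48x for InceptionV4.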

timayy commented 3 years ago

There hasn't been an update to the gasket driver (which I've now moved to https://github.com/google/gasket-driver) or libedgetpu to enable 32-bit operation required by the CM4. The only way we're aware of to communicate between the CM4 and the TPU is via USB3.

Thanks for the link to the new repo; I'll keep an eye out, @mbrooksx! Is there a fix your team and you (or the RPi team) are currently investigating for this? I think above you mentioned looking at PCIe configuration space (BAR mem)? Sucks that it's related to memory coherence, though.

This can be accomplished by starting with a known-good design from Gumstix and customizing in Upverter (based on this board) or by reaching out to Coral Sales to discuss how to build your own USB3 design.

That reference board and Altium app look really promising, appreciate the heads up! Those times are good too, with not too much of a drop in performance. PCIe to multiple USB3 ports might also allow pipelining across multiple TPUs on the CM4.

timonsku commented 3 years ago

I completely agree about the potential with the combination. At this point, it looks like an irreparable hardware issue with the antiquated CM4 PCIe module. I have forced all the allocations into simple mapping (see above for more info about this) so that all the virtual addresses are 32-bit, as well as previously setting all reads/writes to 32-bit. However, the device itself (in hardware) makes reads/writes in the coherent cache, and all of these reads/writes are 64-bit.

For now, the plan is to wait until the office is open so we can use a PCIe analyzer and confirm this hypothesis. But there don't appear to be any additional changes that we can make in SW: the expectation that the host can perform 64-bit reads/writes is built into the device hardware.
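
To illustrate what "setting all reads/writes to 32-bit" means on the software side, here is a small Python model of a 64-bit register access carried out as two 32-bit accesses, low word first. This is only a sketch and not the gasket driver's code: `Fake32BitBus`, `write64_lo_hi` and `read64_lo_hi` are hypothetical names, and the dictionary stands in for memory-mapped registers.

```python
MASK32 = 0xFFFFFFFF


def split_u64(value):
    """Split an unsigned 64-bit value into (low, high) 32-bit halves."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value does not fit in 64 bits")
    return value & MASK32, (value >> 32) & MASK32


def join_u64(low, high):
    """Rebuild a 64-bit value from its 32-bit halves."""
    return ((high & MASK32) << 32) | (low & MASK32)


class Fake32BitBus:
    """Hypothetical register file that only accepts 32-bit accesses."""

    def __init__(self):
        self.words = {}

    def write32(self, offset, value):
        if not 0 <= value <= MASK32:
            raise ValueError("this bus only carries 32-bit words")
        self.words[offset] = value

    def read32(self, offset):
        return self.words.get(offset, 0)


def write64_lo_hi(bus, offset, value):
    """Emulate a 64-bit write as two 32-bit writes (low word, then high word)."""
    low, high = split_u64(value)
    bus.write32(offset, low)
    bus.write32(offset + 4, high)


def read64_lo_hi(bus, offset):
    """Emulate a 64-bit read as two 32-bit reads (low word, then high word)."""
    return join_u64(bus.read32(offset), bus.read32(offset + 4))


if __name__ == "__main__":
    bus = Fake32BitBus()
    write64_lo_hi(bus, 0x10, 0x1122334455667788)
    print(hex(read64_lo_hi(bus, 0x10)))
```

Note that the two halves are not written atomically, and that this kind of splitting only covers accesses the host software initiates. It does nothing for the transactions the device issues on its own, which is the limitation described above.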

Any update on this? All the talk about going with a USB3 host controller sounds like there is not much hope left for the PCIe driver to work out? :)

SamueldeFaria commented 3 years ago

Hi,

"...reaching out to Coral Sales to discuss how to build your own USB3 design..." Good luck with that. I had 2 prototype PCIe PCBs made and, guess what, I had the same problems as you guys. I contacted sales and, besides having to sign an NDA, which would be normal, they also wanted to know what type of product I was designing, who my customer was, the volumes of devices to be made, etc. Even with that, they mentioned that "...PHY trace used with USB3 is very peculiar..." so they could not give support. Of course I did not answer the email. Customer and product are confidential, at least at the company I work for. It's a USB3 design, what the... it's not a spaceship I'm designing... at least they did not ask for my knickers size. I'm looking into an Intel solution instead.

So good luck with support from Google.

Regards

browntownington commented 3 years ago

Does anyone know if the Rock Pi 4 or Rock Pro 64 (RK3399) support the Coral PCIe/M.2 modules?

I'm guessing that if the CPU supports MSI-X, it should?