google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Edge tpu m.2 without MSI-X interrupts? #122

Closed fgervais closed 3 years ago

fgervais commented 4 years ago

I'm trying to run the Edge TPU M.2 module on a device that does not support MSI-X interrupts, but it doesn't seem to work.

I get:

[ 1406.314900] apex 0000:01:00.0: Couldn't initialize interrupts: -28
[ 1406.322735] apex 0000:01:00.0: Not all interrupts were configured

Is there something I can do to work around this issue?
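
The -28 in the log above is a negated errno value, which is how kernel drivers usually report failures. A minimal Python sketch for turning such a code into its symbolic name follows; the helper name describe_kernel_error is made up for illustration. On Linux, errno 28 is ENOSPC, which would be consistent with the interrupt vector allocation failing, though the thread does not confirm the exact code path.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code (as printed in dmesg) to its errno name.

    Hypothetical helper for illustration; not part of any Coral tooling.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


if __name__ == "__main__":
    # -28 is the value from the "Couldn't initialize interrupts" line.
    print(describe_kernel_error(-28))
```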

Namburger commented 4 years ago

@fgervais Could you give more info on the device?

uname -a
modinfo gasket
modinfo apex

fgervais commented 4 years ago

root@buildroot:~# uname -a
Linux buildroot 5.6.1-00002-gd947bcc2b6e5-dirty #33 SMP PREEMPT Wed May 20 19:59:11 UTC 2020 aarch64 GNU/Linux

root@buildroot:~# lsmod
Module                  Size  Used by
apex                   16384  5

root@buildroot:~# modinfo /boot/apex.ko 
filename:       /boot/apex.ko
author:         John Joseph <jnjoseph@google.com>
license:        GPL v2
version:        1.0
description:    Google Apex driver
srcversion:     5D5E9910EBA1E634B8F4C69
alias:          pci:v00001AC1d0000089Asv*sd*bc*sc*i*
depends:        
staging:        Y
intree:         Y
name:           apex
vermagic:       5.6.1-00002-gd947bcc2b6e5-dirty SMP preempt mod_unload modversions aarch64
parm:           allow_power_save:int
parm:           allow_sw_clock_gating:int
parm:           allow_hw_clock_gating:int
parm:           bypass_top_level:int

Namburger commented 4 years ago

@fgervais Thanks for sharing. It looks like you are still using the default kernel module, most likely because the kernel already ships these modules. In that case, you'll need to blacklist the old modules before installing the new ones.

For reference:

$ modinfo gasket
filename:       /lib/modules/5.4.0-bpi-r64/updates/dkms/gasket.ko
author:         Rob Springer <rspringer@google.com>
license:        GPL v2
version:        1.1.3
description:    Google Gasket driver framework
srcversion:     069B6D0F6AE12073F4EAF5D
depends:        
name:           gasket
vermagic:       5.4.0-bpi-r64 SMP preempt aarch64
parm:           dma_bit_mask:int

$ modinfo apex
filename:       /lib/modules/5.4.0-bpi-r64/updates/dkms/apex.ko
author:         John Joseph <jnjoseph@google.com>
license:        GPL v2
version:        1.1
description:    Google Apex driver
srcversion:     508A8A34D57322CEA287D17
alias:          pci:v00001AC1d0000089Asv*sd*bc*sc*i*
depends:        gasket
name:           apex
vermagic:       5.4.0-bpi-r64 SMP preempt aarch64
...

This should fix this issue but we may possibly see other errors :)
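
One way to tell the in-kernel staging module from the DKMS build is to look at the modinfo fields: in the outputs in this thread, the staging build reports "staging: Y" and "intree: Y", while the DKMS build does not. The sketch below parses modinfo text and applies that heuristic. The function names and the SAMPLE string are illustrative assumptions, and the check is only a rule of thumb based on the outputs shown here.

```python
def parse_modinfo(text: str) -> dict:
    """Parse `modinfo` output into a dict, keeping the first value per key."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Repeated keys such as "parm" keep only their first value.
            fields.setdefault(key.strip(), value.strip())
    return fields


def looks_in_tree(fields: dict) -> bool:
    """Heuristic: the staging driver reports intree/staging as 'Y'."""
    return fields.get("intree") == "Y" or fields.get("staging") == "Y"


# Abbreviated sample modelled on the first modinfo output in this thread.
SAMPLE = """\
filename:       /boot/apex.ko
version:        1.0
staging:        Y
intree:         Y
name:           apex
"""

if __name__ == "__main__":
    info = parse_modinfo(SAMPLE)
    print(info["version"], "in-tree" if looks_in_tree(info) else "out-of-tree")
```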

fgervais commented 4 years ago

I tried the drivers from the gasket-dkms_1.0-10_all package; I think they are the correct/latest ones. Still no luck, though.

[  225.804468] gasket: loading out-of-tree module taints kernel.
[  234.401300] apex 0000:01:00.0: can't enable device: BAR 0 [mem 0x4040100000-0x4040103fff 64bit pref] not claimed
[  234.411559] apex 0000:01:00.0: BAR 2: assigned [mem 0x4040000000-0x40400fffff 64bit pref]
[  234.419780] apex 0000:01:00.0: BAR 0: assigned [mem 0x4040100000-0x4040103fff 64bit pref]
[  234.433261] apex 0000:01:00.0: Couldn't initialize interrupts: -28
[  239.668711] apex 0000:01:00.0: Apex performance not throttled due to temperature

root@buildroot:~# modinfo /boot/gasket.ko
filename:       /boot/gasket.ko
author:         Rob Springer <rspringer@google.com>
license:        GPL v2
version:        1.1.3
description:    Google Gasket driver framework
srcversion:     DA0710C11C472F2C9237382
depends:        
name:           gasket
vermagic:       5.6.1-00002-gd947bcc2b6e5-dirty SMP preempt mod_unload modversions aarch64
parm:           dma_bit_mask:int

root@buildroot:~# modinfo /boot/apex.ko   
filename:       /boot/apex.ko
author:         John Joseph <jnjoseph@google.com>
license:        GPL v2
version:        1.1
description:    Google Apex driver
srcversion:     508A8A34D57322CEA287D17
alias:          pci:v00001AC1d0000089Asv*sd*bc*sc*i*
depends:        gasket
name:           apex
vermagic:       5.6.1-00002-gd947bcc2b6e5-dirty SMP preempt mod_unload modversions aarch64
parm:           allow_power_save:int
parm:           allow_sw_clock_gating:int
parm:           allow_hw_clock_gating:int
parm:           bypass_top_level:int
parm:           trip_point0_temp:int
parm:           trip_point1_temp:int
parm:           trip_point2_temp:int
parm:           hw_temp_warn1:int
parm:           hw_temp_warn2:int
parm:           hw_temp_warn1_en:bool
parm:           hw_temp_warn2_en:bool
parm:           temp_poll_interval:int

Namburger commented 4 years ago

@fgervais I see, sorry for the trouble. Can you also include the output of

lscpu
ls /dev/apex_0

and the whole

dmesg

fgervais commented 4 years ago

root@buildroot:~# lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           4
Model name:                      Cortex-A53
Stepping:                        r0p4
CPU max MHz:                     1600.0000
CPU min MHz:                     700.0000
BogoMIPS:                        50.00
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

root@buildroot:~# ls -l /dev/apex_0
crwxrwxrwx 1 root root 120, 0 May 25 21:49 /dev/apex_0

root@buildroot:~# dmesg
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[    0.000000] Linux version 5.6.1-00002-gd947bcc2b6e5-dirty (@10346531fdd6) (gcc version 8.3.0 (Buildroot 2019.11-00011-g9bc2596d3f)) #35 SMP PREEMPT Fri May 22 15:29:13 UTC 2020
[    0.000000] Machine model: LS1043A
[    0.000000] earlycon: uart8250 at MMIO 0x00000000021c0500 (options '')
[    0.000000] printk: bootconsole [uart8250] enabled
[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: UEFI not found.
[    0.000000] Reserved memory: created DMA memory pool at 0x00000000fb800000, size 4 MiB
[    0.000000] OF: reserved mem: initialized node qman-fqd, compatible id shared-dma-pool
[    0.000000] Reserved memory: created DMA memory pool at 0x00000000f8000000, size 32 MiB
[    0.000000] OF: reserved mem: initialized node qman-pfdr, compatible id shared-dma-pool
[    0.000000] Reserved memory: created DMA memory pool at 0x00000000fa000000, size 16 MiB
[    0.000000] OF: reserved mem: initialized node bman-fbpr, compatible id shared-dma-pool
[    0.000000] cma: Reserved 32 MiB at 0x00000000f6000000
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] NUMA: Faking a node at [mem 0x0000000080000000-0x00000000fbdfffff]
[    0.000000] NUMA: NODE_DATA [mem 0xfb618100-0xfb619fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000080000000-0x00000000bfffffff]
[    0.000000]   DMA32    [mem 0x00000000c0000000-0x00000000fbdfffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x00000000f7ffffff]
[    0.000000]   node   0: [mem 0x00000000fb000000-0x00000000fb7fffff]
[    0.000000]   node   0: [mem 0x00000000fbc00000-0x00000000fbdfffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000fbdfffff]
[    0.000000] On node 0 totalpages: 494080
[    0.000000]   DMA zone: 4096 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 262144 pages, LIFO batch:63
[    0.000000]   DMA32 zone: 3832 pages used for memmap
[    0.000000]   DMA32 zone: 231936 pages, LIFO batch:63
[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.1 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.1
[    0.000000] percpu: Embedded 30 pages/cpu s84952 r8192 d29736 u122880
[    0.000000] pcpu-alloc: s84952 r8192 d29736 u122880 alloc=30*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] CPU features: detected: ARM erratum 845719
[    0.000000] Speculative Store Bypass Disable mitigation not required
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 486152
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio,0x21c0500 root=PARTUUID=333ee940-c1d6-4688-a52f-361cb3faf733 rw rootflags=subvol=root_rw rootwait
[    0.000000] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.000000] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] software IO TLB: mapped [mem 0xbbfff000-0xbffff000] (64MB)
[    0.000000] Memory: 1823444K/1976320K available (9212K kernel code, 1040K rwdata, 2424K rodata, 640K init, 388K bss, 120108K reserved, 32768K cma-reserved)
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000] rcu:     RCU event tracing is enabled.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[    0.000000]  Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] GIC: Adjusting CPU interface base to 0x000000000142f000
[    0.000000] GIC: Using split EOI/Deactivate mode
[    0.000000] random: get_random_bytes called from start_kernel+0x300/0x424 with crng_init=0
[    0.000000] arch_timer: Enabling global workaround for Freescale erratum a005858
[    0.000000] arch_timer: CPU0: Trapping CNTVCT access
[    0.000000] arch_timer: cp15 timer(s) running at 25.00MHz (phys).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x5c40939b5, max_idle_ns: 440795202646 ns
[    0.000002] sched_clock: 56 bits at 25MHz, resolution 40ns, wraps every 4398046511100ns
[    0.008417] Console: colour dummy device 80x25
[    0.012926] Calibrating delay loop (skipped), value calculated using timer frequency.. 50.00 BogoMIPS (lpj=100000)
[    0.023334] pid_max: default: 32768 minimum: 301
[    0.028059] LSM: Security Framework initializing
[    0.032779] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.040221] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.080146] rcu: Hierarchical SRCU implementation.
[    0.093115] EFI services will not be available.
[    0.105685] smp: Bringing up secondary CPUs ...
[    0.142473] Detected VIPT I-cache on CPU1
[    0.142500] arch_timer: CPU1: Trapping CNTVCT access
[    0.142505] CPU1: Booted secondary processor 0x0000000001 [0x410fd034]
[    0.174484] Detected VIPT I-cache on CPU2
[    0.174499] arch_timer: CPU2: Trapping CNTVCT access
[    0.174504] CPU2: Booted secondary processor 0x0000000002 [0x410fd034]
[    0.206525] Detected VIPT I-cache on CPU3
[    0.206540] arch_timer: CPU3: Trapping CNTVCT access
[    0.206545] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
[    0.206598] smp: Brought up 1 node, 4 CPUs
[    0.257426] SMP: Total of 4 processors activated.
[    0.262155] CPU features: detected: 32-bit EL0 Support
[    0.267330] CPU features: detected: CRC32 instructions
[    0.279013] CPU: All CPU(s) started at EL2
[    0.283141] alternatives: patching kernel code
[    0.288575] devtmpfs: initialized
[    0.295131] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.304933] futex hash table entries: 1024 (order: 4, 65536 bytes, linear)
[    0.312576] xor: measuring software checksum speed
[    0.357428]    8regs     :  3166.000 MB/sec
[    0.401663]    32regs    :  3632.000 MB/sec
[    0.445899]    arm64_neon:  2959.000 MB/sec
[    0.450102] xor: using function: 32regs (3632.000 MB/sec)
[    0.455848] thermal_sys: Registered thermal governor 'step_wise'
[    0.456221] DMI not present or invalid.
[    0.466266] NET: Registered protocol family 16
[    0.479436] DMA: preallocated 256 KiB pool for atomic allocations
[    0.485567] audit: initializing netlink subsys (disabled)
[    0.491085] audit: type=2000 audit(0.432:1): state=initialized audit_enabled=0 res=1
[    0.491478] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[    0.505737] ASID allocator initialised with 65536 entries
[    0.512383] Machine: LS1043A
[    0.516498] SoC family: QorIQ LS1043A
[    0.520171] SoC ID: svr:0x87920011, Revision: 1.1
[    0.536676] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.543417] HugeTLB registered 32.0 MiB page size, pre-allocated 0 pages
[    0.550153] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.556888] HugeTLB registered 64.0 KiB page size, pre-allocated 0 pages
[    0.565980] cryptd: max_cpu_qlen set to 1000
[    0.638880] raid6: neonx8   gen()  2833 MB/s
[    0.710938] raid6: neonx8   xor()  2105 MB/s
[    0.783002] raid6: neonx4   gen()  2884 MB/s
[    0.855059] raid6: neonx4   xor()  2084 MB/s
[    0.927133] raid6: neonx2   gen()  2744 MB/s
[    0.999487] raid6: neonx2   xor()  1911 MB/s
[    1.071852] raid6: neonx1   gen()  2386 MB/s
[    1.144205] raid6: neonx1   xor()  1641 MB/s
[    1.216563] raid6: int64x8  gen()  2036 MB/s
[    1.288916] raid6: int64x8  xor()  1091 MB/s
[    1.361279] raid6: int64x4  gen()  2223 MB/s
[    1.433632] raid6: int64x4  xor()  1122 MB/s
[    1.505993] raid6: int64x2  gen()  1890 MB/s
[    1.578352] raid6: int64x2  xor()   999 MB/s
[    1.650725] raid6: int64x1  gen()  1593 MB/s
[    1.723083] raid6: int64x1  xor()   794 MB/s
[    1.727372] raid6: using algorithm neonx4 gen() 2884 MB/s
[    1.732797] raid6: .... xor() 2084 MB/s, rmw enabled
[    1.737786] raid6: using neon recovery algorithm
[    1.744008] iommu: Default domain type: Translated 
[    1.748997] vgaarb: loaded
[    1.751906] SCSI subsystem initialized
[    1.755786] usbcore: registered new interface driver usbfs
[    1.761326] usbcore: registered new interface driver hub
[    1.766711] usbcore: registered new device driver usb
[    1.772040] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported
[    1.779525] i2c i2c-0: IMX I2C adapter registered
[    1.784285] i2c i2c-0: using dma0chan16 (tx) and dma0chan17 (rx) for DMA transfers
[    1.791998] imx-i2c 2190000.i2c: can't get pinctrl, bus recovery not supported
[    1.799328] i2c i2c-1: IMX I2C adapter registered
[    1.804692] clocksource: Switched to clocksource arch_sys_counter
[    1.866917] NET: Registered protocol family 2
[    1.871683] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes, linear)
[    1.880275] TCP established hash table entries: 16384 (order: 5, 131072 bytes, linear)
[    1.888314] TCP bind hash table entries: 16384 (order: 6, 262144 bytes, linear)
[    1.895857] TCP: Hash tables configured (established 16384 bind 16384)
[    1.902506] UDP hash table entries: 1024 (order: 3, 32768 bytes, linear)
[    1.909291] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes, linear)
[    1.916594] NET: Registered protocol family 1
[    1.921000] PCI: CLS 0 bytes, default 64
[    1.925445] hw perfevents: enabled with armv8_pmuv3 PMU driver, 7 counters available
[    1.938314] workingset: timestamp_bits=44 max_order=19 bucket_order=0
[    1.945278] fuse: init (API version 7.31)
[    1.949508] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 250)
[    1.957031] io scheduler mq-deadline registered
[    1.961590] io scheduler kyber registered
[    1.966582] layerscape-pcie 3400000.pcie: host bridge /soc/pcie@3400000 ranges:
[    1.973963] layerscape-pcie 3400000.pcie:       IO 0x4000010000..0x400001ffff -> 0x0000000000
[    1.982548] layerscape-pcie 3400000.pcie:      MEM 0x4040000000..0x407fffffff -> 0x0040000000
[    1.991228] layerscape-pcie 3400000.pcie: PCI host bridge to bus 0000:00
[    1.997970] pci_bus 0000:00: root bus resource [bus 00-ff]
[    2.003486] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    2.009701] pci_bus 0000:00: root bus resource [mem 0x4040000000-0x407fffffff] (bus address [0x40000000-0x7fffffff])
[    2.020299] pci 0000:00:00.0: [1957:8080] type 01 class 0x060400
[    2.026360] pci 0000:00:00.0: reg 0x38: [mem 0x4040000000-0x40400007ff pref]
[    2.033489] pci 0000:00:00.0: supports D1 D2
[    2.037781] pci 0000:00:00.0: PME# supported from D0 D1 D2 D3hot
[    2.045061] pci 0000:01:00.0: [1ac1:089a] type 00 class 0x0000ff
[    2.051179] pci 0000:01:00.0: reg 0x10: [mem 0x4040100000-0x4040103fff 64bit pref]
[    2.058823] pci 0000:01:00.0: reg 0x18: [mem 0x4040200000-0x40402fffff 64bit pref]
[    2.077744] pci 0000:00:00.0: BAR 9: assigned [mem 0x4040000000-0x40401fffff 64bit pref]
[    2.085887] pci 0000:00:00.0: BAR 6: assigned [mem 0x4040200000-0x40402007ff pref]
[    2.093504] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[    2.098762] pci 0000:00:00.0:   bridge window [mem 0x4040000000-0x40401fffff 64bit pref]
[    2.115405] bman_portal 508000000.bman-portal: Portal initialised, cpu 0
[    2.122223] bman_portal 508010000.bman-portal: Portal initialised, cpu 1
[    2.129042] bman_portal 508020000.bman-portal: Portal initialised, cpu 2
[    2.135858] bman_portal 508030000.bman-portal: Portal initialised, cpu 3
[    2.143061] qman_portal 500000000.qman-portal: Portal initialised, cpu 0
[    2.149885] qman_portal 500010000.qman-portal: Portal initialised, cpu 1
[    2.156716] qman_portal 500020000.qman-portal: Portal initialised, cpu 2
[    2.163913] qman_portal 500030000.qman-portal: Portal initialised, cpu 3
[    2.197786] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    2.205333] printk: console [ttyS0] disabled
[    2.209688] 21c0500.serial: ttyS0 at MMIO 0x21c0500 (irq = 38, base_baud = 25000000) is a 16550A
[    2.218549] printk: console [ttyS0] enabled
[    2.226924] printk: bootconsole [uart8250] disabled
[    2.237184] 21c0600.serial: ttyS1 at MMIO 0x21c0600 (irq = 38, base_baud = 25000000) is a 16550A
[    2.246414] 21d0500.serial: ttyS2 at MMIO 0x21d0500 (irq = 39, base_baud = 25000000) is a 16550A
[    2.255648] 21d0600.serial: ttyS3 at MMIO 0x21d0600 (irq = 39, base_baud = 25000000) is a 16550A
[    2.271105] loop: module loaded
[    2.274951] libphy: Fixed MDIO Bus: probed
[    2.281049] libphy: Freescale XGMAC MDIO Bus: probed
[    2.286455] libphy: Freescale XGMAC MDIO Bus: probed
[    2.293108] libphy: Freescale XGMAC MDIO Bus: probed
[    2.299136] libphy: Freescale XGMAC MDIO Bus: probed
[    2.305032] libphy: Freescale XGMAC MDIO Bus: probed
[    2.310906] libphy: Freescale XGMAC MDIO Bus: probed
[    2.316945] libphy: Freescale XGMAC MDIO Bus: probed
[    2.322959] libphy: Freescale XGMAC MDIO Bus: probed
[    2.328860] libphy: Freescale XGMAC MDIO Bus: probed
[    2.348867] fsl_dpaa_mac 1ae0000.ethernet: of_get_mac_address(/soc/fman@1a00000/ethernet@e0000) failed
[    2.358180] fsl_dpaa_mac: probe of 1ae0000.ethernet failed with error -22
[    2.365062] fsl_dpaa_mac 1ae2000.ethernet: of_get_mac_address(/soc/fman@1a00000/ethernet@e2000) failed
[    2.374370] fsl_dpaa_mac: probe of 1ae2000.ethernet failed with error -22
[    2.381263] fsl_dpaa_mac 1ae8000.ethernet: of_get_mac_address(/soc/fman@1a00000/ethernet@e8000) failed
[    2.390571] fsl_dpaa_mac: probe of 1ae8000.ethernet failed with error -22
[    2.397455] fsl_dpaa_mac 1aea000.ethernet: of_get_mac_address(/soc/fman@1a00000/ethernet@ea000) failed
[    2.406763] fsl_dpaa_mac: probe of 1aea000.ethernet failed with error -22
[    2.413751] dwc3 2f00000.usb3: Failed to get clk 'ref': -2
[    2.419477] dwc3 3000000.usb3: Failed to get clk 'ref': -2
[    2.425170] dwc3 3100000.usb3: Failed to get clk 'ref': -2
[    2.431764] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    2.438312] ehci-pci: EHCI PCI platform driver
[    2.443119] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[    2.448613] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[    2.456362] xhci-hcd xhci-hcd.0.auto: hcc params 0x0220f66d hci version 0x100 quirks 0x0000000002010010
[    2.465778] xhci-hcd xhci-hcd.0.auto: irq 46, io mem 0x02f00000
[    2.472250] hub 1-0:1.0: USB hub found
[    2.476017] hub 1-0:1.0: 1 port detected
[    2.480135] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[    2.485623] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 2
[    2.493280] xhci-hcd xhci-hcd.0.auto: Host supports USB 3.0 SuperSpeed
[    2.499840] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    2.508247] hub 2-0:1.0: USB hub found
[    2.512011] hub 2-0:1.0: 1 port detected
[    2.516162] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller
[    2.521653] xhci-hcd xhci-hcd.1.auto: new USB bus registered, assigned bus number 3
[    2.529388] xhci-hcd xhci-hcd.1.auto: hcc params 0x0220f66d hci version 0x100 quirks 0x0000000002010010
[    2.538805] xhci-hcd xhci-hcd.1.auto: irq 47, io mem 0x03000000
[    2.545191] hub 3-0:1.0: USB hub found
[    2.548958] hub 3-0:1.0: 1 port detected
[    2.553057] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller
[    2.558548] xhci-hcd xhci-hcd.1.auto: new USB bus registered, assigned bus number 4
[    2.566207] xhci-hcd xhci-hcd.1.auto: Host supports USB 3.0 SuperSpeed
[    2.572768] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
[    2.581182] hub 4-0:1.0: USB hub found
[    2.584946] hub 4-0:1.0: 1 port detected
[    2.589131] usbcore: registered new interface driver usb-storage
[    2.595226] using random self ethernet address
[    2.599667] using random host ethernet address
[    2.604378] usb0: HOST MAC 62:ba:fe:12:c5:be
[    2.608680] usb0: MAC 9e:0f:ae:7e:2d:ea
[    2.612527] g_ncm gadget: NCM Gadget
[    2.616100] g_ncm gadget: g_ncm ready
[    2.622567] rtc-pcf85063 0-0051: registered as rtc0
[    2.663391] imx2-wdt 2ad0000.wdog: timeout 60 sec (nowayout=0)
[    2.669725] qoriq_cpufreq: Freescale QorIQ CPU frequency scaling driver
[    2.676578] sdhci: Secure Digital Host Controller Interface driver
[    2.682762] sdhci: Copyright(c) Pierre Ossman
[    2.687115] sdhci-pltfm: SDHCI platform and OF driver helper
[    2.718459] mmc0: SDHCI controller on 1560000.esdhc [1560000.esdhc] using ADMA 64-bit
[    2.727039] caam 1700000.crypto: Linux CAAM Queue I/F driver initialised
[    2.733761] caam 1700000.crypto: device ID = 0x0a12060000000000 (Era 8)
[    2.740376] caam 1700000.crypto: job rings = 3, qi = 1
[    2.756296] caam algorithms registered in /proc/crypto
[    2.763614] caam 1700000.crypto: caam pkc algorithms registered in /proc/crypto
[    2.771330] caam_jr 1710000.jr: registering rng-caam
[    2.783105] caam 1700000.crypto: algorithms registered in /proc/crypto
[    2.786014] mmc0: new high speed MMC card at address 0001
[    2.795536] mmcblk0: mmc0:0001 TA2964 19.4 GiB 
[    2.800290] mmcblk0boot0: mmc0:0001 TA2964 partition 1 4.00 MiB
[    2.806479] mmcblk0boot1: mmc0:0001 TA2964 partition 2 4.00 MiB
[    2.812524] mmcblk0rpmb: mmc0:0001 TA2964 partition 3 4.00 MiB, chardev (248:0)
[    2.812744] usbcore: registered new interface driver usbhid
[    2.821615]  mmcblk0: p1 p2 p3
[    2.825396] usbhid: USB HID core driver
[    2.832340] optee: probing for conduit method.
[    2.836785] optee: api uid mismatch
[    2.840270] optee: probe of firmware:optee failed with error -22
[    2.846935] Initializing XFRM netlink socket
[    2.851281] NET: Registered protocol family 10
[    2.856279] Segment Routing with IPv6
[    2.859996] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    2.866261] NET: Registered protocol family 17
[    2.870714] NET: Registered protocol family 15
[    2.875179] Bridge firewalling registered
[    2.879420] registered taskstats version 1
[    2.883896] Btrfs loaded, crc32c=crc32c-generic
[    2.891906] rtc-pcf85063 0-0051: setting system clock to 2020-05-26T16:56:17 UTC (1590512177)
[    2.905992] fuseblk: Unknown parameter 'subvol'
[    2.911162] BTRFS: device fsid ea316be5-36b9-4d38-aaac-68f1fad778b9 devid 1 transid 12085 /dev/root scanned by swapper/0 (1)
[    2.923309] BTRFS info (device mmcblk0p3): disk space caching is enabled
[    2.930011] BTRFS info (device mmcblk0p3): has skinny extents
[    2.960842] BTRFS info (device mmcblk0p3): enabling ssd optimizations
[    2.972534] VFS: Mounted root (btrfs filesystem) on device 0:20.
[    2.978605] devtmpfs: mounted
[    2.981759] Freeing unused kernel memory: 640K
[    2.992735] Run /sbin/init as init process
[    2.996832]   with arguments:
[    2.996834]     /sbin/init
[    2.996836]   with environment:
[    2.996838]     HOME=/
[    2.996840]     TERM=linux
[    3.071323] random: fast init done
[    3.354477] systemd[1]: systemd 244 running in system mode. (-PAM -AUDIT -SELINUX -IMA -APPARMOR -SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL -XZ -LZ4 -SECCOMP +BLKID -)
[    3.376095] systemd[1]: Detected architecture arm64.
[    3.412830] systemd[1]: Set hostname to <buildroot>.
[    3.452283] systemd-gpt-auto-generator[445]: Failed to determine block device of root file system: No such file or directory
[    3.464173] systemd[440]: /usr/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
[    3.693260] systemd[1]: /usr/lib/systemd/system/ninfod.service:3: Invalid URL, ignoring: ninfod(8)
[    3.719725] systemd[1]: /usr/lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket \xe2\x86\x92.
[    3.749272] random: systemd: uninitialized urandom read (16 bytes read)
[    3.755986] systemd[1]: system-getty.slice: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
[    3.768335] systemd[1]: (This warning is only shown for the first unit using IP firewalling.)
[    3.778657] systemd[1]: Created slice system-getty.slice.
[    3.800783] random: systemd: uninitialized urandom read (16 bytes read)
[    3.808010] systemd[1]: Created slice system-serial\x2dgetty.slice.
[    3.828779] random: systemd: uninitialized urandom read (16 bytes read)
[    3.835532] systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[    3.860862] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[    3.884873] systemd[1]: Reached target Paths.
[    3.904771] systemd[1]: Reached target Remote File Systems.
[    3.924764] systemd[1]: Reached target Slices.
[    3.944774] systemd[1]: Reached target Swap.
[    3.964916] systemd[1]: Listening on initctl Compatibility Named Pipe.
[    3.989209] systemd[1]: Listening on Journal Audit Socket.
[    4.008997] systemd[1]: Listening on Journal Socket (/dev/log).
[    4.033028] systemd[1]: Listening on Journal Socket.
[    4.053101] systemd[1]: Listening on Network Service Netlink Socket.
[    4.081881] systemd[1]: Listening on udev Control Socket.
[    4.104929] systemd[1]: Listening on udev Kernel Socket.
[    4.127141] systemd[1]: Mounting Huge Pages File System...
[    4.151509] systemd[1]: Mounting POSIX Message Queue File System...
[    4.175241] systemd[1]: Mounting Kernel Debug File System...
[    4.199063] systemd[1]: Mounting Temporary Directory (/tmp)...
[    4.220946] systemd[1]: Condition check resulted in Create list of static device nodes for the current kernel being skipped.
[    4.232318] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
[    4.244319] systemd[1]: Starting Journal Service...
[    4.265286] systemd[1]: Condition check resulted in Load Kernel Modules being skipped.
[    4.275696] systemd[1]: Mounting FUSE Control File System...
[    4.299333] systemd[1]: Mounting Kernel Configuration File System...
[    4.323395] systemd[1]: Starting Remount Root and Kernel File Systems...
[    4.347174] systemd[1]: Starting Apply Kernel Variables...
[    4.371298] systemd[1]: Starting Wait Until Kernel Time Synchronized...
[    4.395088] systemd[1]: Starting Create Static Device Nodes in /dev...
[    4.419156] systemd[1]: Starting udev Coldplug all Devices...
[    4.444946] systemd[1]: Started Journal Service.
[    4.489976] BTRFS info (device mmcblk0p3): disk space caching is enabled
[    4.684796] systemd-journald[457]: Received client request to flush runtime journal.
[    5.809148] BTRFS info (device mmcblk0p3): device fsid ea316be5-36b9-4d38-aaac-68f1fad778b9 devid 1 moved old:/dev/root new:/dev/mmcblk0p3
[    6.103296] EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
[    6.111414] ext4 filesystem being mounted at /boot supports timestamps until 2038 (0x7fffffff)
[    8.984717] random: crng init done
[    8.988118] random: 7 urandom warning(s) missed due to ratelimiting

Namburger commented 4 years ago

@fgervais By any chance, can you share which device you are attaching the M.2 module to? It seems that apex is working now, since you are able to see it in /dev. I'm seeing some weird PCIe logs:

[    1.966582] layerscape-pcie 3400000.pcie: host bridge /soc/pcie@3400000 ranges:
[    1.973963] layerscape-pcie 3400000.pcie:       IO 0x4000010000..0x400001ffff -> 0x0000000000
[    1.982548] layerscape-pcie 3400000.pcie:      MEM 0x4040000000..0x407fffffff -> 0x0040000000
[    1.991228] layerscape-pcie 3400000.pcie: PCI host bridge to bus 0000:00
[    1.997970] pci_bus 0000:00: root bus resource [bus 00-ff]
[    2.003486] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    2.009701] pci_bus 0000:00: root bus resource [mem 0x4040000000-0x407fffffff] (bus address [0x40000000-0x7fffffff])
[    2.020299] pci 0000:00:00.0: [1957:8080] type 01 class 0x060400
[    2.026360] pci 0000:00:00.0: reg 0x38: [mem 0x4040000000-0x40400007ff pref]
[    2.033489] pci 0000:00:00.0: supports D1 D2
[    2.037781] pci 0000:00:00.0: PME# supported from D0 D1 D2 D3hot
[    2.045061] pci 0000:01:00.0: [1ac1:089a] type 00 class 0x0000ff
[    2.051179] pci 0000:01:00.0: reg 0x10: [mem 0x4040100000-0x4040103fff 64bit pref]
[    2.058823] pci 0000:01:00.0: reg 0x18: [mem 0x4040200000-0x40402fffff 64bit pref]
[    2.077744] pci 0000:00:00.0: BAR 9: assigned [mem 0x4040000000-0x40401fffff 64bit pref]
[    2.085887] pci 0000:00:00.0: BAR 6: assigned [mem 0x4040200000-0x40402007ff pref]
[    2.093504] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[    2.098762] pci 0000:00:00.0:   bridge window [mem 0x4040000000-0x40401fffff 64bit pref]

Can you add this to your kernel arguments: gasket.dma_bit_mask=32

fgervais commented 4 years ago

It's attached to an LS1043A processor.

It seems to fail because the driver cannot register MSI-X interrupts, since the CPU doesn't support them. The status is then set to GASKET_STATUS_LAMED, which, according to the driver source code, "is not fatal".

However, when I try to run inference, it fails at gasket_mmap_has_permissions() because the device is not GASKET_STATUS_ALIVE and I run the Python inference as a user.

If I force-allow this, I get as far as gasket_ioctl_check_permissions() and am refused again because the status is not GASKET_STATUS_ALIVE.

If I force-allow that as well, I get stuck forever here:

mmap(NULL, 1904640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff96c8d000
munmap(0xffff96e5e000, 1904640)         = 0
brk(0x14eab000)                         = 0x14eab000
brk(0x14edc000)                         = 0x14edc000
munmap(0xffff96c8d000, 1904640)         = 0
write(1, "----INFERENCE TIME----\n", 23----INFERENCE TIME----
) = 23
write(1, "Note: The first inference on Edg"..., 106Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
) = 106
clock_gettime(CLOCK_MONOTONIC, {tv_sec=85, tv_nsec=862436800}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1590443364, tv_nsec=559288720}) = 0
ioctl(5, _IOC(_IOC_WRITE, 0xdc, 0xc, 0x28), 0xffffcb59b388) = 0
ioctl(4, _IOC(_IOC_WRITE, 0x7f, 0, 0x10), 0xffffcb59bc18) = 0
ioctl(5, _IOC(_IOC_WRITE, 0xdc, 0xc, 0x28), 0xffffcb59af88) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1590443364, tv_nsec=586945720}) = 0
ioctl(5, _IOC(_IOC_WRITE, 0xdc, 0xc, 0x28), 0xffffcb59b0c8) = 0
ioctl(5, _IOC(_IOC_WRITE, 0xdc, 0xc, 0x28), 0xffffcb59b0c8) = 0
ioctl(5, _IOC(_IOC_WRITE, 0xdc, 0xc, 0x28), 0xffffcb59b038) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1590443364, tv_nsec=627863040}) = 0
futex(0xffffcb59c368, FUTEX_WAIT_PRIVATE, 0, NULL

fgervais commented 4 years ago

I tried the following, but it doesn't seem to change the behavior:

insmod gasket.ko dma_bit_mask=32

mbrooksx commented 4 years ago

Hello,

The driver itself certainly expects MSI-X interrupts. Looking at the source, you can change the configuration to DEVICE_MANAGED (https://coral.googlesource.com/linux-imx/+/refs/heads/dkms/drivers/staging/gasket/apex_driver.c#1139). I briefly tested this, but only on a system that has MSI-X (I still see it in the capabilities when running lspci -vvv).

It's worth trying this. I simply extracted the gasket-dkms deb, made that quick change, and then rebuilt the deb.
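
As an alternative to reading the lspci -vvv output by eye, the capability list in the device's PCI config space can be walked directly. The sketch below is hedged in two ways: it only reports what the endpoint advertises, which says nothing about whether the host can deliver MSI-X, and the function name `capability_ids` and the synthetic config bytes are made up for illustration. On a real system the bytes would come from a file such as /sys/bus/pci/devices/0000:01:00.0/config, where reading past the first 64 bytes typically requires root.

```python
from typing import List

PCI_STATUS = 0x06            # status register offset
PCI_STATUS_CAP_LIST = 0x10   # status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
PCI_CAP_ID_MSI = 0x05
PCI_CAP_ID_MSIX = 0x11


def capability_ids(config: bytes) -> List[int]:
    """Walk the standard PCI capability list and return the capability IDs found."""
    if len(config) < 0x40:
        raise ValueError("need at least the 64-byte standard header")
    status = int.from_bytes(config[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return []
    ids = []
    seen = set()
    ptr = config[PCI_CAPABILITY_LIST] & 0xFC
    # Stop on a null pointer, a loop, or a pointer past the bytes we were given.
    while ptr and ptr not in seen and ptr + 1 < len(config):
        seen.add(ptr)
        ids.append(config[ptr])
        ptr = config[ptr + 1] & 0xFC
    return ids


# Synthetic config space: an MSI capability at 0x40 chained to MSI-X at 0x50.
fake = bytearray(256)
fake[PCI_STATUS] = PCI_STATUS_CAP_LIST
fake[PCI_CAPABILITY_LIST] = 0x40
fake[0x40], fake[0x41] = PCI_CAP_ID_MSI, 0x50
fake[0x50], fake[0x51] = PCI_CAP_ID_MSIX, 0x00

caps = capability_ids(bytes(fake))
print("MSI:", PCI_CAP_ID_MSI in caps, "MSI-X:", PCI_CAP_ID_MSIX in caps)
```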

fgervais commented 4 years ago

@mbrooksx I gave it a try and the driver loads correctly but still freezes when trying to do an inference:

(.venv) fgervais@buildroot:~/coral/tflite/python/examples/classification$ python classify_image.py --model models/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels models/inat_bird_labels.txt --input images/parrot.jpg
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.

It stops there forever.

On the kernel side I get the following messages (I enabled a bit more dev_dbg() messages):

[   64.834615] gasket: loading out-of-tree module taints kernel.
[   64.844638] gasket: Loading apex driver version 1.1
[   64.849698] apex 0000:01:00.0: can't enable device: BAR 0 [mem 0x4040100000-0x4040103fff 64bit pref] not claimed
[   64.859894] apex 0000:01:00.0: BAR 2: assigned [mem 0x4040000000-0x40400fffff 64bit pref]
[   64.868099] apex 0000:01:00.0: BAR 0: assigned [mem 0x4040100000-0x4040103fff 64bit pref]
[   64.876317] apex 0000:01:00.0: add PCI gasket device
[   64.881283] gasket: Allocating a Gasket device, parent 0000:01:00.0.
[   64.888043] apex 0000:01:00.0: enabling device
[   64.892503] apex 0000:01:00.0: Initializing page table 0.
[   69.940725] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   82.474173] apex 0000:01:00.0: Attempting to open with tgid 573 (python) (f_mode: 0037, fmode_write: 2 is_root: 0)
[   82.484551] apex 0000:01:00.0: Current owner open count (owning tgid 0): 0.
[   82.492539] apex 0000:01:00.0: Device owner is now tgid 573
[   82.498143] apex 0000:01:00.0: New open count (owning tgid 573): 1
[   82.504391] apex 0000:01:00.0: Attempting to open with tgid 573 (python) (f_mode: 0037, fmode_write: 2 is_root: 0)
[   82.514903] apex 0000:01:00.0: Current owner open count (owning tgid 573): 1.
[   82.522068] apex 0000:01:00.0: New open count (owning tgid 573): 2
[   82.528282] apex 0000:01:00.0: Attempting to open with tgid 573 (python) (f_mode: 0037, fmode_write: 2 is_root: 0)
[   82.538635] apex 0000:01:00.0: Current owner open count (owning tgid 573): 2.
[   82.545776] apex 0000:01:00.0: New open count (owning tgid 573): 3
[   82.551989] apex 0000:01:00.0: Attempting to open with tgid 573 (python) (f_mode: 0037, fmode_write: 2 is_root: 0)
[   82.562340] apex 0000:01:00.0: Current owner open count (owning tgid 573): 3.
[   82.569477] apex 0000:01:00.0: New open count (owning tgid 573): 4
[   82.575725] apex 0000:01:00.0: Attempting to open with tgid 573 (python) (f_mode: 0037, fmode_write: 2 is_root: 0)
[   82.586077] apex 0000:01:00.0: Current owner open count (owning tgid 573): 4.
[   82.593217] apex 0000:01:00.0: New open count (owning tgid 573): 5

mbrooksx commented 4 years ago

@fgervais - yes, I verified that the driver fails with device managed. In fact, looking closer at that concept, an ISR would need to be implemented in the Apex driver, which it isn't. I'm fairly confident that it can't be implemented because there are MSI-X-dependent device registers (such as https://coral.googlesource.com/linux-imx/+/refs/heads/dkms/drivers/staging/gasket/apex_driver.c#117).

Unfortunately, MSI-X is required for using the Edge TPU with PCIe.

fgervais commented 4 years ago

@mbrooksx I'm thinking there could also be registers for MSI that are simply not defined in the driver code?

Is this something you could verify on your side?

fgervais commented 4 years ago

@mbrooksx those look particularly interesting:

APEX_BAR2_REG_KERNEL_WIRE_INT_PENDING_BIT_ARRAY = 0x48778,
APEX_BAR2_REG_KERNEL_WIRE_INT_MASK_ARRAY = 0x48780,

Is WIRE_INT an alternate interrupt I could use?

usbguru commented 3 years ago

It might be possible to rewrite the driver to use polling instead of interrupts, but would it be worth the effort to bring up a new driver so the Edge TPU can run at significantly reduced performance?
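
For illustration, polling in general looks like the sketch below: instead of sleeping until an interrupt fires, the caller repeatedly reads a "pending" register until a bit is set or a timeout expires. This is a generic pattern, not the Apex driver's design, and nothing in this thread confirms that the Edge TPU exposes a register that behaves this way. `wait_for_pending` and `FakeRegister` are hypothetical names used only here.

```python
import time
from typing import Callable


def wait_for_pending(
    read_pending: Callable[[], int],
    timeout_s: float = 1.0,
    interval_s: float = 0.001,
) -> int:
    """Poll a pending-bits register until any bit is set or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        pending = read_pending()
        if pending:
            return pending
        if time.monotonic() >= deadline:
            raise TimeoutError("no pending bits before timeout")
        time.sleep(interval_s)


class FakeRegister:
    """Stand-in for a hardware register: reads as 0 until `ready_after` reads have happened."""

    def __init__(self, ready_after: int, value: int) -> None:
        self.ready_after = ready_after
        self.value = value
        self.reads = 0

    def read(self) -> int:
        self.reads += 1
        return self.value if self.reads > self.ready_after else 0


reg = FakeRegister(ready_after=3, value=0x1)
result = wait_for_pending(reg.read)
print(hex(result), "after", reg.reads, "reads")
```

The polling interval is where the performance cost comes from: a short interval burns CPU time, while a long one adds latency to every completion.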

NeuerUser commented 3 years ago

Hello all, I find it difficult to understand what exactly the requirement is, given that I do not know much about the PCIe standard. According to the Coral-M2-dual-edgetpu-datasheet:

All systems require support for MSI-X as defined in the PCI 3.0 specification

I have two potential hosts for the dual M.2 card, but cannot find if they would work with it. These are the two:

  1. Rock Pi N10, based on RK3399Pro (with an M.2 E-key slot): According to the specs, it supports PCIe 2.1 and has support for legacy, MSI and MSI-X interrupts:

    PCIe
    One PCIe port in RK3399Pro
    Compatible with PCI Express Base Specification Revision 2.1
    Dual operation mode: Root Complex (RC) and End Point (EP)
    Maximum link width is 4, single bi-directional Link interface
    Support 2.5Gbps serial data transmission rate per lane per direction
    Support Single Physical PCI Functions in Endpoint Mode
    Support Legacy Interrupt and MSI and MSI-X interrupt

  2. HP Microserver Gen 8 with Intel Xeon E3-1220L v2: According to the HP datasheet, the supported PCIe version is either 2.0 or 3.0 depending on the CPU (Xeon should be PCIe 3.0). I can't find any info on MSI-X, though.

1 | PCIe 3.0/2.0 | x16 | x16 | Low Profile

Processor dependent. Xeon is PCIe 3.0. Celeron and Pentium are PCIe 2.0.

The Microserver would also require two adapters: a PCIe to M.2 M-key adapter and an M.2 M-key to M.2 E-key adapter (as I haven't found any direct PCIe to M.2 E-key adapter).

Can anyone help me find out if one or both of these systems could work with the Dual EdgeTPU M.2 board?

NeuerUser commented 3 years ago

OK, I did quite some research on this. In case someone else is in the same situation, here are my findings:

So the result is: the Coral-M2-dual-edgetpu will probably work in both systems, but only ever with one TPU. The second one will be missing important lanes.

The only chance to get the dual-edgetpu to work reasonably is to have a (currently nonexistent) PCIe x2 (or x4, x8, x16) to M.2 E-key adapter. While this seems to be a very simple board with probably almost no components on it, no such adapter seems to exist. :(

mbirrell66 commented 1 year ago

The only chance to get the dual-edgetpu to work reasonably is to have a (currently nonexistent) PCIe x2 (or x4, x8, x16) to M.2 E-key adapter. While this seems to be a very simple board with probably almost no components on it, no such adapter seems to exist. :(

https://github.com/magic-blue-smoke/Dual-Edge-TPU-Adapter