dakota opened this issue 3 years ago
@dakota we don't actually support virtual machines. I would first try to install it on your host system to remove any layer of indirection.
Works fine on the host system with Ubuntu installed. So definitely something that ESXi is doing (or not doing). Any suggestions on where to look? Or do I just scratch the whole idea :/
I can corroborate the same results here. Same Coral and same adapter. There are probably many people looking into this, since the USB versions are difficult to find.
Could you shed some light on what the error messages from apex mean with respect to resource management? I'm sure with enough hints we can figure out how to tweak ESXi.
Same error messages here with ESXi 7.0 Update 1 and Debian 10 on a Supermicro SYS-E300-8D. Is there anything we can try to figure this out?
Also trying to work through this.
I got this far, but as mentioned in the OP I don't have /dev/apex_0 with the same dmesg you have above.
lspci -nn | grep 089a
0b:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
Out of curiosity, are you guys running this?
esxcli system settings kernel set -s vga -v FALSE
Someone mentioned, either in this thread or a different one, that ESXi was displaying the Coral as Global Unicorp via one pci-usb adapter, but when using a different adapter it showed as Google, and then it worked.
So this may be:
- totally irrelevant
- specific to usb vs pci/m2
- or a clue that the apex driver is looking for a specific identifier, not seeing it, and failing to load
Is there a way to hack this identifier in ESXi? Can we change the driver to not look for an exact string?
Grasping for anything...
Also grasping. I spent the morning looking at what people were doing to get GPU passthrough to work, hoping there would be some overlap.
Do you have any info on the adapter? I would be curious to see what his device looks like in ESXi, because mine is just showing as
<class> Non-VGA unclassified device
I have the M.2 module plugged in directly to the M.2 slot on my Supermicro board.
This is how it's listed in ESXi:
Looking at the lspci -vv output in the guest VM, I can see that MSI and MSI-X are not enabled.
0c:00.0 System peripheral: Device 1ac1:089a (prog-if ff)
Subsystem: Device 1ac1:089a
Physical Slot: 193
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 19
Region 0: Memory at e7afc000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at e7900000 (64-bit, prefetchable) [size=1M]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x32, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x32, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
Vector table: BAR=2 offset=00046800
PBA: BAR=2 offset=00046068
Capabilities: [e0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [108 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [110 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [200 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel modules: apex
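As a side note, here is a quick way to pull just the MSI/MSI-X enable bits out of a dump like that (a sketch; 0c:00.0 is the guest-side address from above, substitute your own):
# 'Enable-' in the output means the capability is present but not enabled
sudo lspci -vv -s 0c:00.0 | grep -E 'MSI(-X)?: Enable'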
My dmesg output shows these errors:
sudo dmesg |grep apex
[ 6.501952] apex 0000:0c:00.0: Page table init timed out
[ 6.502375] apex 0000:0c:00.0: MSI-X table init timed out
[ 6.503938] apex: probe of 0000:0c:00.0 failed with error -110
And I do get some logs related to PCI passthrough in my vmware.log. Looks to me like this could be the cause of the timeouts.
2021-03-06T11:51:38.247Z| vcpu-0| I005: PCIPassthru: Attempted to program PCI cacheline size 32 not a power of 2 factor of original physical 64 for device 0000:0a:00.0
2021-03-06T11:51:38.248Z| vcpu-0| W003: PCIPassthruHandleCapabilities: Ignoring write to AER register 0x4
2021-03-06T11:51:38.249Z| vcpu-0| W003: PCIPassthruHandleCapabilities: Ignoring write to AER register 0x1
Is this a driver issue or am I missing some other setting?
That would be a good thing to know. I have had other devices that didn't work on ESXi but passed through fine, so this seems odd.
I tried a few things today with no luck.
I updated passthru.map with the device info, then added hypervisor.cpu_0 FALSE (supposed to trick Ubuntu into thinking it's not a VM) and reinstalled the TPU packages after this too. I am not sure what's next. I think it's something to do with the passthrough, because people are saying they can't pass through the USB Coral directly but can pass through a USB hub with the Coral attached. I might see if I can get a USB adapter for the SSD adapter I got and try that.
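For anyone repeating those two changes, here is roughly what they look like. This is a sketch, not verified on this device: the passthru.map line assumes the Coral's IDs from the lspci output above (1ac1:089a) and the common d3d0 reset method, and the .vmx option is usually documented as hypervisor.cpuid.v0 rather than hypervisor.cpu_0.
# /etc/vmware/passthru.map on the ESXi host
# vendor-id  device-id  resetMethod  fptShareable
1ac1  089a  d3d0  false
# .vmx entry commonly cited for hiding the hypervisor from the guest
hypervisor.cpuid.v0 = "FALSE"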
I also tested a fresh install with UEFI instead of legacy BIOS. Same problem. The only difference is this first extra line:
[ 4.972126] apex 0000:13:00.0: enabling device (0000 -> 0002)
[ 5.041622] apex 0000:13:00.0: Page table init timed out
[ 5.042078] apex 0000:13:00.0: MSI-X table init timed out
[ 5.045108] apex: probe of 0000:13:00.0 failed with error -110
Additionally, setting pciPassthru.use64bitMMIO = "TRUE" or pciPassthru.64bitMMIOSizeGB = "128" in the .vmx file, as mentioned in this KB https://kb.vmware.com/s/article/2142307, doesn't make a difference.
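For reference, those KB settings go into the VM's .vmx file like this (neither changed anything here):
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"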
Hm, interesting. Well, I ordered a USB 3 card and an enclosure; will update if that works. Update: it's a no-go. Unlike the PCI cards and everything else, the USB enclosure has firmware in it that tells the OS it's a storage drive, so back to square one. Not sure what I will try next.
Anyone have any luck? I'm stuck at the same spot. Using an Ableconn adapter, I'm able to pass the PCI device through and see it in the Ubuntu VM, but the apex device isn't created.
I couldn't get it to work and have instead resorted to a small standalone computer with the Coral in instead of in a VM.
Which did you go with? I also did the same, went with a Jetson Nano
An old AMD Athlon 5150 based computer that I had lying around. Busy shopping around for a Dell Optiplex SFF (or similar) with a more modern CPU.
Someone mentioned, either in this thread or a different one, that ESXi was displaying the Coral as Global Unicorp via one pci-usb adapter, but when using a different adapter it showed as Google, and then it worked.
I don't think this is it since it displays exactly the same on my standalone AMD Athlon based system.
I fixed my issues... first comment on this thread.
Running on an HP ProLiant Gen8, Proxmox passing through to Ubuntu, but I had it working with HassOS as well.
I was not successful with ESXi and migrated hypervisors because of it.
There was also some HP-specific bug I had to resolve. Getting this working was a deep, dark rabbit hole, but it's working great now with Frigate.
Here's the HP-specific bit:
I also just found this guide https://www.reddit.com/r/Proxmox/comments/n34f8q/proxmox_vm_ubuntu_2004_frigate_2x_google_coral_tpu/
It seems like the root of the issue is that ESXi's implementation of MSI-X is broken, or at least causes issues with the Apex driver.
@jayburkard I am actually just looking at migrating as well because of all of this. My Jetson Nano isn't working nearly as well as I thought it would, so I would rather use it for something specific to its use case and have this in my main Docker setup. Will probably do the migration this weekend. Luckily it's only two VMs for me, so it shouldn't take too long. @dakota is there a similar process that those of us on ESXi can do?
Same here, 2 VMs (had 3 but migrated HassOS to run supervised in Ubuntu - every time I provisioned >1TB for it, HassOS would crash or not boot; known bug with no resolution).
I think you can absolutely get the hardware working on Proxmox. I had to dig to find all that, but now that it's working it's been flawless, and if you have anything but a ProLiant Gen8 you should have way fewer steps :)
I wanted ESXi to keep my skills up for industry, but I get plenty of experience at work and really like Proxmox.
@dakota is there a similar process that those of us on ESXi can do?
No idea, I gave up on trying to get it working on ESXi and rather got a cheap dedicated computer for the device.
@dakota we don't actually support virtual machines. I would first try to install it on your host system to remove any layer of indirection.
Then support it?
It's one of the most common use cases in the industry; put it on your roadmap or something.
I see this is still getting activity, so I'm just gonna post my solution: I moved over to Proxmox, since there is a posted solution for it.
@darkalfx this isn't a Coral thing though; this is an issue with how hypervisors deal with this hardware.
Is there any update on this issue and where it is on the roadmap to be fixed?
Same problem here with PCI passthrough for the m.2 flavour of Coral, on ESXi 7.0 Update 2 + Supermicro H11SSL-I + EPYC 7551, running an Ubuntu 20.04.3 VM.
The device is detected correctly:
# lspci | grep 1b:00
1b:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
But the apex module is not happy:
# dmesg -T | grep 1b:00
pci 0000:1b:00.0: [1ac1:089a] type 00 class 0x0000ff
pci 0000:1b:00.0: reg 0x10: [mem 0xcfffc000-0xcfffffff 64bit pref]
pci 0000:1b:00.0: reg 0x18: [mem 0xcfe00000-0xcfefffff 64bit pref]
apex 0000:1b:00.0: Page table init timed out
apex 0000:1b:00.0: MSI-X table init timed out
apex: probe of 0000:1b:00.0 failed with error -110
Using various kernel options, like pcie_aspm=off gasket.dma_bit_mask=32 pci=nocrs, as well as pciPassthru.use64bitMMIO=TRUE on the VMware side, does not make any difference.
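(For anyone repeating this, the guest-side kernel options above would typically be set via GRUB; a sketch, assuming a Debian/Ubuntu guest:)
# /etc/default/grub inside the guest VM
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off gasket.dma_bit_mask=32 pci=nocrs"
# apply the change and reboot
sudo update-grub && sudo reboot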
No problem with PCI passthrough for any other devices in my setup.
Getting this to work properly under ESXi would be legendary, as more and more people are getting the m.2 version of Coral, since it is the only thing in stock (although the USB flavour is tricky to virtualize too, due to this entire VID/PID-changes-on-the-fly drama...)
Has anyone tried any other OS besides Ubuntu? This weekend I am gonna try a Windows VM 🤷‍♂️
I'm experiencing the same issue. I've been pulling my hair out for the last several hours thinking I was missing something, but I guess this is a known issue. Doing passthrough via ESXi 7 just does not work! It's a shame to have to invest in another dedicated server just to use this :(
Unfortunately I had to migrate some of my servers to Proxmox :(
I migrated my one server to Proxmox as well. It wasn't as bad as I was expecting, but I was only running UnRaid and a VM for Docker.
I have exactly the same problem. Same behavior, same errors. Proxmox may be a workaround but not a solution to this problem.
Are there any updates?
Has anyone tried any other OS besides Ubuntu? This weekend I am gonna try a Windows VM 🤷‍♂️
Yes, I have tested Ubuntu, Debian, FreeBSD (TrueNAS) and Windows (also different versions); it's the same everywhere...
Following this... +1 for ESXi passthrough support / workaround.
This is of significant interest to the homelab community right now, hopefully someone like @lamw might be able to socialise internally with VMware engineering to see if there are any workarounds/fixes that would allow people to stay on ESXi.
Hi @smallsam - appreciate you making me aware of this. Since the PCIe device isn't on VMware's HCL (supported HW devices), it's really hard to say what the issue could be, and not having the device on hand would make it difficult to troubleshoot. It sounds like the USB-based version does work as expected, just not the PCIe device?
Since you mentioned "significant interest", could you help me understand the underlying use case for using this particular device versus another? This isn't a space I'm familiar with. What is the demand that you see, and is there a particular version of ESXi you're looking to use? Also, could you help me understand how the Coral Edge TPU software is being used? Is this for development purposes, testing, running workloads, etc.? Anything that you or others can chime in with can at least help build a potential business case, but because this isn't a supported device, there's not much I can do at the current moment to engage with Engineering.
Unfortunately the device in question violates the PCI specification by mapping the PBA, the MSI-X vector table, and other registers into the same 4 KB page (the PBA is at 0x46068, the vector table at 0x46800, but there is a bunch of other registers in the 0x46XXX range). PCIe spec 6.0, page 1020, has this to say:
<quote>
If a Base Address Register or entry in the Enhanced Allocation capability that maps address space for the MSI-X Table or
MSI-X PBA also maps other usable address space that is not associated with MSI-X structures, locations (e.g., for CSRs)
used in the other address space must not share any naturally aligned 4-KB address range with one where either MSI-X
structure resides. This allows system software where applicable to use different processor attributes for MSI-X structures
and the other address space. (Some processor architectures do not support having different processor attributes
associated with the same naturally aligned 4-KB physical address range.) The MSI-X Table and MSI-X PBA are permitted
to co-reside within a naturally aligned 4-KB address range, though they must not overlap with each other.
</quote>
So having CSR registers in the same page as the MSI-X vector table violates the spec, and under ESXi the CSR registers become unreachable (writes are ignored, reads return zeroes). Because of this, the device driver cannot correctly initialize the device.
If firmware can modify the device's behavior so that the VT/PBA arrays do not share the same 4 KB page with other registers, the device will work with ESXi's passthrough. Or if firmware can hide the MSI-X capability from PCI configuration space, that would fix the issue as well.
If such a change is not possible, then the device can't be used with ESXi fixed passthrough at this moment.
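The page sharing is visible directly from the offsets in the lspci dump earlier in this thread (MSI-X table at BAR2 offset 0x46800, PBA at BAR2 offset 0x46068) - a quick sanity check with shell arithmetic:
# Page index = offset / 4096; both MSI-X structures land in page 0x46,
# i.e. the 0x46000-0x46FFF range. Table and PBA are allowed to co-reside,
# but the driver's CSRs living in that same range is the violation.
printf 'MSI-X table page: 0x%x\nMSI-X PBA page:   0x%x\n' $((0x46800 / 4096)) $((0x46068 / 4096))
# both print 0x46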
Who would be able to determine if these registers can be modified/hidden via firmware? Still hoping for a solution here - and with the Coral USB shortage I assume there's a good bunch of people who would benefit from Coral PCIe TPU + ESXi working together!
The TPU is used with Frigate NVR for AI object detection / classification https://github.com/blakeblackshear/frigate
There's tight integration with Home Assistant https://github.com/blakeblackshear/frigate-hass-integration
Has to be near and dear to your homelab heart @lamw :)
I am also experiencing the same issue. Has there been any movement on getting the issue resolved on either the VMware side or the Coral device side?
I am also experiencing the same issue, trying to get it to work with VMware ESXi 8.0.
Ah crap. I'm another +1 on this. Shame I didn't see this post before I found the error apex: probe of 0000:04:00.0 failed with error -110. Looks like I will have to bite the bullet and migrate my VMs to Proxmox.
Yet another +1. Did not see this before I bought a few M.2 Corals. Would really like to get this running on my servers. Migration is not an option due to size.
Reading this post about the USB version of this device, https://williamlam.com/2023/05/google-coral-usb-edge-tpu-accelerator-on-esxi.html (big thanks to @lamw for this work!), makes me wonder whether a similar firmware update process is happening with the PCIe device. If so, perhaps the resulting (hot-plugged PCIe?!) device would be conformant for passthrough, as @petr-vmware-com suggests above. Indeed, perhaps this is why "it works" on Proxmox: the device might have had its firmware uploaded by the host OS. I no longer have access to the hardware to investigate this train of thought, but perhaps it is something to look at?
@smallsam possibly ... If someone can get me the device, happy to take a look when I get some time
@lamw I'll happily ship you the TPU if you'll send it back when you're done! Let me know when you'll have time (a mystical thing of which I have sadly little!)
@k1n6b0b sure. I don't think GH allows DMs, but you can use our HQ address with attention to William Lam
3401 Hillview Ave. Palo Alto, CA 94304
Darn. Wish I would have found this before I bought two of these m.2 TPU devices. Still had time to cancel my backorder for two more thankfully.
If someone can get me the device, happy to take a look when I get some time
@lamw did you get hold of one of these devices yet? If not I can get one to you fairly quickly (& don't need it back).
@chris20 No, I never received the device
@lamw Ok, I ordered one via Amazon (though it's a third-party seller, so the delivery won't be from them). In the delivery instructions I said "mailroom"; if it should go to the front office or reception, please let me know and I'll amend the order. You should have it on Wed or Thu this week.
(I’d have sent you one of the m.2’s I had here but I’m in Australia so this seemed easier :)
Thanks. The address is our main shipping address; no need for instructions as long as you've got my full name on the receiver.
JFYI - I've been heads-down preparing for our upcoming conference, so I won't be able to do anything until after that at the earliest.
@chris20 just want to ACK that I've received the M.2 TPU
🙌🙌🙌 Apologies I never sent my TPU; it's still deep in the Supermicro. Awesome to see someone else was able to contribute one. 🤞 you find the resolution!
I'm using the mini-PCIe version with this Ableconn adapter that has multiple reports of working. It is in a Dell PowerEdge T420 running ESXi, being passed through to an Ubuntu 20.04 VM. I followed the official guide to install the drivers.
The /dev/apex_0 device is however not showing. Any ideas? (Bunch of debug info below.)
lscpu
uname -a
dmesg | grep apex
lspci
lspci -vvv
modinfo gasket
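(The outputs themselves were attached to the issue. For anyone reproducing, a quick way to confirm whether the driver actually bound and created the device node - a sketch; substitute your device's PCI address for 0b:00.0:)
ls -l /dev/apex_0                             # missing when the probe fails
lsmod | grep -E 'apex|gasket'                 # are the modules loaded at all?
sudo lspci -vvv -s 0b:00.0 | grep -i driver   # bound devices show "Kernel driver in use: apex"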