jimcortez opened this issue 3 years ago
@jimcortez Hey! Following a fresh reboot, do you have anything under:
ls /dev/apex*
@Namburger no, nothing shows up. When it does work I see the /dev/apex_0 device.
@jimcortez ahh, I have a feeling the issue is that the apex and gasket drivers aren't being loaded at all during boot... What shows up if you run those modinfo commands before reinstalling them?
@Namburger those commands were run before I reinstalled.
In another twist, they seem to have come up in my latest reboot, which was a normal reboot sequence and I had not re-installed. I am beginning to think that there may be some race condition that prevents the devices from coming up, but only in certain circumstances. I will keep checking on every reboot and try to dump dmesg if it happens again.
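For reference, next time it happens I'll probably pull the previous boot's kernel messages with something like this (assuming journald is keeping persistent logs on this machine):

```
# -k: kernel messages only; -b -1: the previous boot
journalctl -k -b -1 | grep -Ei 'apex|gasket'
```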
That's awesome lol
The best problems are problems that solve themselves. The worst problems are problems that solve themselves... sometimes.
Let's see what you got :)
Actually, it disappeared on reboot again. I have trawled through all the system logs and can't find any errors or anything else that indicates a problem. Comparing the working boots to the non-working boots, I notice that on working boots gasket and apex both produce log entries, with apex logging that it found a device. When the device fails to come up, there are no references to apex or gasket in the system logs at all.
What would actually be helpful for me to provide here?
@mbrooksx what do you think?
@jimcortez: Can you provide a complete dmesg for both the failed and successful cases? If you're not comfortable with that, how about just the PCI-related lines (something like dmesg | grep -Ei 'pci|apex')? It sounds to me like we're not even reaching the Apex probe, which indicates (as you see in lspci) that the upstream PCIe device doesn't even see the TPU. Perhaps there is something in the log explaining why that specific PCI bus is encountering issues.
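For example, saving the filtered log to a file for each case so the two boots can be diffed (the filename is just a suggestion):

```
# same filter as above, redirected to a file for comparison
dmesg | grep -Ei 'pci|apex' > dmesg-pci-apex.txt
```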
Working dmesg: https://gist.github.com/jimcortez/0460b32b6e84f2bb6d6a0b09df9b3f07
Not Working dmesg: https://gist.github.com/jimcortez/b13ddb4f20b096ece5c66a47bbd03f36
Thanks for the help @mbrooksx and @Namburger !
@mbrooksx @Namburger is there anything else I can provide?
Just as an update, I still have this issue. The only workaround so far seems to be rebooting the system repeatedly until the Coral devices show up.
When you boot and it doesn't work, what happens if you run:
sudo modprobe apex
Nothing; the command returns immediately.
Does lsmod | grep apex show the apex and gasket modules after running the modprobe command on the non-working boot?
modprobe is not very chatty unless a module is not found or there is an error; a quick and quiet return usually means it succeeded, so afterwards the apex and gasket modules should show up in lsmod | grep apex.
Presumably, once the apex module is loaded, /dev/apex_0 should appear.
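In other words, on a broken boot, something like this (a sketch):

```
sudo modprobe apex && echo "modprobe succeeded"   # the exit status is the real signal
lsmod | grep -e apex -e gasket                    # both modules should now be listed
```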
After a fresh boot just now:
```
$ ls /dev/apex*
zsh: no matches found: /dev/apex*
$ lsmod | grep apex
$ sudo modprobe apex
$ lsmod | grep apex
apex                   28672  0
gasket                110592  1 apex
$ ls /dev/apex*
zsh: no matches found: /dev/apex*
```
Now you've gotten above my pay grade here, but it seems that when it doesn't work on reboot, the apex module is not being loaded. The modprobe command fixes that, but /dev/apex_0 still doesn't get created; here I suspect something is wrong with udev.
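One thing that might be worth trying in that broken state, just to rule udev in or out (standard udevadm commands; a sketch):

```
sudo udevadm control --reload-rules   # re-read everything under /etc/udev/rules.d
sudo udevadm trigger                  # replay kernel device events against the rules
```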
Sorry I can't be of any more help. Later tonight I'll poke around a bit in my 20.04 system and see if anything jumps out at me for you to check.
Do you have a /etc/udev/rules.d/65-apex.rules file? Have you added the apex group? I believe your login user needs to be a member of the apex and plugdev groups.
Seems everything is in order:

```
$ cat /etc/udev/rules.d/65-apex.rules
SUBSYSTEM=="apex", MODE="0660", GROUP="apex"
$ groups | tr " " "\n" | grep -e apex -e plugdev
plugdev
apex
```
Note that rebooting the device several times does get it back. It's pretty non-deterministic, sometimes the next boot is fine, sometimes it isn't. When it does show up, everything works just fine.
As I said, you've gotten above my pay grade here. I'd suspect some race condition in the systemd start-up.
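One quick check (standard systemd commands) would be whether systemd's module loader actually ran and what it reported:

```
systemctl status systemd-modules-load.service
journalctl -b -u systemd-modules-load.service   # this boot's log for that unit
```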
I'm not a fan of systemd; when it works I don't care, but when it doesn't, good luck!
The udev rule provided by Coral just sets the permissions of the apex_* character device so a non-root user can use it; the issue you're experiencing is unlikely to be related to the udev config.
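On a boot where the node does exist, the rule's effect is just the ownership and mode, which you can confirm with something like:

```
ls -l /dev/apex_0   # should show mode crw-rw---- with group "apex"
```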
It sounds like there may be a race condition when loading the drivers after reboot, or the configuration gets reverted on a subsequent boot. I'm assuming the gasket module needs to be loaded before the apex module. How about explicitly defining the modules to load, in the required order, in /etc/modules-load.d/modules.conf or equivalent?
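Something like this, for example (the filename is arbitrary; systemd-modules-load reads every *.conf in that directory at boot and loads the listed modules in order):

```
# /etc/modules-load.d/coral.conf -- hypothetical filename, any *.conf works
gasket
apex
```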
Hey, any update on this? I have the exact same problem :(
Hi there! I followed the instructions here: https://coral.ai/docs/m2/get-started/ and was able to get the modules installed and used them for some of the samples. However, after a reboot, the devices no longer show up until I do the install cycle again. After every subsequent reboot, the devices disappear again! It's possible that I have misconfigured something, but I can't find any system logs that point to a problem.
My steps, on a fairly untouched Ubuntu 20.04 install:
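For reference, the install cycle I repeat is essentially the one from that page (the commands below are the ones the guide lists; check the linked page for the current versions):

```
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" \
  | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gasket-dkms libedgetpu1-std
```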
I am using this adapter card to use the TPU in a desktop system: https://www.amazon.com/gp/product/B07JBCL1CJ
On a post-reboot system, sudo lspci -vvv shows:

```
$ sudo lspci -vvv
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
	Subsystem: ASUSTeK Computer Inc. Starship/Matisse Root Complex
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
```