Closed orbitz closed 2 years ago
Hi @orbitz: Even if X11 freezes, it might be still possible to switch to one of the virtual terminals by pressing Ctrl+Alt+Fn and login there. Or, if there is an SSH server running, login to the computer remotely. This may indeed be a regression in -CURRENT
: things can change back and forth here and there any time. Is not there a crash dump in these cases?
Wifibox does not control the lower part of the stack, i.e. bhyve(8)
and vmm(4)
so any instability or bug introduced here will be just suffered. Hopefully once I will have some time to establish a custom version of bhyve(8)
+ vmm(4)
to provide a more common and stable foundations, independent from what is going on in the base system.
I did not include it in my original comment, but I had no luck with switching between virtual terminals. I haven't been in a position to try to ssh in though, I will try in the future.
@orbitz is this issue still happening? Have you moved to a different version of -CURRENT
in the meantime?
Yes this is still happening. I started and stopped wifibox and then switched over to the text view to see what happens (since it takes a about 30 seconds to lock up). What I saw was, what looks like, the USB driver being loaded and unloaded, and then it hits nvme issue, and then I get kernel crash. I included an image I took once it hit the debugger.
Same issue here for me on a Thinkpad x270 on 13.1-RELEASE and 14-CURRENT. Restarting wifibox breaks for some reason the nvme controller.
Please note that doing wireless passthrough manually without wifibox doesn't trigger the issue, I mean attaching/detaching the ppt driver, restarting the vm don't cause any issue, only when using wifibox, and restarting it things break badly.
I can provide more information if required.
Hi @Alix82: Thanks for the additional info. It would be interesting to learn exactly after which command the nvme
driver gets broken. The code does not do anything special, it simply calls the commands as you mentioned. Although I have a Thinkpad X270, I do not have an NVMe disk therefore I will not be able to reproduce this.
What commands, in what order, with what parameters did you use to implement the wireless pass-through manually when there are no issues? It might even be a timing issue, i.e. wifibox
executes these commands too quickly.
I will debug the issue later today, I don't think that only the nvme driver gets broken, if you look at the screenshot attached by orbitz, you notice that also the uhub0 detached, so something at the pci association level gets broken when restarting wifibox. It could be as you said timing issue that triggers a bug in the kernel, I will debug it further and post here.
Manually I do
devctl set driver -f pci0:3:0:0 ppt start vm using bhyve
then I shutdown the VM normally using ssh on a root session, before going to suspend. I have blacklisted if_iwm and if_iwlwifi in both loader.conf and rc.conf. I don't unload the vmm driver.
It is interesting that wifibox restarts itself on resume from suspend, and all goes fine in this case! Only restarting it while the system is running triggers the issue.
After debugging the issue, It turns out that unloading the vmm module break things, which is what I wasn't doing manually.
So just by making sure that the vmm is loaded in wifibix;assert_vmm_loaded and load it if it not the case solves the issue for me!
@Alix82 what does start vm using bhyve mean in your case? Are you launching the VM shipped with wifibox
with the same set of parameters, right? Note that wifibox
stops the VM by sending a power-off signal (via SIGTERM
) first and after 3 seconds the VM is forcefully powered off by the --force-poweroff
flag of bhyvectl
.
By default, devd
is configured to execute the wifibox restart vmm
command on resume, which will unload and re-load the vmm
kernel module. Of course, this can be replaced with any other action, the restart
command supports different levels (guest
, netif
, vmm
).
Sorry, I do not understand your comment about assert_vm_loaded
. The purpose of that function is load vmm
if it is not present. For what it is worth, bhyve
will do that as well. The assert_vm_loaded
is needed only to load the right version of vmm
automatically. That is only important when the sysutils/bhyve+
port is in use, but since that is known to be unstable it is not recommended to be used on FreeBSD 13 and later.
@Alix82 what does start vm using bhyve mean in your case? Are you launching the VM shipped with
wifibox
with the same set of parameters, right? Note thatwifibox
stops the VM by sending a power-off signal (viaSIGTERM
) first and after 3 seconds the VM is forcefully powered off by the--force-poweroff
flag ofbhyvectl
.
Yes I'm using the vm shipped with wifibox, with the default parameters.
By default,
devd
is configured to execute thewifibox restart vmm
command on resume, which will unload and re-load thevmm
kernel module. Of course, this can be replaced with any other action, therestart
command supports different levels (guest
,netif
,vmm
).
yes this works fine.
Sorry, I do not understand your comment about
assert_vm_loaded
. The purpose of that function is loadvmm
if it is not present. For what it is worth,bhyve
will do that as well. Theassert_vm_loaded
is needed only to load the right version ofvmm
automatically. That is only important when thesysutils/bhyve+
port is in use, but since that is known to be unstable it is not recommended to be used on FreeBSD 13 and later.
I don't have use bhyve+, It seems that the function assert_vm_loaded in wifibox script unloads and reload the vmm module, so basically disabling that function fixes the restart issue for me.
I just modified it as follows: assert_vmm_loaded() { local _kmod_path kldload vmm return 0
obviously the above code is just for testing purpose, and that fixes the issue for me.
The assert_vmm_loaded
unloads vmm
only if the wrong version of it is loaded. The actual unloading and reloading for suspend happens in reload_vmm
, basically unconditionally. It would be interesting to learn what assert_vmm_loaded
sees, why it decides to eventually reload vmm
.
What does the command kldstat -v -n vmm | tail +2 | head -1
say? Also, could you please tell what the value is for the VMM_KO
variable in the wifibox
script (around line 18, in $LOCALBASE/sbin
)?
ali@frix270 ~> kldstat -v -n vmm | tail +2 | head -1 56 1 0xffffffff82400000 543438 vmm.ko (/boot/kernel/vmm.ko)
${VMM_KO} is vmm.ko
Thanks @Alix82! Looks like thevmm
kernel module is reloaded because the two strings (vmm.ko
vs. /boot/kernel/vmm.ko
) do not match, but that is unnecessary. It is curious why reloading itself causes so much trouble, nevertheless it should not happen at all.
Could you please change the value of VMM_KO
to /boot/kernel/vmm.ko
in the wifibox
script to test if this would fix the problem?
In my case this issue happens on wifibox stop
- is that expected? The console output looks exactly like the screen"picture" posted by @orbitz ..
...actually scratch that, it happens during startup after it's been stopped once. Sorry, ignore me, move along..
Could you please change the value of
VMM_KO
to/boot/kernel/vmm.ko
in thewifibox
script to test if this would fix the problem?
Yes that fixes the issue. IMHO, I don't think that vmm.ko should ever be unloaded when the system is running, I run, together with wifibox, other VMs as well.
If you check the code, you will see this is not about reloading the vmm
kernel module, but replacing it. The unloading and reloading happens only if the vmm.ko
is loaded from a path that is not expected. This makes sense when sysutils/bhyve+
is in use, because it deploys a customized version of vmm
under /boot/modules
, and it has to be made sure that the binaries are interacting with that one before launching the VM.
When sysutils/bhyve+
is not used (which is usually the case, as bhyve+
has been put back into the "experimental" state for the sake of better stability) this all should be a "no op". Unfortunately, it is not, and that is the bug. But now that we identified it, I can publish a proposed fix for the problem soon-ish.
Great, all clear now!
BTW, many thanks for the this very handy tool!
I have published a version of the net/wifibox-core
port at https://github.com/pgj/freebsd-wifibox-port/tree/3496d6ae1a2e345fd842f3494d7324dd469a7e98 that includes a fix. Please test it.
Sorry for the delay, the new port works for me
Okay, I have added the fix to the main
branch as f7ac871e5e965754c0db5074828b1ad9c8fd756d . This is going to be published as part of the next release.
By the way, @orbitz does this fix the problem for you?
Sorry for the late reply, but I have finally tested this and it does not freeze my system. However, it never successfully comes back for me. I can't ping nothing. When I stop wifibox I do get an issue about PPT not being successfully destroyed, I'm not sure if that is related.
@orbitz what do you mean by wifibox
not coming back for you? What do the logs say? Does the console
command work?
Here is some copy and paste from my console log as I toyed with it, let me know what else I can do:
orbitz@bsdell ~> sudo tail /var/run/wifibox/appliance/log/dmesg
[ 1.036136] ath10k_pci 0000:00:06.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 1.329688] ath10k_pci 0000:00:06.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
[ 1.329692] ath10k_pci 0000:00:06.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 0
[ 1.330037] ath10k_pci 0000:00:06.0: firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c
[ 1.402688] ath10k_pci 0000:00:06.0: board_file api 2 bmi_id N/A crc32 62812cba
[ 1.508712] ath10k_pci 0000:00:06.0: htt-ver 3.87 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
[ 1.570620] ath: EEPROM regdomain: 0x6c
[ 1.570637] ath: EEPROM indicates we should expect a direct regpair map
[ 1.570638] ath: Country alpha2 being used: 00
[ 1.570639] ath: Regpair used: 0x6c
orbitz@bsdell ~> ping www.google.com
PING www.google.com (173.194.79.105): 56 data bytes
64 bytes from 173.194.79.105: icmp_seq=0 ttl=105 time=17.653 ms
64 bytes from 173.194.79.105: icmp_seq=1 ttl=105 time=20.289 ms
64 bytes from 173.194.79.105: icmp_seq=2 ttl=105 time=13.353 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.353/17.099/20.289/2.859 ms
orbitz@bsdell ~> sudo service wifibox stop
Stopping wifibox....WARNING: The PPT device could not be destroyed.
...OK
orbitz@bsdell ~> ping www.google.com
PING www.google.com (173.194.79.147): 56 data bytes
ping: sendto: No route to host
^C
--- www.google.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
orbitz@bsdell ~> sudo service wifibox start
Starting wifibox.......OK
orbitz@bsdell ~> ping www.google.com
ping: UDP connect: No route to host
orbitz@bsdell ~> wpa_cli status
Selected interface 'wlan0'
'STATUS' command timed out.
orbitz@bsdell ~> sudo tail /var/run/wifibox/appliance/log/dmesg
[ 0.805364] BOOT_IMAGE=(host)/usr/local/share/wifibox/vmlinuz
[ 0.805364] modules=squashfs
[ 1.064464] ath10k_pci 0000:00:06.0: can't derive routing for PCI INT A
[ 1.064467] ath10k_pci 0000:00:06.0: PCI INT A: no GSI
[ 1.064885] ath10k_pci 0000:00:06.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 1.314228] ath10k_pci 0000:00:06.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
[ 1.314232] ath10k_pci 0000:00:06.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 0
[ 1.314579] ath10k_pci 0000:00:06.0: firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c
[ 1.385462] ath10k_pci 0000:00:06.0: board_file api 2 bmi_id N/A crc32 62812cba
[ 1.493510] ath10k_pci 0000:00:06.0: htt-ver 3.87 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
Inside wifibox:
wifibox:~# ping www.google.com
ping: bad address 'www.google.com'
Could you please run wifibox
with debug
logs (see core.conf
at $ETCDIR/wifibox
) and paste the relevant parts of /var/log/wifibox.log
(around issuing the stop
command)? The script should log the output of devctl
which is responsible for destroying the PPT device.
2022-08-28T08:13:13+0200 INFO Begin: wifibox stop
2022-08-28T08:13:13+0200 INFO Tearing down Unix Domain Socket pass-through
2022-08-28T08:13:13+0200 INFO Daemonized socat processes found: [668
677]
2022-08-28T08:13:13+0200 INFO Stopping guest wifibox, managed by PID 467
2022-08-28T08:13:13+0200 INFO Waiting 3 seconds for the guest to stop
2022-08-28T08:13:16+0200 INFO Forcing shutdown of guest wifibox
2022-08-28T08:13:16+0200 INFO Destroying guest wifibox
2022-08-28T08:13:17+0200 INFO Destroying bhyve PPT device: pci2:0:0
2022-08-28T08:13:17+0200 DEBUG [devctl] devctl: Failed to clear pci2:0:0 driver: Device not configured
2022-08-28T08:13:17+0200 WARN The PPT device could not be destroyed
2022-08-28T08:13:17+0200 INFO Removing null-modem devices
2022-08-28T08:13:17+0200 INFO Null-modem devices are removed
2022-08-28T08:13:17+0200 INFO Destroying bridge interface: wifibox0
2022-08-28T08:13:17+0200 INFO Destroying linked tap interface: tap0
2022-08-28T08:13:18+0200 INFO End: wifibox stop
Hrm, that is interesting... Before stopping wifibox
, does pciconf
show the association for the ppt
driver? For example, the pciconf -l | fgrep ppt
command would show that.
I'm running wifibox now (fresh reboot) and this is output:
ppt0@pci0:2:0:0: class=0x028000 rev=0x32 hdr=0x00 vendor=0x168c device=0x003e subvendor=0x1a56 subdevice=0x143a
And after the stop
is the pciconf
command's output the same? You may try to comment out the part in the wifibox
script that runs the devctl clear driver
command on stop
to see if that helps. (It is in the destroy_vm()
function.)
After I stop, the ppt goes away and it becomes "none3"
For me, despite the devctl clear driver
command, the device is still associated with the ppt
driver after stop
. This might has something to do with the following line I added to /boot/loader.conf
:
pptdevs="3/0/0"
Perhaps this could improve the situation for you too. However, on start
, the devctl set driver
should take care of the reassocation with the ppt
driver. After you stopped, and start wifibox
again, what do you see in /var/log/wifibox.log
about the respective devctl
command? There should be lines like these:
2022-08-29T00:59:13+0200 INFO Passthru device is configured: 3/0/0
2022-08-29T00:59:13+0200 INFO PPT driver is configured for pci3:0:0 device
It would also be curious to test what happens if you manually run the following command between stop
and the subsequent start
:
devctl set driver -f pci2:0:0 ppt
All that seemed to be fine, I had those log lines. I also updated my /boo/loader.conf
which seems to remove the "cannot destroy" error, but still does not come back. Do you think there is any possibility it has to do with the tap device or how networking is setup? I have a slightly custom setup.
orbitz@bsdell ~> ifconfig -a
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
inet 127.0.0.1 netmask 0xff000000
inet 10.10.123.4 netmask 0xffffffff
inet 10.10.123.5 netmask 0xffffffff
groups: lo
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wifibox0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 58:9c:fc:00:4f:18
inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 3 priority 128 path cost 2000000
groups: bridge
nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80000<LINKSTATE>
ether 58:9c:fc:10:da:68
groups: tap
media: Ethernet autoselect
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Opened by PID 575
bridge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 58:9c:fc:10:ff:b9
inet 10.10.123.1 netmask 0xffffff00 broadcast 10.10.123.255
id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
member: tap2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 6 priority 128 path cost 2000000
member: tap1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 5 priority 128 path cost 2000000
groups: bridge
nd6 options=9<PERFORMNUD,IFDISABLED>
tap1: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80000<LINKSTATE>
ether 58:9c:fc:10:fb:18
groups: tap
media: Ethernet autoselect
status: no carrier
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
tap2: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80000<LINKSTATE>
ether 58:9c:fc:10:ff:da
groups: tap
media: Ethernet autoselect
status: no carrier
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo1: flags=8008<LOOPBACK,MULTICAST> metric 0 mtu 16384
options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
groups: lo
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
This is with everything running correctly. I notice when I stop and start, wifibox0 doesn't get an IP, which I think should be separate from whether or not the ppt stuff is working, right?
Ohhh! So dhclient wasn't coming up again. I'm not sure if wifibox is supposed to be responsible for that or not. But I manually did dhclient wifibox0
and bam, I'm on the internet!
Maybe this whole thing was just because I didn't read the directions good enough, my apologies if so! I just thought doing service wifibox start
would do everything.
No, wifibox
will not launch dhclient
by itself. It will only create the wifibox0
networking interface and then the classical tools, e.g. the netif
service could be used to assign an IP address to it. It is all documented on the wifibox(8)
manual page. This a conscious design choice to keep things modular (i.e. why to replicate the functionality of netif
).
Dear all, can we say now that this issue has been resolved?
Yes, everything works for me! Thank you for the hard work and patience.
I've had this probably in 0.9.0 and in 0.10.0. If I do
service wifibox stop
and thenservice wifibox start
, eventually my system freezes. At least X11 does. I have to hard reboot my laptop.I am on CURRENT branch so perhaps this is an issue there.
Let me know if there is any more information I can provide.