pgj / freebsd-wifibox

wifibox: Use Linux to drive your wireless card on FreeBSD
BSD 2-Clause "Simplified" License
158 stars 12 forks source link

Stopping and starting wifibox eventually freezes system #29

Closed orbitz closed 2 years ago

orbitz commented 2 years ago

I've had this probably in 0.9.0 and in 0.10.0. If I do service wifibox stop and then service wifibox start, eventually my system freezes. At least X11 does. I have to hard reboot my laptop.

I am on CURRENT branch so perhaps this is an issue there.

FreeBSD bsdell 14.0-CURRENT FreeBSD 14.0-CURRENT #71 main-n256742-db0110a536bf

Let me know if there is any more information I can provide.

pgj commented 2 years ago

Hi @orbitz: Even if X11 freezes, it might be still possible to switch to one of the virtual terminals by pressing Ctrl+Alt+Fn and login there. Or, if there is an SSH server running, login to the computer remotely. This may indeed be a regression in -CURRENT: things can change back and forth here and there any time. Is not there a crash dump in these cases?

Wifibox does not control the lower part of the stack, i.e. bhyve(8) and vmm(4) so any instability or bug introduced here will be just suffered. Hopefully once I will have some time to establish a custom version of bhyve(8) + vmm(4) to provide a more common and stable foundations, independent from what is going on in the base system.

orbitz commented 2 years ago

I did not include it in my original comment, but I had no luck with switching between virtual terminals. I haven't been in a position to try to ssh in though, I will try in the future.

pgj commented 2 years ago

@orbitz is this issue still happening? Have you moved to a different version of -CURRENT in the meantime?

orbitz commented 2 years ago

Yes this is still happening. I started and stopped wifibox and then switched over to the text view to see what happens (since it takes a about 30 seconds to lock up). What I saw was, what looks like, the USB driver being loaded and unloaded, and then it hits nvme issue, and then I get kernel crash. I included an image I took once it hit the debugger.

IMG20220810154433

Alix82 commented 2 years ago

Same issue here for me on a Thinkpad x270 on 13.1-RELEASE and 14-CURRENT. Restarting wifibox breaks for some reason the nvme controller.

Please note that doing wireless passthrough manually without wifibox doesn't trigger the issue, I mean attaching/detaching the ppt driver, restarting the vm don't cause any issue, only when using wifibox, and restarting it things break badly.

I can provide more information if required.

pgj commented 2 years ago

Hi @Alix82: Thanks for the additional info. It would be interesting to learn exactly after which command the nvme driver gets broken. The code does not do anything special, it simply calls the commands as you mentioned. Although I have a Thinkpad X270, I do not have an NVMe disk therefore I will not be able to reproduce this.

What commands, in what order, with what parameters did you use to implement the wireless pass-through manually when there are no issues? It might even be a timing issue, i.e. wifibox executes these commands too quickly.

Alix82 commented 2 years ago

I will debug the issue later today, I don't think that only the nvme driver gets broken, if you look at the screenshot attached by orbitz, you notice that also the uhub0 detached, so something at the pci association level gets broken when restarting wifibox. It could be as you said timing issue that triggers a bug in the kernel, I will debug it further and post here.

Manually I do

devctl set driver -f pci0:3:0:0 ppt start vm using bhyve

then I shutdown the VM normally using ssh on a root session, before going to suspend. I have blacklisted if_iwm and if_iwlwifi in both loader.conf and rc.conf. I don't unload the vmm driver.

It is interesting that wifibox restarts itself on resume from suspend, and all goes fine in this case! Only restarting it while the system is running triggers the issue.

Alix82 commented 2 years ago

After debugging the issue, It turns out that unloading the vmm module break things, which is what I wasn't doing manually.

So just by making sure that the vmm is loaded in wifibix;assert_vmm_loaded and load it if it not the case solves the issue for me!

pgj commented 2 years ago

@Alix82 what does start vm using bhyve mean in your case? Are you launching the VM shipped with wifibox with the same set of parameters, right? Note that wifibox stops the VM by sending a power-off signal (via SIGTERM) first and after 3 seconds the VM is forcefully powered off by the --force-poweroff flag of bhyvectl.

By default, devd is configured to execute the wifibox restart vmm command on resume, which will unload and re-load the vmm kernel module. Of course, this can be replaced with any other action, the restart command supports different levels (guest, netif, vmm).

Sorry, I do not understand your comment about assert_vm_loaded. The purpose of that function is load vmm if it is not present. For what it is worth, bhyve will do that as well. The assert_vm_loaded is needed only to load the right version of vmm automatically. That is only important when the sysutils/bhyve+ port is in use, but since that is known to be unstable it is not recommended to be used on FreeBSD 13 and later.

Alix82 commented 2 years ago

@Alix82 what does start vm using bhyve mean in your case? Are you launching the VM shipped with wifibox with the same set of parameters, right? Note that wifibox stops the VM by sending a power-off signal (via SIGTERM) first and after 3 seconds the VM is forcefully powered off by the --force-poweroff flag of bhyvectl.

Yes I'm using the vm shipped with wifibox, with the default parameters.

By default, devd is configured to execute the wifibox restart vmm command on resume, which will unload and re-load the vmm kernel module. Of course, this can be replaced with any other action, the restart command supports different levels (guest, netif, vmm).

yes this works fine.

Sorry, I do not understand your comment about assert_vm_loaded. The purpose of that function is load vmm if it is not present. For what it is worth, bhyve will do that as well. The assert_vm_loaded is needed only to load the right version of vmm automatically. That is only important when the sysutils/bhyve+ port is in use, but since that is known to be unstable it is not recommended to be used on FreeBSD 13 and later.

I don't have use bhyve+, It seems that the function assert_vm_loaded in wifibox script unloads and reload the vmm module, so basically disabling that function fixes the restart issue for me.

I just modified it as follows: assert_vmm_loaded() { local _kmod_path kldload vmm return 0

obviously the above code is just for testing purpose, and that fixes the issue for me.

pgj commented 2 years ago

The assert_vmm_loaded unloads vmm only if the wrong version of it is loaded. The actual unloading and reloading for suspend happens in reload_vmm, basically unconditionally. It would be interesting to learn what assert_vmm_loaded sees, why it decides to eventually reload vmm.

What does the command kldstat -v -n vmm | tail +2 | head -1 say? Also, could you please tell what the value is for the VMM_KO variable in the wifibox script (around line 18, in $LOCALBASE/sbin)?

Alix82 commented 2 years ago

ali@frix270 ~> kldstat -v -n vmm | tail +2 | head -1 56 1 0xffffffff82400000 543438 vmm.ko (/boot/kernel/vmm.ko)

${VMM_KO} is vmm.ko

pgj commented 2 years ago

Thanks @Alix82! Looks like thevmm kernel module is reloaded because the two strings (vmm.ko vs. /boot/kernel/vmm.ko) do not match, but that is unnecessary. It is curious why reloading itself causes so much trouble, nevertheless it should not happen at all.

Could you please change the value of VMM_KO to /boot/kernel/vmm.ko in the wifibox script to test if this would fix the problem?

ltning commented 2 years ago

In my case this issue happens on wifibox stop - is that expected? The console output looks exactly like the screen"picture" posted by @orbitz ..

ltning commented 2 years ago

...actually scratch that, it happens during startup after it's been stopped once. Sorry, ignore me, move along..

Alix82 commented 2 years ago

Could you please change the value of VMM_KO to /boot/kernel/vmm.ko in the wifibox script to test if this would fix the problem?

Yes that fixes the issue. IMHO, I don't think that vmm.ko should ever be unloaded when the system is running, I run, together with wifibox, other VMs as well.

pgj commented 2 years ago

If you check the code, you will see this is not about reloading the vmm kernel module, but replacing it. The unloading and reloading happens only if the vmm.ko is loaded from a path that is not expected. This makes sense when sysutils/bhyve+ is in use, because it deploys a customized version of vmm under /boot/modules, and it has to be made sure that the binaries are interacting with that one before launching the VM.

When sysutils/bhyve+ is not used (which is usually the case, as bhyve+ has been put back into the "experimental" state for the sake of better stability) this all should be a "no op". Unfortunately, it is not, and that is the bug. But now that we identified it, I can publish a proposed fix for the problem soon-ish.

Alix82 commented 2 years ago

Great, all clear now!

BTW, many thanks for the this very handy tool!

pgj commented 2 years ago

I have published a version of the net/wifibox-core port at https://github.com/pgj/freebsd-wifibox-port/tree/3496d6ae1a2e345fd842f3494d7324dd469a7e98 that includes a fix. Please test it.

Alix82 commented 2 years ago

Sorry for the delay, the new port works for me

pgj commented 2 years ago

Okay, I have added the fix to the main branch as f7ac871e5e965754c0db5074828b1ad9c8fd756d . This is going to be published as part of the next release.

pgj commented 2 years ago

By the way, @orbitz does this fix the problem for you?

orbitz commented 2 years ago

Sorry for the late reply, but I have finally tested this and it does not freeze my system. However, it never successfully comes back for me. I can't ping nothing. When I stop wifibox I do get an issue about PPT not being successfully destroyed, I'm not sure if that is related.

pgj commented 2 years ago

@orbitz what do you mean by wifibox not coming back for you? What do the logs say? Does the console command work?

orbitz commented 2 years ago

Here is some copy and paste from my console log as I toyed with it, let me know what else I can do:

orbitz@bsdell ~> sudo tail /var/run/wifibox/appliance/log/dmesg
[    1.036136] ath10k_pci 0000:00:06.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[    1.329688] ath10k_pci 0000:00:06.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
[    1.329692] ath10k_pci 0000:00:06.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 0
[    1.330037] ath10k_pci 0000:00:06.0: firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c
[    1.402688] ath10k_pci 0000:00:06.0: board_file api 2 bmi_id N/A crc32 62812cba
[    1.508712] ath10k_pci 0000:00:06.0: htt-ver 3.87 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
[    1.570620] ath: EEPROM regdomain: 0x6c
[    1.570637] ath: EEPROM indicates we should expect a direct regpair map
[    1.570638] ath: Country alpha2 being used: 00
[    1.570639] ath: Regpair used: 0x6c
orbitz@bsdell ~> ping www.google.com
PING www.google.com (173.194.79.105): 56 data bytes
64 bytes from 173.194.79.105: icmp_seq=0 ttl=105 time=17.653 ms
64 bytes from 173.194.79.105: icmp_seq=1 ttl=105 time=20.289 ms
64 bytes from 173.194.79.105: icmp_seq=2 ttl=105 time=13.353 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.353/17.099/20.289/2.859 ms
orbitz@bsdell ~> sudo service wifibox stop
Stopping wifibox....WARNING: The PPT device could not be destroyed.
...OK
orbitz@bsdell ~> ping www.google.com
PING www.google.com (173.194.79.147): 56 data bytes
ping: sendto: No route to host
^C
--- www.google.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
orbitz@bsdell ~> sudo service wifibox start
Starting wifibox.......OK
orbitz@bsdell ~> ping www.google.com
ping: UDP connect: No route to host
orbitz@bsdell ~> wpa_cli status
Selected interface 'wlan0'
'STATUS' command timed out.
orbitz@bsdell ~> sudo tail /var/run/wifibox/appliance/log/dmesg
[    0.805364]     BOOT_IMAGE=(host)/usr/local/share/wifibox/vmlinuz
[    0.805364]     modules=squashfs
[    1.064464] ath10k_pci 0000:00:06.0: can't derive routing for PCI INT A
[    1.064467] ath10k_pci 0000:00:06.0: PCI INT A: no GSI
[    1.064885] ath10k_pci 0000:00:06.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[    1.314228] ath10k_pci 0000:00:06.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
[    1.314232] ath10k_pci 0000:00:06.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 0
[    1.314579] ath10k_pci 0000:00:06.0: firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c
[    1.385462] ath10k_pci 0000:00:06.0: board_file api 2 bmi_id N/A crc32 62812cba
[    1.493510] ath10k_pci 0000:00:06.0: htt-ver 3.87 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1

Inside wifibox:
wifibox:~# ping www.google.com
ping: bad address 'www.google.com'
pgj commented 2 years ago

Could you please run wifibox with debug logs (see core.conf at $ETCDIR/wifibox) and paste the relevant parts of /var/log/wifibox.log (around issuing the stop command)? The script should log the output of devctl which is responsible for destroying the PPT device.

orbitz commented 2 years ago
2022-08-28T08:13:13+0200 INFO  Begin: wifibox stop
2022-08-28T08:13:13+0200 INFO  Tearing down Unix Domain Socket pass-through
2022-08-28T08:13:13+0200 INFO  Daemonized socat processes found: [668
677]
2022-08-28T08:13:13+0200 INFO  Stopping guest wifibox, managed by PID 467
2022-08-28T08:13:13+0200 INFO  Waiting 3 seconds for the guest to stop
2022-08-28T08:13:16+0200 INFO  Forcing shutdown of guest wifibox
2022-08-28T08:13:16+0200 INFO  Destroying guest wifibox
2022-08-28T08:13:17+0200 INFO  Destroying bhyve PPT device: pci2:0:0
2022-08-28T08:13:17+0200 DEBUG [devctl] devctl: Failed to clear pci2:0:0 driver: Device not configured
2022-08-28T08:13:17+0200 WARN  The PPT device could not be destroyed
2022-08-28T08:13:17+0200 INFO  Removing null-modem devices
2022-08-28T08:13:17+0200 INFO  Null-modem devices are removed
2022-08-28T08:13:17+0200 INFO  Destroying bridge interface: wifibox0
2022-08-28T08:13:17+0200 INFO  Destroying linked tap interface: tap0
2022-08-28T08:13:18+0200 INFO  End: wifibox stop
pgj commented 2 years ago

Hrm, that is interesting... Before stopping wifibox, does pciconf show the association for the ppt driver? For example, the pciconf -l | fgrep ppt command would show that.

orbitz commented 2 years ago

I'm running wifibox now (fresh reboot) and this is output:

ppt0@pci0:2:0:0:        class=0x028000 rev=0x32 hdr=0x00 vendor=0x168c device=0x003e subvendor=0x1a56 subdevice=0x143a
pgj commented 2 years ago

And after the stop is the pciconf command's output the same? You may try to comment out the part in the wifibox script that runs the devctl clear driver command on stop to see if that helps. (It is in the destroy_vm() function.)

orbitz commented 2 years ago

After I stop, the ppt goes away and it becomes "none3"

pgj commented 2 years ago

For me, despite the devctl clear driver command, the device is still associated with the ppt driver after stop. This might has something to do with the following line I added to /boot/loader.conf:

pptdevs="3/0/0" 

Perhaps this could improve the situation for you too. However, on start, the devctl set driver should take care of the reassocation with the ppt driver. After you stopped, and start wifibox again, what do you see in /var/log/wifibox.log about the respective devctl command? There should be lines like these:

2022-08-29T00:59:13+0200 INFO  Passthru device is configured: 3/0/0
2022-08-29T00:59:13+0200 INFO  PPT driver is configured for pci3:0:0 device

It would also be curious to test what happens if you manually run the following command between stop and the subsequent start:

devctl set driver -f pci2:0:0 ppt
orbitz commented 2 years ago

All that seemed to be fine, I had those log lines. I also updated my /boo/loader.conf which seems to remove the "cannot destroy" error, but still does not come back. Do you think there is any possibility it has to do with the tap device or how networking is setup? I have a slightly custom setup.

orbitz@bsdell ~> ifconfig -a
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
        inet 127.0.0.1 netmask 0xff000000
        inet 10.10.123.4 netmask 0xffffffff
        inet 10.10.123.5 netmask 0xffffffff
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wifibox0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 58:9c:fc:00:4f:18
        inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 3 priority 128 path cost 2000000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:da:68
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        Opened by PID 575
bridge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 58:9c:fc:10:ff:b9
        inet 10.10.123.1 netmask 0xffffff00 broadcast 10.10.123.255
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 6 priority 128 path cost 2000000
        member: tap1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 5 priority 128 path cost 2000000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
tap1: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:fb:18
        groups: tap
        media: Ethernet autoselect
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
tap2: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:ff:da
        groups: tap
        media: Ethernet autoselect
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo1: flags=8008<LOOPBACK,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

This is with everything running correctly. I notice when I stop and start, wifibox0 doesn't get an IP, which I think should be separate from whether or not the ppt stuff is working, right?

orbitz commented 2 years ago

Ohhh! So dhclient wasn't coming up again. I'm not sure if wifibox is supposed to be responsible for that or not. But I manually did dhclient wifibox0 and bam, I'm on the internet!

orbitz commented 2 years ago

Maybe this whole thing was just because I didn't read the directions good enough, my apologies if so! I just thought doing service wifibox start would do everything.

pgj commented 2 years ago

No, wifibox will not launch dhclient by itself. It will only create the wifibox0 networking interface and then the classical tools, e.g. the netif service could be used to assign an IP address to it. It is all documented on the wifibox(8) manual page. This a conscious design choice to keep things modular (i.e. why to replicate the functionality of netif).

pgj commented 2 years ago

Dear all, can we say now that this issue has been resolved?

orbitz commented 2 years ago

Yes, everything works for me! Thank you for the hard work and patience.