pgj / freebsd-wifibox

wifibox: Use Linux to drive your wireless card on FreeBSD
BSD 2-Clause "Simplified" License
159 stars 12 forks source link

wifibox console: no output; wpa_supplicant repeatedly exits/cannot find interfaces #100

Closed sk-spr closed 1 week ago

sk-spr commented 4 months ago

Description

A clear and concise description of what the problem is. What is expected from the application? wifibox console produces no output other than the usual

Connecting, type "~." to leave the session...
Connected
~
[EOT]
Finished.

there is no login prompt. In addition, dmesg shows repeated exits of wpa_supplicant, this may or may not be related.

Host operating system

Output of uname -a. FreeBSD freebsdbook 14.0-RELEASE FreeBSD 14.0-RELEASE #0 releng/14.0-n265380-f9716eee8ab4: Fri Nov 10 05:57:23 UTC 2023 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

Wireless NIC

Relevant output of pciconf -lv.

none0@pci0:2:0:0:   class=0x028000 rev=0x00 hdr=0x00 vendor=0x14c3 device=0x7961 subvendor=0x1a3b subdevice=0x4680
    vendor     = 'MEDIATEK Corp.'
    device     = 'MT7921 802.11ax PCI Express Wireless Network Adapter'
    class      = network

Wifibox version

Output of wifibox version. Note that only the latest version is supported.

wifibox version 0.13.0
Disk image checksum: 94904a08ee902c4a55dd345f8fffa2a96191cc8ad49f24031fe6eaf4acda5d05

Disk image type and version

The kind of VM image in use, e.g. Wifibox/Alpine, and its version. wifibox-alpine-20240506

Changes to the default configuration files

Include relevant sections from all the configuration files that contain local changes and may help to reproduce the problem.

bhyve.conf:

console=yes
passthru=2/0/0

wpa_supplicant.conf:

network={
    ssid="<correct ssid>"
    psk="<correct key>"
    key_mgmt=WPA-PSK
}

Logs

Please copy and paste any relevant log output from the following sources:



### Additional context

Add any other context about the problem here that might help the investigation.
I've tried both wifibox-alpine and wifibox-mediatek guests, and both behave this way.

### Have you tried to turn it on and off?

- [X] Yes, I have read all the manual pages first!
pgj commented 4 months ago

The difference between the wifibox-alpine and wifibox-alpine-mediatek guests is not that big: the latter image is slimmer because it contains the firmware files for the Mediatek cards only. That is basically a flavor of the former.

There must be some problem with starting up the mt7921e driver because the wlan0 device could not be created on the guest. A reason for the guest being unresponsive when attached by wifibox console might be that it freezes on boot. However, according to the logs, the guest can somehow boot up. Another possible reason might be that the nmdm driver does not properly and the communication itself becomes stuck.

It might be curious to try the following. In one of the terminals, start the wifibox console command and leave it running. Then, in a different terminal, use the following command to restart the guest:

# wifibox restart guest

You should see the boot messages of Linux kernel on the terminal where the console is running.

Also, cannot you see any odd message in the logs of the host?

sk-spr commented 4 months ago

The difference between the wifibox-alpine and wifibox-alpine-mediatek guests is not that big: the latter image is slimmer because it contains the firmware files for the Mediatek cards only. That is basically a flavor of the former.

Thank you, this was not entirely clear to me but now I know :)

It might be curious to try the following. In one of the terminals, start the wifibox console command and leave it running. Then, in a different terminal, use the following command to restart the guest:

# wifibox restart guest

You should see the boot messages of Linux kernel on the terminal where the console is running.

Indeed I can, and it even lets me log in apparently. Relevant information from the boot messages might include the following:

* Could not find a wireless interface
 * /var/run/wpa_supplicant: creating directory
 * Starting WPA Supplicant ... [ ok ]
 * Starting networking ... *   lo ... [ ok ]
 *   eth0 ... [ ok ]
 *   wlan0 ...ip: SIOCGIFFLAGS: No such device
dhcpcd-10.0.6 starting
DUID 00:01:00:01:2d:d2:8e:06:00:a0:98:8a:05:71
wlan0: interface not found
dhcpcd exited

Running lspci -k on the guest appears to show that the NIC is passed through and picked up by mt7921e:

wifibox:~# lspci -k
00:1f.0 Class 0601: 8086:7000
00:04.2 Class 0100: 1af4:1009 virtio-pci
00:04.0 Class 0100: 1af4:1001 virtio-pci
00:00.0 Class 0600: 1275:1275
00:04.3 Class 0100: 1af4:1009 virtio-pci
00:06.0 Class 0280: 14c3:7961 mt7921e
00:04.1 Class 0100: 1af4:1009 virtio-pci
00:05.0 Class 0200: 8086:100f e1000

however, the interfacewlan0 is not up as already indicated by the boot log, confirmed by ifconfig:

wifibox:~# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 90:E8:68:43:F1:69  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Also, cannot you see any odd message in the logs of the host? Yes, this dmesg line seems odd, since the output from pciconf -lv indicates the card is at pci0:2:0:0, though I might be off here:

pci2: <network> at device 0.0 (no driver attached)

This mirrors another odd warning when restarting wifibox:

Restarting wifibox....WARNING: PPT device pci2:0:0 could not be destroyed.
.....OK

Here again, shouldn't the device be pci0:2:0:0? The manpages indicate that my passthru notation (see above) is correct for this device...

pgj commented 4 months ago

Thanks for the inputs. If you can see the card in the lspci -k output with the driver attached on the guest, it means that you did everything right on your side.

The no driver attached message in the host's dmesg is completely normal. However, the could not be destroyed message in the Wifibox log is a bit worrisome. It is worth to know that the relationship of bhyve with Mediatek cards can be flaky, see https://github.com/pgj/freebsd-wifibox/issues/73 for the details.

I am a bit confused about the ifconfig output on the guest. So you say that there is the wlan0 device but it is not in the "up" state? You can try to initiate the reconfiguration of the network interface on wifibox console with the following command:

wifibox# service networking restart

It would be interesting to see what wpa_supplicant actually complains about. You can get more logs from it by stopping the related the service on the guest:

wifibox# service wpa_supplicant stop

and then start WPA Supplicant in the foreground (to watch the logs):

wifibox# wpa_supplicant -iwlan0 -dd -c/etc/wpa_supplicant/wpa_supplicant.conf

This might also be caused by some regression in the Linux kernel. If you switch to the net/wifibox-alpine port, you can optionally select Linux 6.8 (as "edge") in the options for building the guest image and test with that if the driver (and the related networking stack) of that version works better for you.

Besides all these, you can experiment with pre-configuring the ppt (PCI pass-through) driver for the device and boot the host that way. This requires the loader.conf to have the following line:

pptdevs="2/0/0"
sk-spr commented 4 months ago

I have some interesting news based on your comment: When manually restarting networking inside the guest (after getting into the console by means of wifibox console and wifibox restart guest) using service networking restart after restarting wpa_supplicant (whose logs show nothing abnormal), wlan0 no longer times out waiting for a carrier (see below)!

Originally, this was the (relevant section of the) log from service networking restart:

 * Starting networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 *   wlan0 ...
dhcpcd-10.0.6 starting
wlan0: waiting for carrier
timed out
dhcpcd exited
ifup: failed to change interface wlan0 state to 'up'

however, after running service wpa_supplicant restart, and then service networking restart:

wifibox:~# service wpa_supplicant restart
 * Starting WPA Supplicant ...
 [ ok ]
wifibox:~# service networking restart
 * Starting networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 *   wlan0 ...
dhcpcd-10.0.6 starting
DUID 00:01:00:01:2d:d3:a4:e4:90:e8:68:43:f1:69
wlan0: connected to Access Point: TP-Link_22F4
wlan0: IAID 68:43:f1:69
wlan0: adding address fe80::92e8:68ff:fe43:f169
wlan0: rebinding lease of 10.0.0.16
wlan0: soliciting an IPv6 router
wlan0: Router Advertisement from fe80::1
wlan0: requesting DHCPv6 information
wlan0: REPLY6 received from fe80::1
wlan0: refresh in 86400 seconds
wlan0: probing address 10.0.0.16/24
wlan0: leased 10.0.0.16 for 86400 seconds
wlan0: adding route to 10.0.0.0/24
wlan0: adding default route via 10.0.0.138

A quick check on the host confirms that the network is fully online:

freebsdbook% ping google.com
PING google.com (142.250.180.206): 56 data bytes
64 bytes from 142.250.180.206: icmp_seq=0 ttl=114 time=43.840 ms
64 bytes from 142.250.180.206: icmp_seq=1 ttl=114 time=54.271 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 43.840/49.055/54.271/5.215 ms
freebsdbox% ifconfig
[...]
wifibox0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
    options=0
    ether 58:9c:fc:10:eb:66
    inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 3 priority 128 path cost 2000000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether 58:9c:fc:10:13:38
    groups: tap
    media: Ethernet 1000baseT <full-duplex>
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    Opened by PID 4127

Very interesting! I would be curious if this behaviour affects other configurations as well; if it does, perhaps a change in the guest's service init order might work, though I'd have to try this (and am not very familiar with openrc configuration, unfortunately...)

pgj commented 4 months ago

There is a slight chance the issue shows up for others as well. However, I am not sure if the problem is with the init order of the guest. The wpa_supplicant service declares that it has to be started before networking (i.e. provider of net).

What I can currently imagine is that during the boot process, WPA Supplicant fails to find the wlan0 device, it cannot start up, and therefore the networking service fails to work as well. But if wpa_supplicant is restarted at some later point, everything starts to work. Maybe the mt7921e driver takes a long time to initialize? What are the timestamps and messages for that in the dmesg output? Could you please add the output of the following command:

wifibox# dmesg | fgrep mt7921e

By the way, I noticed in the logs that your wireless network uses the 10.0.0.0/24, which is the same as what Wifibox uses by default to communicate with the host. This all appears to be working, but maybe it is worth to tell Wifibox to use a different subnet (e.g. 10.0.1.0/24 to avoid any potential problems caused by this overlap.

vvillenave commented 2 months ago

Greetings, I can verify the same behavior here with the same Mediatek card, wifibox 0.13.0.

Enabling the console and launching manually (with some delay, I guess) wpa_supplicant, then dhcpcd (interestingly, neither udhcpc nor udhcpc6 work, though BusyBox dhcpcd seems to invoke udhcpc), works around the behavior. Maybe a delayed retry (as in https://github.com/pgj/freebsd-wifibox-port/commit/954eff34efe6dd970470612602369dbdedbec096) would address the problem?

Alternatively, is there a mechanism through which one can specify a set of commands to be launched on the guest start (and possibly even on suspend/resume)? Since the guest image is read-only, I can’t modify the guest’s rc.conf or .bashrc or anything. The two ways I could see this implemented would be either as a squashfs overlay (think Puppy Linux or any LiveUSB with additional persistent storage), or more simply a script passed through a bhyve arg or variable (like with some Docker images), much like the passthru option setting. But I may be reinventing the wheel here...

pgj commented 2 months ago

Hi @vvillenave, thanks for the confirmation. The delayed retry (of WPA Supplicant) is already part of the development version of the port (as you pointed out above) so you or @sk-spr can try that.

There is no explicit support for specifying commands on start/stop/suspend/resume for the guest. First of all, the guest does not deal with suspend/resume, that is always translated to start/stop (host suspend: guest stop or no-op, host resume: guest start or restart).

Second, even though the guest image is read-only, you can modify how it is constructed in the net/wifibox-alpine port. For example, the pre-build target in the port's Makefile can be used to add extra files for the guest's root file system (which is sourced from _GUESTDIR during the build process). But, since the whole construction is based on the use of Alpine Linux packages for modularity, it is often better to create your own package with the planned modifications and add the package to the list of packages to use (_PACKAGES variable). Even further, the boot/shutdown process could be influenced by creating OpenRC scripts and hooking them up (e.g. _BOOT_SERVICES variable). An example of that is the Wifibox-custom uds_passthru package.

But in cases of problems like this one, it is better to find a general solution. That is, have it reported here and create a solution together that works for everybody. For example, the delayed retry I implemented for WPA Supplicant was designed in a way to possibly cover other scenarios, and with some luck, it can even provide a solution for this issue as well.

pgj commented 1 month ago

@sk-spr @vvillenave please try the latest development version of Wifibox, mostly notably the 20240821 (or later) version of the net/wifibox-alpine port from here:

https://github.com/pgj/freebsd-wifibox-port

I am inclined to assume that the exponential back-off for WPA supplicant (https://github.com/pgj/freebsd-wifibox-alpine/pull/25) solves the issue though. Please check it and let me know in the next two weeks. Then I am going to close this ticket as resolved unless I hear back otherwise.

sk-spr commented 1 month ago

@sk-spr @vvillenave please try the latest development version of Wifibox, mostly notably the 20240821 (or later) version of the net/wifibox-alpine port from here:

https://github.com/pgj/freebsd-wifibox-port

I am inclined to assume that the exponential back-off for WPA supplicant (https://github.com/pgj/freebsd-wifibox-alpine/pull/25) solves the issue though. Please check it and let me know in the next two weeks. Then I am going to close this ticket as resolved unless I hear back otherwise.

I will try and find the install, though I suspect it is on an external SSD that I may have wiped. I will let you know if it works on the case I still have the install around.

sk-spr commented 1 month ago

@pgj After reinstalling FreeBSD because I had indeed wiped the drive with the previous install, and setting up wifibox again, it appears to be fully working (!)

@vvillenave if it's working for you as well, I'd say this issue is ready to be closed 🤩

pgj commented 1 week ago

All right folks, based on the replies I have got so far, I assume that this issue has been fixed. Remember though that you can always come back and create a new ticket if the problem still persists in some form.

vvillenave commented 1 week ago

@pgj I can confirm that the updated Alpine image works out-of-the-box. Thanks very much!