pgj / freebsd-wifibox

wifibox: Use Linux to drive your wireless card on FreeBSD
BSD 2-Clause "Simplified" License
156 stars 12 forks source link

wlan0 interface inside VM no longer comes up after a few times doing suspend/resume #31

Closed fearedbliss closed 11 months ago

fearedbliss commented 2 years ago

Hello,

I started to experiment with wifibox yesterday and it works pretty well upon a fresh boot. Once I start suspend/resuming (around 2-3 times), the internal "wlan0" interface in the VM no longer comes back up. I've done a few experiments to see what's going on. I don't think this is a bug with wifibox, but rather either bhyve's PCI passthrough code (and FLR). Restarting 'vmm' has no effect once my machine enters this state. Only a full reboot fixes it. I also did make sure that the devd suspend/resume workaround was being triggered by adding a "logger" statement and checking /var/log/messages afterwards. Reporting here for more exposure in this project. We could escalate it up to FreeBSD as needed.

Running On (but probably will also happen on 14-CURRENT):

jon@leslie:~ $ uname -a
FreeBSD leslie 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n251926-488f9d85278: Mon Jul 25 21:01:52 EDT 2022     root@leslie:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

root@leslie:~ # pkg info | grep wifibox
wifibox-1.1.1                  Wireless card driver via virtualized Linux
wifibox-alpine-20220712        Wifibox guest based on Alpine Linux
wifibox-core-0.10.0            Wifibox core functionality

This is the card information if I were to use the 'if_iwlwifi' module:

Jul 26 10:28:16 leslie kernel: Intel(R) Wireless WiFi based driver for FreeBSD
Jul 26 10:28:16 leslie kernel: iwlwifi0: <iwlwifi> mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 10:28:16 leslie kernel: iwlwifi0: could not load firmware image 'iwlwifi-QuZ-a0-jf-b0-72.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: File size way too small!
Jul 26 10:28:16 leslie kernel: iwlwifi0: successfully loaded firmware image 'iwlwifi-QuZ-a0-jf-b0-71.ucode'
Jul 26 10:28:16 leslie kernel: iwlwifi0: api flags index 2 larger than supported by driver
Jul 26 10:28:16 leslie kernel: iwlwifi0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
Jul 26 10:28:16 leslie kernel: iwlwifi0: loaded firmware version 71.058653f6.0 QuZ-a0-jf-b0-71.ucode op_mode iwlmvm
Jul 26 10:28:16 leslie kernel: iwlwifi0: Detected Intel(R) Wireless-AC 9560 160MHz, REV=0x351
Jul 26 10:28:42 leslie kernel: iwlwifi0: Detected RF JF, rfid=0x105110
Jul 26 10:28:42 leslie kernel: iwlwifi0: base HW address: f8:e4:e3:eb:35:02

I've noticed that there is a higher chance for this situation to happen once I start getting FLR transaction messages:

Jul 26 11:16:08 leslie kernel: bridge0: Ethernet address: 58:9c:fc:10:ff:99
Jul 26 11:16:08 leslie kernel: bridge0: changing name to 'wifibox0'
Jul 26 11:16:08 leslie kernel: tap0: Ethernet address: 58:9c:fc:10:c0:3f
Jul 26 11:16:08 leslie kernel: tap0: promiscuous mode enabled
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to DOWN
Jul 26 11:16:08 leslie kernel: ppt0 mem 0xea238000-0xea23bfff at device 20.3 on pci0
Jul 26 11:16:08 leslie kernel: tap0: link state changed to UP
Jul 26 11:16:08 leslie kernel: wifibox0: link state changed to UP
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Resetting with transactions pending after 50 ms
Jul 26 11:16:08 leslie kernel: pci0:0:20:3: Transactions pending after FLR!

Host results (At this point this layer is correct but doesn't matter because the wlan0 inside the VM is gone.)

root@leslie:~ # ifconfig
em0: flags=8862<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=481049b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,VLAN_HWFILTER,NOMAP>
    ether f8:75:a4:ef:9d:81
    media: Ethernet autoselect
    status: no carrier
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wifibox0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 58:9c:fc:10:ff:99
    inet 10.1.0.2 netmask 0xffffff00 broadcast 10.1.0.255
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
            ifmaxaddr 0 port 4 priority 128 path cost 2000000
    groups: bridge
    nd6 options=9<PERFORMNUD,IFDISABLED>
tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=80000<LINKSTATE>
    ether 58:9c:fc:10:c0:3f
    groups: tap
    media: Ethernet autoselect
    status: active
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    Opened by PID 14824

VM networking results upon a bad resume

wifibox:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:A0:98:8A:05:71  
          inet addr:10.1.0.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::2a0:98ff:fe8a:571/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:2428 (2.3 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Attempting to bring up the wlan0 interface in the VM.

wifibox:~# service networking restart
 * WARNING: you are stopping a boot service
 * WARNING: you are stopping a boot service
 * Stopping Unix Domain Socket pass-through ...
 [ ok ]
 * Stopping busybox udhcpd ...
 [ ok ]
 * Stopping networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 * Starting networking ...
 *   lo ...
 [ ok ]
 *   eth0 ...
 [ ok ]
 *   wlan0 ...
ip: SIOCGIFFLAGS: No such device
udhcpc: SIOCGIFINDEX: No such device
ifup: failed to change interface wlan0 state to 'up'
 [ !! ]
wifibox:~#  * Starting Unix Domain Socket pass-through ...
 * Starting busybox udhcpd ...
 [ ok ]
 [ ok ]

Check if ppt0 is attached to the pci device after we started wifibox (As we can see, it is and that's correct).

root@leslie:~ # pciconf -lv
ppt0@pci0:0:20:3:       class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
    vendor     = 'Intel Corporation'
    device     = 'Comet Lake PCH-LP CNVi WiFi'
    class      = network

Shutting down the VM should clear the ppt driver from the wifi card. "The PPT device could not be destroyed" message always pops up even if the resume didn't break the connection. It usually takes 2-3 times before it breaks.

root@leslie:~ # service wifibox stop
Stopping wifibox....WARNING: /var/log/wifibox.log is not writeable, messages could not be logged.
WARNING: The PPT device could not be destroyed.
...OK

root@leslie:~ # pciconf -lv
none3@pci0:0:20:3:      class=0x028000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x02f0 subvendor=0x8086 subdevice=0x0030
    vendor     = 'Intel Corporation'
    device     = 'Comet Lake PCH-LP CNVi WiFi'
    class      = network

There was also an instance where I was able to load the 'if_iwlwifi' module after the wlan0 broke in the VM, and was able to use wireless on the host directly. However, this isn't guaranteed and eventually I'll start getting "ifconfig: SIOCIFCREATE2: Device not configured" messages, preventing the wlan0 interface from even starting up.

Attached is part of my /var/log/messages file that may help:

report.log