monome / norns-image

79 stars 16 forks source link

Wifi (Realtek 8192cu) issues #64

Closed simonvanderveldt closed 5 years ago

simonvanderveldt commented 5 years ago

This issue is about the problems with the included wifi nub, purely on an OS level. The integration with matron/the menu is out of scope. Same for hotspot related issues.

We've been having quiet a lot of reports from users and experiences ourselves with the wifi being unreliable. We should figure out why that is and try to find a solution to make it work reliably.

Symptoms seem a bit random, for example the network connection works fine (ping google.com no drops), reboot, works fine, reboot again, works fine, reboot another time, it doesn't work anymore.

simonvanderveldt commented 5 years ago

One observation is that there are several drivers one can use for the chipset used in the usb nub (Realtek 8192cu).

I'd normally choose to use the upstream kernel driver (rtl8192cu) but there has been some older/ongoing discussion on the Raspberry Pi kernel about which to use with some people reporting issues with the rtl8192cu driver. See https://github.com/raspberrypi/linux/pull/1468 and https://github.com/raspberrypi/linux/pull/1525 for examples. I don't really care about the LED, but the issues mentioned here would at the very least need some decent testing to check if they have been fixed in the meantime https://github.com/raspberrypi/linux/pull/1525#issuecomment-225876599 It has received some commits since that report https://github.com/raspberrypi/linux/commits/rpi-4.19.y/drivers/net/wireless/realtek/rtlwifi/rtl8192cu

Since we haven't actively changed anything ourselves we're currently using the 8192cu driver.

simonvanderveldt commented 5 years ago

One thing I've observed is that in cases where the wifi didn't work after a restart/boot was that for currently unknown reasons the wireless connection gets soft blocked by rfkill

we@norns:~$ rfkill list
2: phy2: Wireless LAN
        Soft blocked: yes
        Hard blocked: no

Unblocking rfkill unblock all and enabling the device using NetworkManager nmcli radio wifi on makes it work again.

Also of interest is the systemd-rfkill.service systemd service which saves the rfkill state on state change and restores it on boot. Not sure if this fails to restore it in the failure cases or if something else is broken.

I also tried rfkill.default_state=1 as kernel command line option, this should default all radios to unblocked (see https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt) but that didn't seem to do anything. Have to investigate that a bit more.

simonvanderveldt commented 5 years ago

Just dropping some notes here, will investigate more tomorrow:

One thing that seems of is the states as written by systemd-rfkill.service, this might or might now at least partially explain why different USB ports sometimes give different results

we@norns:~$ cat /var/lib/systemd/rfkill/platform-3f980000.usb-usb-0\:1.1\:1.0\:wlan 
0
we@norns:~$ cat /var/lib/systemd/rfkill/platform-3f980000.usb-usb-0\:1.2\:1.0\:wlan 
0
we@norns:~$ cat /var/lib/systemd/rfkill/platform-3f980000.usb-usb-0\:1.3\:1.0\:wlan 
1
we@norns:~$ cat /var/lib/systemd/rfkill/platform-3f980000.usb-usb-0\:1.4\:1.0\:wlan 
1

These files are used by the systemd-rfkill service to restore the rfkill state. 1.2 is the USB port next to the UART on my prototype PCB, 1.3 the middle one and 1.4 the USB port next to the audio jacks. Interestingly enough in all 3 ports it's marked as soft blocked after startup.

On another layer/item of the whole networking stack NetworkManager also has a state file and that also has wireless disabled

we@norns:~$ cat /var/lib/NetworkManager/NetworkManager.state 
[main]
NetworkingEnabled=true
WirelessEnabled=false
WWANEnabled=true

All of this is after a clean/first boot using the 190303 (2.0 beta) image.

simonvanderveldt commented 5 years ago

OK, some low-level details just to understand the whole stack:

rfkill

rfkill starts off with all radios unblocked by default because the rfkill.default_state module parameter defaults to 1 in the code. 1 means unblocked

rfkill.default_state= 0 "airplane mode". All wifi, bluetooth, wimax, gps, fm, etc. communication is blocked by default. 1 Unblocked.

On init this value is then used to initialize all rfkill blocks to unblocked.

systemd-rfkill

Once systemd starts it will also start systemd-rfkill (via socket activiation using systemd-rfkill.socket). As mentioned before this restores the rfkill state of individual devices from /var/lib/systemd/rfkill. If the last state before shutdown was blocked it will be restored to blocked, if it was unblocked it will be restored to unblocked.

NetworkManager

NetworkManager pretty much ignores the softblock and purely acts based on it's desired state for the wifi. The state that's used on boot is the state as defined in /var/lib/NetworkManager/NetworkManager.state, with the WirelessEnabled field being the relevant one. It's state (true for on, false for off) will make NetworkManager change the rfkill state as well. Example log with WirelessEnabled=false:

we@norns:~$ journalctl | grep -i kill
Mar 08 23:14:54 norns systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Mar 08 23:14:55 norns systemd[1]: Starting Load/Save RF Kill Switch Status...
Mar 08 23:14:55 norns systemd[1]: Started Load/Save RF Kill Switch Status.
Mar 08 23:14:56 norns NetworkManager[324]: <info>  [1552061696.5589] rfkill0: found WiFi radio killswitch (at /sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.4/1-1.4:1.0/ieee80211/phy0/rfkill0) (driver rtl8192cu)
Mar 08 23:14:57 norns NetworkManager[324]: <info>  [1552061697.1539] manager: WiFi enabled by radio killswitch; disabled by state file
Mar 08 23:14:57 norns NetworkManager[324]: <info>  [1552061697.1542] manager: WWAN enabled by radio killswitch; enabled by state file
Mar 08 23:14:57 norns NetworkManager[324]: <info>  [1552061697.3180] manager: WiFi now disabled by radio killswitch

So we start with wifi being enabled by rfkill but NetworkManager then turns it off/soft blocks it because /var/lib/NetworkManager/NetworkManager.state says WirelessEnabled=false.

This desired state can be changed using for example nmcli radio wifi on. This will unblock the rfkill soft block and set WirelessEnabled=true in /var/lib/NetworkManager/NetworkManager.state.

simonvanderveldt commented 5 years ago

So I've been trying the wifi connection using the (current default) 8192cu driver through a multitude of reboots and shutdowns and the wifi connection has been working 100% of the time for me.

The only thing I changed vs the stock 190303 image was setting WirelessEnabled=true in /var/lib/NetworkManager/NetworkManager.state whilst NetworkManager wasn't running (otherwise it will overwrite the state file again). After that I rebooted and didn't touch NetworkManager anymore. This action can probably be replaced by running nmcli radio wifi on instead.

I'll try several cycles of disconnecting and connecting as well as disabling wifi next.

tehn commented 5 years ago

fantastic.

what would be the best way to update the image with nmcli radio wifi on?

simonvanderveldt commented 5 years ago

what would be the best way to update the image with nmcli radio wifi on?

Once I know for sure it's working 100% of the time I'll make a PR (if necessary) to make the required changes.

ngwese commented 5 years ago

Just to be clear, the change being discussed here w.r.t keeping the radios enabled is unrelated to whether the Realtek nub works in HOTSPOT/AP mode correct?

simonvanderveldt commented 5 years ago

Just to be clear, the change being discussed here w.r.t keeping the radios enabled is unrelated to whether the Realtek nub works in HOTSPOT/AP mode correct?

Correct, this is just about regulier client (station) wifi usage, not hotspot (AP).

simonvanderveldt commented 5 years ago

So I can 100% reliably get a connection going by starting from the latest image (190314, although it also worked fine on the older ones) running nmcli radio wifi on and then connecting via nmtui. Then on shutdown viasleepwifi gets turned off and when I start again wifi is still off, I go tosystem > wifi`, it's takes a bit of time and then wifi is up and running again. This seems to cover at least the very basic use case of being able to connect to wifi, which on my device with my router (you never know with wifi stuff) works reliably from an os perspective. I haven't tried this whole workflow via the screen/matron, will do that tomorrow.

Now, unfortunately I can also reliably break the wifi driver by simply running nmcli radio wifi off, this results in the following stack trace

Mar 16 08:10:18 norns avahi-daemon[287]: Interface wlan0.IPv6 no longer relevant for mDNS.
Mar 16 08:10:18 norns kernel: ------------[ cut here ]------------
Mar 16 08:10:18 norns kernel: WARNING: CPU: 2 PID: 61 at net/wireless/sme.c:981 __cfg80211_disconnected+0x1bc/0x1d4 [cfg80211]
Mar 16 08:10:18 norns kernel: Modules linked in: 8192cu cfg80211 rfkill evdev rotary_encoder bcm2835_gpiomem uio_pdrv_genirq uio gpio_keys fixed ip_tables x_tables ipv6
Mar 16 08:10:18 norns kernel: CPU: 2 PID: 61 Comm: kworker/u8:1 Not tainted 4.14.52-13-g5b46619 #1
Mar 16 08:10:18 norns kernel: Hardware name: BCM2835
Mar 16 08:10:18 norns kernel: Workqueue: cfg80211 cfg80211_event_work [cfg80211]
Mar 16 08:10:18 norns kernel: [<8010ee7c>] (unwind_backtrace) from [<8010b6e4>] (show_stack+0x10/0x14)
Mar 16 08:10:18 norns kernel: [<8010b6e4>] (show_stack) from [<8066c298>] (dump_stack+0x88/0xa8)
Mar 16 08:10:18 norns kernel: [<8066c298>] (dump_stack) from [<80118dd4>] (__warn+0xe8/0x100)
Mar 16 08:10:18 norns kernel: [<80118dd4>] (__warn) from [<80118e9c>] (warn_slowpath_null+0x20/0x28)
Mar 16 08:10:18 norns kernel: [<80118e9c>] (warn_slowpath_null) from [<7f1b4560>] (__cfg80211_disconnected+0x1bc/0x1d4 [cfg80211])
Mar 16 08:10:18 norns kernel: [<7f1b4560>] (__cfg80211_disconnected [cfg80211]) from [<7f1922ac>] (cfg80211_process_wdev_events+0x134/0x13c [cfg80211])
Mar 16 08:10:18 norns kernel: [<7f1922ac>] (cfg80211_process_wdev_events [cfg80211]) from [<7f1922e0>] (cfg80211_process_rdev_events+0x2c/0x60 [cfg80211])
Mar 16 08:10:18 norns kernel: [<7f1922e0>] (cfg80211_process_rdev_events [cfg80211]) from [<7f18d164>] (cfg80211_event_work+0x14/0x1c [cfg80211])
Mar 16 08:10:18 norns kernel: [<7f18d164>] (cfg80211_event_work [cfg80211]) from [<8012f7e0>] (process_one_work+0x1d8/0x414)
Mar 16 08:10:18 norns kernel: [<8012f7e0>] (process_one_work) from [<8013052c>] (worker_thread+0x4c/0x598)
Mar 16 08:10:18 norns kernel: [<8013052c>] (worker_thread) from [<801352e4>] (kthread+0x14c/0x154)
Mar 16 08:10:18 norns kernel: [<801352e4>] (kthread) from [<80107bc8>] (ret_from_fork+0x14/0x2c)
Mar 16 08:10:18 norns kernel: ---[ end trace f30ffe7b67b6b1e0 ]---

@artfwo has tested the same thing and get the same behavior. After this the wifi nub/driver is no longer functional, no connection can be made. The main message related to this is from wpa_supplicant:

norns wpa_supplicant[376]: wlan0: Association request to the driver failed

I'm going to experiment with some combinations of actions to see if we can turn off wifi without issues and maybe try a trace of the wifi driver to see if I can figure out what's going wrong. At the moment I'm not sure we can solve this. If not we can either simply not have the option to turn the wifi off or we/I could try a different driver. Btw after a restart everything is back to working properly again as described above.

simonvanderveldt commented 5 years ago

Next update: A simple nmcli connection down is already a enough to get into a non-functional state. There's no kernel stack trace but the result is the same, no longer able to connect with the same message

wpa_supplicant[375]: wlan0: Association request to the driver failed

Initially I thought it was some incompatibility of the 8192cu driver with NetworkManager but I can't even get it to work when using wpa_supplicant directly. I.e.

# disable the NetworkManager service
# wpa_supplicant service
# configure `/etc/wpa_supplicant/wpa_supplicant.conf` with SSID and credentials
# Now connect using wpa_supplicant
sudo wpa_supplicant -B -i wlan0 -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf
# Get a lease
sudo dhclient -v wlan0
# Now we have a working connecting
# Kill wpa_supplicant
killall wpa_supplicant
# Try to connect again
sudo wpa_supplicant -B -i wlan0 -f /var/log/wpa_supplicant.log -c 
# And we get in the logs
cat /var/log/wpa_supplicant.log
...
wlan0: Association request to the driver failed
...

I think I'm going to call this part of the investigation a day and conclude that the 8192cu driver is a bunch of crap. We can keep using it but then we should remove the option to turn off the wifi from the menu.

I'll do a quick try of the other two drivers to see if they work better.

tehn commented 5 years ago

shame about the driver.

for our purposes (and general user experience) i'm concerned primarily with getting it turned on.

i agree that the OFF option can be removed.

reiterating here that if people don't want WIFI, remove the nub!

simonvanderveldt commented 5 years ago

Quick update, I tried out the rtl8xxxu driver (which needs CONFIG_RTL8XXXU_UNTESTED=y enabled in the kernel config because it's untested).

Unfortunately it doesn't seem to work. The firmware is there and gets loaded but it then throws an error that it failed to write some block (coming from this write_error function)

Mar 17 23:09:25 norns mtp-probe[231]: checking bus 1, device 3: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2"
Mar 17 23:09:25 norns mtp-probe[231]: bus: 1, device: 3 was not an MTP device
Mar 17 23:09:25 norns systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Mar 17 23:09:25 norns kernel: usb 1-1.2: Vendor: Realtek
Mar 17 23:09:25 norns kernel: usb 1-1.2: Product: 802.11n WLAN Adapter
Mar 17 23:09:25 norns kernel: usb 1-1.2: rtl8192cu_parse_efuse: dumping efuse (0x80 bytes):
Mar 17 23:09:25 norns kernel: usb 1-1.2: 00: 29 81 00 74 cd 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 08: ff 00 da 0b 76 81 03 41
Mar 17 23:09:25 norns kernel: usb 1-1.2: 10: 32 00 85 62 9e ad 40 a5
Mar 17 23:09:25 norns kernel: usb 1-1.2: 18: ef 0e e7 7e 0a 03 52 65
Mar 17 23:09:25 norns kernel: usb 1-1.2: 20: 61 6c 74 65 6b 00 16 03
Mar 17 23:09:25 norns kernel: usb 1-1.2: 28: 38 30 32 2e 31 31 6e 20
Mar 17 23:09:25 norns kernel: usb 1-1.2: 30: 57 4c 41 4e 20 41 64 61
Mar 17 23:09:25 norns kernel: usb 1-1.2: 38: 70 74 65 72 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 40: 00 00 00 00 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 48: 00 00 00 00 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 50: 01 00 00 00 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 58: 06 00 22 22 22 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 60: 22 22 22 00 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 68: 00 00 00 00 00 00 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 70: 00 00 00 00 00 0a 00 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: 78: 10 00 00 10 36 00 81 00
Mar 17 23:09:25 norns kernel: usb 1-1.2: RTL8188CU rev A (TSMC) 1T1R, TX queues 2, WiFi=1, BT=0, GPS=0, HI PA=0
Mar 17 23:09:25 norns kernel: usb 1-1.2: RTL8188CU MAC: 40:a5:ef:0e:e7:7e
Mar 17 23:09:25 norns kernel: usb 1-1.2: rtl8xxxu: Loading firmware rtlwifi/rtl8192cufw_TMSC.bin
Mar 17 23:09:25 norns kernel: usb 1-1.2: Firmware revision 80.0 (signature 0x88c1)
Mar 17 23:09:25 norns kernel: usb 1-1.2: rtl8xxxu_writeN: Failed to write block at addr: 1280 size: 0080
...
Mar 17 23:09:26 norns kernel: rtl8xxxu: probe of 1-1.2:1.0 failed with error -11
Mar 17 23:09:26 norns kernel: usbcore: registered new interface driver rtl8xxxu

After this whilst the driver is loaded no device is actually registered

I don't think we can expect any changes with a newer kernel either since the changes between the current 4.19 raspberry kernel and ours are pretty limited https://github.com/raspberrypi/linux/commits/d602de8e7e7fc25fb3a2112ce4285962f15aa549/drivers/net/wireless/realtek/rtl8xxxu, 4 commits that mainly fix typos.

So I guess the rtl8xxxu driver is out.

simonvanderveldt commented 5 years ago

OK, some good news. The rtl8192cu driver seems to work pretty well from the network management perspective. Just went through a cycle of enabling/disabling the wifi 10 times and it worked perfectly every time. Also did a quick hotspot test and that also worked first time + I could easily switch back and forth between client/station and hotspot mode :)

we@norns:~$ nmcli con
NAME              UUID                                  TYPE             DEVICE 
<my SSID>         59d5b2f9-0c60-4340-aa4a-07ab24f55c41  802-11-wireless  wlan0  
Hotspot           71ad3595-2448-4cf3-8d55-a4263c28ab3e  802-11-wireless  --     
we@norns:~$ nmcli con down 59d5b2f9-0c60-4340-aa4a-07ab24f55c41
Connection '<my SSID>' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/16)
we@norns:~$ nmcli con up 71ad3595-2448-4cf3-8d55-a4263c28ab3e
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/17)
we@norns:~$ nmcli con
NAME              UUID                                  TYPE             DEVICE 
Hotspot           71ad3595-2448-4cf3-8d55-a4263c28ab3e  802-11-wireless  wlan0  
<my SSID>         59d5b2f9-0c60-4340-aa4a-07ab24f55c41  802-11-wireless  --     

[edit] I also went through a plug/unplug cycle back and forth through all USB ports whilst norns was connected and in all cases the connection recovered within seconds of plugging in the nub.

Now there are still these old reports of issues with this driver https://github.com/raspberrypi/linux/pull/1525#issuecomment-225876599, not sure if they are still relevant and also not 100% sure how to determine if they are.

ngwese commented 5 years ago

Do you have a recipe/kernel build I can test w/ the rtl8192cu driver? If so I can probably test it over the next day or so and confirm your findings.

simonvanderveldt commented 5 years ago

Do you have a recipe/kernel build I can test w/ the rtl8192cu driver? If so I can probably test it over the next day or so and confirm you findings.

@ngwese If you want to try this: As long as you're on any of the 2.0 beta images/updates you should be good to test this, no need for a different kernel since the driver is already included in the current kernel. It's just blacklisted at the moment :)

So there's one change required, edit /etc/modprobe.d/blacklist-rtl8192cu.conf and replace blacklist rtl8192cu with blacklist 8192cu and then shutdown and start norns again (also make sure the norns wifi nub is in a USB port of course :wink: ) It should now be using this different driver for the wifi nub, check that rtl8192cu is loaded via lsmod:

we@norns:~$ lsmod
Module                  Size  Used by
arc4                   16384  2
rtl8192cu              86016  0
rtl_usb                20480  1 rtl8192cu
rtl8192c_common        61440  1 rtl8192cu
rtlwifi                90112  3 rtl_usb,rtl8192c_common,rtl8192cu
mac80211              421888  3 rtl_usb,rtlwifi,rtl8192cu
cfg80211              274432  2 mac80211,rtlwifi
rfkill                 24576  2 cfg80211
evdev                  24576  4
rotary_encoder         16384  0
bcm2835_gpiomem        16384  0
gpio_keys              20480  0
uio_pdrv_genirq        16384  0
uio                    16384  1 uio_pdrv_genirq
fixed                  16384  0
ip_tables              24576  0
x_tables               28672  1 ip_tables
ipv6                  401408  24

Now you should be able to connect either via system > wifi or using nmtui via SSH/serial console. It could be you first not to bring up the radio in NetworkManager via nmcli radio wifi on. Initially I'm just looking for feedback on normal usage where norns connects to your accesspoint/router. If that works well I'd like to check hotspot as well.

simonvanderveldt commented 5 years ago

Just ran a 4 hour test using the rtl8192cu driver doing a simple ping (not the best test ever, but alas) and no packet loss at all :) Quiet a bit of difference in the response times, but not sure if that's related to the wifi nub or wifi in general.

--- www.google.nl ping statistics ---
14634 packets transmitted, 14626 received, 0% packet loss, time 14653627ms
rtt min/avg/max/mdev = 9.015/17.005/643.364/21.181 ms
simonvanderveldt commented 5 years ago

I think we can close this for now. If issues pop up with the new driver we can create a new issue and determine where to go from there.