radxa / meta-radxa

OpenEmbedded BSP Layer for the Radxa boards
GNU Lesser General Public License v3.0

Critical wireless problems with the products utilizing AW-NB197SM module #7

Closed MuratUrsavas closed 2 years ago

MuratUrsavas commented 2 years ago

We're seeing a definite pattern on our RockPi 4B-based miners, which use the AW-NB197SM module. Local tests have shown that a firmware/driver crash happens, after which the module is disconnected from the system and does not come back. The system log of one such incident is attached.

aw-nb197sm-wifi-disconnection-20220320-00.log

The crash happens at the 1072nd second, about 8 minutes (496 seconds) after the device connected to the AP at the 576th second.

After making a wired connection to the device, the ip link command shows that the wlan0 device is down:

3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DORMANT group default qlen 1000
    link/ether XX:XX:1d:e3:XX:XX brd ff:ff:ff:ff:ff:ff
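
For anyone who wants to detect this condition from a script, here is a minimal sketch that parses the first line of an ip link entry like the one above. The function names and the regular expression are my own illustration, not part of any Radxa or iproute2 tooling, and the pattern only assumes the line layout shown in this report.

```python
import re

# Matches the first line of an `ip link` entry, for example:
# "3: wlan0: <BROADCAST,MULTICAST> mtu 1500 ... state DOWN mode DORMANT ..."
LINK_RE = re.compile(
    r"^\d+:\s+(?P<name>[^:@]+)(?:@[^:]+)?:\s+<(?P<flags>[^>]*)>"
    r".*?\bstate\s+(?P<state>\S+)"
)


def parse_link_line(line):
    """Return name, flags and state for one `ip link` header line, or None."""
    match = LINK_RE.match(line.strip())
    if match is None:
        return None
    flags = {flag for flag in match.group("flags").split(",") if flag}
    return {
        "name": match.group("name"),
        "flags": flags,
        "state": match.group("state"),
    }


def link_is_down(line):
    """True when the line reports state DOWN and the UP flag is absent."""
    info = parse_link_line(line)
    return info is not None and info["state"] == "DOWN" and "UP" not in info["flags"]


# The wlan0 line from this report.
sample = (
    "3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast "
    "state DOWN mode DORMANT group default qlen 1000"
)
print(parse_link_line(sample)["name"], link_is_down(sample))
```

Note that a wireless interface also reports DOWN when it is simply not associated, so this check alone cannot tell a firmware crash apart from an ordinary disconnect.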

Unfortunately, the module never recovers from this state, and the link stays down until the device is rebooted. None of the nmcli commands bring the link back up. Even if they did, it wouldn't solve anything, because our devices run unattended and a large portion of them rely solely on wireless networks.
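
Since only a reboot brings the link back, a host-side check is one possible stopgap for unattended devices. The sketch below is a workaround idea, not a fix for the crash. The helper names and the sample limit are hypothetical; the only real interface it relies on is the standard Linux sysfs attribute /sys/class/net/<iface>/operstate. It deliberately does not reboot anything itself; the caller decides what to do when the check fires.

```python
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel-reported operstate string, or None if it cannot be read.

    None covers the case where the interface has disappeared entirely,
    for example after a failed driver reload.
    """
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return None


def needs_reboot(samples, limit):
    """True when the most recent `limit` samples all show a link that is not up.

    `samples` is a list of operstate readings, oldest first.
    `limit` is a hypothetical threshold; tune it to your polling interval.
    """
    if limit <= 0 or len(samples) < limit:
        return False
    return all(state != "up" for state in samples[-limit:])


# One reading on the current machine; prints None if wlan0 does not exist.
print(read_operstate("wlan0"))
```

A caller would poll read_operstate at a fixed interval, append each reading to a list, and schedule a reboot once needs_reboot returns True. Requiring several consecutive failures avoids rebooting on a brief, ordinary reassociation.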

This is a pretty serious issue and we need all the help we can get very urgently.

Right now I'm trying to install the newest brcmfmac driver, but I haven't succeeded with the backports yet.

Here's the system info:

rock@rockpi-4b
--------------
OS: Debian GNU/Linux 10 (buster) aarch64
Host: ROCK PI 4B
Kernel: 4.4.154-116-rockchip-g86a614bc15b3
Uptime: 11 mins
Packages: 773 (dpkg)
Shell: bash 5.0.3
Terminal: /dev/pts/0
CPU: (6) @ 1.416GHz
Memory: 120MiB / 1926MiB

MuratUrsavas commented 2 years ago

I have applied the 20210315 release of the driver via the buster-backports repo, and it definitely solves the crash problem and keeps the WiFi up in my stress tests. The official images should include the latest brcmfmac driver.

MuratUrsavas commented 2 years ago

Unfortunately, the problem reappeared in a long-running test. The newer driver has improved the situation but has not solved it completely. This problem still needs further investigation.

shawaj commented 2 years ago

The comment below is copied from @MuratUrsavas in another communication:

Hello everyone,

I have an update for you. TL;DR: the driver and firmware have to be updated together to solve this issue. I was able to reproduce the problem reliably and verify the solution.

For the lost souls who would like to learn every detail, like me, here it is:

The problem happens on AzureWave AW-NB197SM modules. Neither the AP6212 nor the AP6256 is affected. The root cause is an issue in the firmware, probably a SIGSEGV inside it (we obviously can't know for sure), which causes the firmware to crash. Unfortunately, no watchdog was designed into the module, neither in hardware nor in software. Therefore the module gets stuck and none of the recovery attempts help. Reloading any of the drivers also fails and causes the wireless interfaces to be lost completely, because at startup the drivers can't talk to the module (as it has already crashed) and assume the hardware is absent or not working at all.

In my tests I used the official Radxa Debian image, which has the same kernel as balenaOS, 4.4. I was able to reproduce the problem consistently: sometimes in 8 minutes, sometimes in 40, but it always appeared. Since the module firmware was crashing, we lost all wireless connectivity, including Bluetooth.

The new firmware and driver package supplied by Infineon really helped and fixed the problem. The driver on its own improved iperf and ping performance but didn't solve the firmware crash issue. The firmware fixed the crash issue and also improved ping and iperf results a bit. The module is working fine now.

Here's the latest driver package from Infineon: https://community.infineon.com/t5/Wi-Fi-Bluetooth-for-Linux/Cypress-Linux-WiFi-Driver-Release-FMAC-2021-10-20/td-p/322639

And the latest firmware blob is attached.

I hope we can update the OS as fast as possible because the problem is quite severe.

Cheers

bcm43438-7.46.58.11.zip

shawaj commented 2 years ago

And another one from @MuratUrsavas :

Hi,

I'm in touch with Infineon and they are recommending (as of now) the driver and firmware set below:

https://community.infineon.com/t5/Wi-Fi-Bluetooth-for-Linux/Cypress-Linux-WiFi-Driver-Release-FMAC-2021-05-27/td-p/277394

You can find both an FMAC driver release and a firmware release pack on that page. For the module, they recommend the cyfmac43430-sdio.bin file.

Right now the OS is using /system/etc/firmware/fw_bcm43438a1.bin as the firmware file. Judging by the naming, the attached AzureWave file fits better than the file Infineon suggests. But Infineon is the developer and manufacturer of that chip, so we have to trust them. Please make sure you try both files and measure their performance before the release.

FYI, I've tested and approved the attached one.

Cheers

floion commented 2 years ago

Just a note that we have this fixed here: https://github.com/balena-os/balena-rockpi/pull/57

floion commented 2 years ago

@jack-ma you may still want to pull the changes from https://github.com/balena-os/balena-rockpi/pull/57 into your BSP, since people who don't use balenaOS will otherwise not have these improvements available.

RadxaYuntian commented 2 years ago

I'll take a look at this on Monday.

RadxaYuntian commented 2 years ago

So I incorporated the change from balena-os/balena-rockpi#57 in #23. The test image is available here. I have been pinging the router for more than 1000 s so far without issues on the NB197. I'll also test with the AP6256 soon, but @floion, can you take a look before I merge it into the main repo?

@MuratUrsavas @shawaj sorry for the late response. We generally don't take support questions on GitHub; we handle them via our forum, Discord channel, or email. That being said, our official Debian/Ubuntu image for the 4B has been updated to use the 5.10 kernel. Can you check if the latest image fixes this issue for you?

MuratUrsavas commented 2 years ago

@RadxaYuntian The 5.10 kernel is definitely good news. I will check the latest stock Debian image later and let you know.