lwfinger / rtw88

A backport of the Realtek Wifi 5 drivers from the wireless-next repo.

Random full lock-ups on HP 250 G3 / rtw_8821ce, on all recent kernel versions #69

Open mdartmann opened 2 years ago

mdartmann commented 2 years ago

Hi,

when using your driver while connected to a wireless network, I get frequent freezes, to the point where not even SysRq works. I have experienced the issue since Linux 5.13, on both my self-compiled rc kernel and Arch's linux-zen package.

So far, I have not been able to detect a pattern in the freezes, except that they appear to be caused by the WiFi driver: they only occur while I am connected to a wireless network, and they do not occur when I use an external network card.

I want to help you fix this. Please let me know what logs I can provide. I will work on finding a pattern, and I will reply with a syslog attached as soon as I can reproduce the crash.

mdartmann commented 2 years ago

I managed to reproduce the crash. It occurred while my system was under heavy load and I opened a few tabs in a browser. Syslog:

Oct  5 12:43:38 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:39 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:40 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:40 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:40 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:41 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:41 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:42 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:42 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:43 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:44 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1
Oct  5 12:43:48 finja kernel: rtw_8821ce 0000:02:00.0: timed out to flush queue 1

... and then it freezes. This is on 5.15-rc3. I am building rc4 as I'm writing this and will try to reproduce it again.

lwfinger commented 2 years ago

Those "timed out to flush queue" messages are fairly common, but I am not aware of them ever causing a crash. Was the caps-lock light flashing, or did the system just become unresponsive? If the latter, is there a CPU running full blast? A loud system fan would be the clue. Are you running the kernel version of the driver, or the one from this repo? The "rtw_8821ce" entry in the logged messages suggests that you are using this one. If so, why? This one is just copied from the kernel, but it may have mistakes.

Ideally, I would like to see whatever is logged at the time of the crash. If you are getting a fatal BUG, that gets difficult. One way would be to open a terminal and run 'sudo dmesg -tw'. Leave that window open while you do web browsing, etc. The problem is sizing the terminal and browser such that both are visible at all times. When the crash happens, something should be output to the terminal. Photograph the screen messages and post that photo.
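
If photographing the screen is impractical, a possible alternative is to write each kernel message to disk as soon as it arrives, so the last lines before a freeze may still be readable after a reboot. The following is only a minimal sketch: the function name `persist_lines` and the log path are hypothetical, and a hard lock-up may still prevent the final write from reaching the disk.

```python
import os


def persist_lines(lines, path):
    """Append each line to `path` and flush it to disk before reading the next.

    `lines` is any iterable of strings, for example the stdout of a
    running `dmesg -tw` process. Returns the number of lines written.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            # Make sure every message ends with exactly one newline.
            log.write(line if line.endswith("\n") else line + "\n")
            # Push Python's buffer to the OS, then ask the OS to commit it.
            log.flush()
            os.fsync(log.fileno())
            count += 1
    return count
```

The function could be fed from the command above, for example by passing it the `stdout` of `subprocess.Popen(["dmesg", "-tw"], stdout=subprocess.PIPE, text=True)` in a script run with sufficient privileges. Calling `os.fsync` per line is slow, but the message rate here is low and durability matters more than throughput.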

mdartmann commented 2 years ago

The caps lock indicator is not flashing and the CPU is running at full blast (which isn't a lot, it's just an i3-7020u).

I use the version straight from the kernel. The log comes from syslog-ng rather than journald, which might explain why it looks different.

I will try running dmesg and get back to you. Thank you for the quick response!

dubhater commented 2 years ago

I have the same problem: https://marc.info/?l=linux-wireless&m=162058892007833&w=4

hidjgr commented 2 years ago

I have a similar problem, also on 5.13: my computer freezes within a few minutes of being unplugged. Here is the output of journalctl -xe -p3 -b-1:

Oct 11 00:03:33 hdshp kernel: rtw_8821ce 0000:03:00.0: failed to send h2c command

mdartmann commented 2 years ago

Update: I wasn't able to get any additional dmesg output. Let me know if there are any more verbose output modes I can enable somewhere.

dubhater commented 2 years ago

The module rtw_core has a parameter called debug_mask. It's a bit field where the low 16 bits enable various debugging messages. I tried it with the value 65535 to enable all the messages. This made the driver print lots and lots of extra messages, but the system didn't lock up anymore.

I don't know how useful those messages may be if your system does lock up.

lwfinger commented 2 years ago

I do not know either. There is likely a race condition that is avoided because your system is kept busy logging all that debug info. You might try with various other numbers. The ones that likely will work would be 1, 2, 4, 6, or 7. Values 1, 2, and 4 invoke PCI, TX, and RX debugging respectively. I would probably start with a value of 4. I have no idea how much log output any of those would provide.
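
To make the suggested values easier to read, here is a small sketch of how they combine as a bit field. Only the meanings of bits 1, 2, and 4 (PCI, TX, RX) and the value 65535 come from this thread; the class and function names are illustrative and are not taken from the driver source.

```python
from enum import IntFlag


class DebugMask(IntFlag):
    """Illustrative names for the three debug_mask bits described above."""
    PCI = 1
    TX = 2
    RX = 4


# All of the low 16 bits set, i.e. the value 65535 tried earlier.
ALL_LOW_16_BITS = (1 << 16) - 1


def describe(mask):
    """Return the names of the known debug categories enabled in `mask`."""
    return [flag.name for flag in DebugMask if mask & flag]


if __name__ == "__main__":
    for value in (1, 2, 4, 6, 7):
        print(value, describe(value))
```

So 6 would enable TX and RX debugging together, and 7 would add PCI. Assuming the usual module-parameter syntax, a value such as 4 would be passed when loading the module, e.g. `modprobe rtw_core debug_mask=4`.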

NBoumakis commented 2 years ago

I had the same problem with the default driver (rtw_8821ce) provided by the kernel 5.13.0-20. The laptop is an HP 15-da1018nv. Looked around and found a bug report on the kernel bug tracker. Worked around it by disabling Power Management and setting Power Management default to Off.