lwfinger / rtw88

A backport of the Realtek Wifi 5 drivers from the wireless-next repo.
643 stars 183 forks source link

WiFi stops with "timed out to flush pci tx ring" in dmesg #102

Closed PPyshkin closed 2 years ago

PPyshkin commented 2 years ago

My system is LinuxMint 20.2. Laptop HP ProBook 455 G7 (1F3M4EA), WiFi chip is RTL8822CE. Time after time my WiFi stops and only reboot helps. When it happens I have WiFi connection in tray, but there is no real connection even with local router. When I try switch off WiFi in tray and then switch on I can not connect to router, error appears. Only after reboot WiFi works again.

This problem appeared very often with 5 GHz WiFi and Kernels before 5.17.0. In this combination WiFi stoped even when I visited speedtest site. It was impossible to pass the speed test. It looked like wifi card "choked" with huge data flow.

It was for 5.15... 5.16... Kernels, but after 5.17.0 situation became better. The stopping of WiFi appears once per day or per few days. My record is 10 days of stable work before WiFi stoped. Of course, I googled about this problem and tried to do everything. For example, disabled power management. I compiled new Kernels, now I have 6.0 Kernel. But the situation was the same. At last, I install these drivers. The result is 4 days of working before WiFi stopped.

Every time when WiFi stopped I see the similar thing when run dmesg command. For example, now, with new drivers: ....

[418785.149288] wlp2s0: authenticate with 6c:ba:b8:36:eb:06
[418785.175388] wlp2s0: Wrong control channel: center-freq: 5500 ht-cfreq: 5300 ht->primary_chan: 60 band: 1 - Disabling HT
[418785.689351] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 1/3)
[418785.693121] wlp2s0: authenticated
[418785.693527] wlp2s0: associating to AP 6c:ba:b8:36:eb:06 with corrupt beacon
[418785.707382] wlp2s0: associate with 6c:ba:b8:36:eb:06 (try 1/3)
[418785.710751] wlp2s0: RX AssocResp from 6c:ba:b8:36:eb:06 (capab=0x1511 status=0 aid=54)
[418785.711080] wlp2s0: associated
[418785.772616] wlp2s0: Limiting TX power to 30 (30 - 0) dBm as advertised by 6c:ba:b8:36:eb:06
[429446.183376] rtw_8822ce 0000:02:00.0: timed out to flush queue 1
[429484.935371] rtw_8822ce 0000:02:00.0: timed out to flush queue 1
[429485.087403] rtw_8822ce 0000:02:00.0: timed out to flush queue 1
[433161.119634] input: WI-C200 (AVRCP) as /devices/virtual/input/input51
[433208.039535] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435127.696255] rtw_8822ce 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xfebf3840 flags=0x0000]
[435137.009872] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.009977] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.010109] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.010208] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.339506] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.339622] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.339731] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435137.339873] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.339983] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.340087] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435137.675488] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.675604] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.675712] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435137.675842] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435137.675946] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435137.676045] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]

After I switch off WiFi in tray and then switch on, I can not connect to the router, and dmesg looks like:

[435350.431213] wlp2s0: deauthenticating from 6c:ba:b8:36:eb:06 by local choice (Reason: 3=DEAUTH_LEAVING)
[435350.431746] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435350.431850] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435350.431949] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435350.432092] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435350.432191] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435350.432290] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435350.452060] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435350.452173] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435350.452278] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435350.452502] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435350.452614] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435350.452723] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435350.491974] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]
[435350.492090] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[3]
[435350.492198] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[5]
[435357.342455] wlp2s0: authenticate with 6c:ba:b8:36:eb:06
[435357.844959] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 1/3)
[435358.991461] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 2/3)
[435359.951655] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 3/3)
[435360.943384] wlp2s0: authentication with 6c:ba:b8:36:eb:06 timed out
[435365.318355] wlp2s0: authenticate with 6c:ba:b8:36:eb:06
[435365.667480] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 1/3)
[435366.991474] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 2/3)
[435367.951474] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 3/3)
[435368.975402] wlp2s0: authentication with 6c:ba:b8:36:eb:06 timed out
[435373.258139] wlp2s0: authenticate with 6c:ba:b8:36:eb:06
[435373.599481] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 1/3)
[435374.927646] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 2/3)
[435375.983618] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 3/3)
[435376.943373] wlp2s0: authentication with 6c:ba:b8:36:eb:06 timed out
[435387.301545] wlp2s0: authenticate with 6c:ba:b8:36:eb:06
[435387.643476] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 1/3)
[435388.943462] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 2/3)
[435389.967492] wlp2s0: send auth to 6c:ba:b8:36:eb:06 (try 3/3)
[435390.959381] wlp2s0: authentication with 6c:ba:b8:36:eb:06 timed out

This is a newbie question, but Is it possible somehow to "increase tx buffer" or do something with this?
I am not alone with this problem of "timed out to flush" with Realtek chips. And it looks like this problem is permanent in Linux world. And this is very sad.

lwfinger commented 2 years ago

This is NOT a permanent problem. I have been running an RTW8852BE for nearly a week without any network drops, and barely a message in the logs. There have been many tests with a local speedtest, which show results higher than the speed of my broadband connection (500 Mbps). The local host is on a 1000 Mbps link.

A recurring problem has been laptops with a BIOS that does not properly support the PCIe standard. Whant make and model do you have? Some users have found it necessary to use one or both of the following mocule options for

parm: disable_msi:Set Y to disable MSI interrupt support (bool) parm: disable_aspm:Set Y to disable PCI ASPM support (bool)

The options should be applied to rtw88_pci if you use the kernel version, or rtw_pci if the code comes from this repo. The safe thing to do is create an options file that handles both. I will let you research how to create an options file.

Note: Turning off the wifi in the tray and turning it back on uses a utility called rfkill, and only affects the user-space part of the connection. It does exactly nothing to the driver, thus when you turn it back on, you get exactly what you had before. What you need to do is open a terminal and type the following two commands:

sudo modprobe -rv rtw_8822ce sudo modprobe -r rtw_8852ce

That may save you a reboot as it gives you a fresh copy of the driver. On the other hand, if the chip is in a state that is only cured by turning off the power, then a reboot will be necessary.

I have not run my RTW8822CE for some time, but I recently did a long-term test of the RTW8821CE, which has been more problematic, with little difficulty. Those tests led to a patch that did fix a problem.

Note that the message does not say that the chip "ran out of TX buffers". It says that the chip is not responding when it tries to flush the buffers. Quite a different thing if the message is to be believed.

PPyshkin commented 2 years ago

Thank you for your answer.

It will be interesting to try

parm: disable_msi:Set Y to disable MSI interrupt support (bool)
parm: disable_aspm:Set Y to disable PCI ASPM support (bool)

Please, write algorithm for linux newbies how to do this.

Maybe the main reason of the problem is not in lines "timed out to flush pci tx ring" but in the line: [435127.696255] rtw_8822ce 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xfebf3840 flags=0x0000]

This message always appears before fatal "timed out to flush pci tx ring" lines when wifi stops.

I am not alone with this problem. For example: 1) The message from bugzilla:

Hello,

I noticed a problem with my rtw_8822ce wifi recently, it drops connection randomly. Hard reboot is required. From NetworkManager it looks like it is connected, but there is no internet connection.

In dmesg I have: [ 521.400458] rtw_8822ce 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0009 address=0xfbc81414 flags=0x0000]

When I am reconnection with NetworkManager in dmesg I have lots of these [ 629.991505] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[1]

but from logs you can observe that it is in fact connected

Full log: [ 45.038243] rtw_8822ce 0000:01:00.0: enabling device (0000 -> 0003) [ 45.044230] rtw_8822ce 0000:01:00.0: Firmware version 9.9.4, H2C version 15 [ 45.044485] rtw_8822ce 0000:01:00.0: Firmware version 9.9.10, H2C version 15 [ 45.282040] rtw_8822ce 0000:01:00.0 wlp1s0: renamed from wlan0 [ 46.264046] rtw_8822ce 0000:01:00.0: start vif 28:cd:c4:6b:a5:37 on port 0 [ 46.265701] rtw_8822ce 0000:01:00.0: stop vif 28:cd:c4:6b:a5:37 on port 0 [ 46.493493] rtw_8822ce 0000:01:00.0: start vif de:2a:1e:61:0c:c0 on port 0 [ 50.739107] rtw_8822ce 0000:01:00.0: stop vif de:2a:1e:61:0c:c0 on port 0 [ 50.967006] rtw_8822ce 0000:01:00.0: start vif 28:cd:c4:6b:a5:37 on port 0 [ 50.971579] rtw_8822ce 0000:01:00.0: stop vif 28:cd:c4:6b:a5:37 on port 0 [ 51.205701] rtw_8822ce 0000:01:00.0: start vif a6:28:ed:35:f0:f0 on port 0 [ 55.285903] rtw_8822ce 0000:01:00.0: stop vif a6:28:ed:35:f0:f0 on port 0 [ 55.515710] rtw_8822ce 0000:01:00.0: start vif 28:cd:c4:6b:a5:37 on port 0 [ 60.163246] rtw_8822ce 0000:01:00.0: sta 5a:ef:68:a6:25:96 joined with macid 0 [ 521.400458] rtw_8822ce 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0009 address=0xfbc81414 flags=0x0000] [ 629.991505] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[1] [ 629.991610] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[3] [ 629.991746] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[1] [ 629.991844] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[3] [ 630.262274] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[1] [ 630.262383] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[3] [ 630.262481] rtw_8822ce 0000:01:00.0: timed out to flush pci tx ring[5]

( https://bugzilla.kernel.org/show_bug.cgi?id=214097 )

2) The message from other forum about "AMD-Vi: Event logged" with similar symptoms and again - realtek:

This happens whenever under medium load. It is very consistent. When it occurs, there is no network connectivity until I ifconfig eth0 down/up, dhclient eth0. Then it works until I put a load on the network card again by just pushing some files back and forth. Basic Internet usage doesn't seem to trigger it, but higher speed traffic to/from a machine on the network and poof, gone. The network card still appears in lspci, ifconfig insists it is working, but no traffic will move until I reset it. Dmesg line is:

[ 1140.466954] AMD-Vi: Event logged [IO_PAGE_FAULT device=28:00.0 domain=0x000b address=0x0000000001091000 flags=0x0050]

Device 28:00:00 is:

28:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Slackware 64 bit, current patch level, 4.4.88 kernel.

A bit of googling turned up some old complaints of this, having to do with the nic driver.

I'm not quite sure where to go from here, other than trying a different (newer or older) kernel. Advice welcome.

( https://www.linuxquestions.org/questions/slackware-14/amd-vi-event-logged-%5Bio_page_fault-ethernet-controller-4175615079/ )

3) Again the same problem with realtek:

So I recently upgraded the internals of my Ubuntu Server so that it's running a new motherboard, new processor, etc. And ever since then, I've been getting errors that look like this: Code: AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.0 domain 0x001d address=0x0000000000003000 flags=0x0050] Sometimes they don't appear to do anything, however sometimes after encountering this, the network stops working properly.

I've searched for answers regarding this, however most are either unresolved or involve hardware that isn't relevant here (such as the graphics card causing the fault, and issues starting the desktop environment). Running Code: sudo lspci -vnn | grep 02:00.0 -A9 confirms this and lists info about the ethernet controller: Code: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev. 0c) Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:7969] Flags: bus master, fast devsel, latency 0, IRQ 87 I/O ports at e000 [size=256] Memory at fe900000 (64-bit, non-prefetchable) [size=4K] Memory at d0800000 (64-bit, prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Capabilities: [70] Express Endpoint, MSI 01 Capabilities: [b0] MSI-X: Enable- Count=4 Masked- I've also read that there is probably an issue with IOMMU, however I have no way to control whether or not this is turned on in the BIOS (there is no option), and adding Code: iommu=pf or Code: iommu=soft doesn't have any effect on the problem. Is there anyone that could shed some light on this? The server is currently unusable, since the problem happens usually within 5 minutes of booting, and only a reboot can fix it.

( https://ubuntuforums.org/showthread.php?t=2290992 )


Sorry, maybe it is not good to flood with offtopic problems of other people. But I only want to say that this is not only my particular problem.

lwfinger commented 2 years ago

You could have conveyed the information with a link. Similarly, all the "435137.339873] rtw_8822ce 0000:02:00.0: timed out to flush pci tx ring[1]" messages after the "[435127.696255] rtw_8822ce 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xfebf3840 flags=0x0000]" caused me to miss the real error. An IO_PAGE_FAULT is a serious problem. I have been reading the linux wireless mailing list forever, and no one has ever posted this error.

Please read the last paragraph of README.md and file a bug report. Show the IO_PAGE_FAULT and that you only get the tx ring flush errors after that, and taht a reboot is needed. State where your driver comes from. If this repo, also give the commit hash for the most recent commit shown with 'git log'.

These are exactly the kinds of problems that a volunteer like me cannot answer as I have no knowledge of the internals of the chip. [

lwfinger commented 2 years ago

Before you post a bug, please do a pull and get the latest firmware. Somehow, I had missed an update to the rtw8822c firmware.