linux-surface / kernel

Linux kernel with modifications for Microsoft Surface devices.
Other
124 stars 34 forks source link

WiFi Power-Management/Firmware issue (mwifiex) #20

Closed spokV closed 3 years ago

spokV commented 4 years ago

I'm running latest release (5.3) and after x minutes getting:

[ 6828.584671] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584687] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6828.584755] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584763] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587608] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587623] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587698] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587706] mwifiex_pcie 0000:02:00.0: failed to get signal information

only reboot solve the wifi crash for me.... I'm running sp 2017 ubuntu 19.10

GrayHatter commented 4 years ago

Retracted, it's the Surface Pro... I swear, I can read....

qzed commented 4 years ago

As far as I know. pretty much every device is affected by this in some way, except for the 7th generation models which have Intel WiFi chips. It seems to be some sort of (runtime) power-management/suspend issue.

I think it has been discussed on jakeday/linux-surface a couple of times by now. So maybe let's keep this issue to try and collect all that information.

Links

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109681 Ubuntu Bugs: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730924

Other issues:

(will update this with more links to relevant jakeday/linux-surface issues)

Possible Work-Arounds

kitakar5525 commented 4 years ago

(Copy from https://gitter.im/linux-surface/support?at=5de7a513f65fec1c8ebe1b00)

Regarding wifi dropouts while using or idle (not after suspend): SP2017 owner reported on Gitter linux-surface/community room that disabling L1 substates (IIRC) fixed the issue. (Disable L1 substates for all ASPM capable devices: echo powersave | sudo tee /sys/module/pcie_aspm/parameters/policy) (I don't know an easy way to disable the substates only for wifi device) For my Surface 3 (non-pro), disabling L1 states (disabling only substates is not enough) for wifi device is needed (jakeday/linux-surface#456)

Is there any influence on power consumption doing that?

Unfortunately, yes. Bad impact on power consumption when idle or suspend (s2idle). (not sure when using) Especially, if you disable ASPM L1 state, it will break S0ix on suspend (you can check the residency: sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec).

You might need a patch from sebanc (https://github.com/jakeday/linux-surface/issues/420#issuecomment-544615371) to achieve S0ix states on suspend when you disable L1 state. At least it's the case on my Surface 3.

kitakar5525 commented 4 years ago

You might need a patch from sebanc (https://github.com/jakeday/linux-surface/issues/420#issuecomment-544615371) to achieve S0ix states on suspend when you disable L1 state. At least it's the case on my Surface 3.

Sorry but I may have said incorrect thing. (It seems to be true only on my Chromium OS 4.19 kernel… (The kernel may have other power management patches). Sorry to make everyone confused.) (On Arch Linux kernel (both 5.4 and 4.19) + Surface patches, I need to re-enable L1 state before suspend to achieve S0ix even with the patch from sebanc.)

If you are going to disable L1 state when you use your device, re-enabling the L1 state before suspend may be the best idea to achieve S0ix again. And disable L1 state again after suspend.

qzed commented 4 years ago

Thanks! I yet have to find the time to look into that

spokV commented 4 years ago

~You might need a patch from sebanc (jakeday/linux-surface#420 (comment)) to achieve S0ix states on suspend when you disable L1 state. At least it's the case on my Surface 3.~

Sorry but I may have said incorrect thing. (It seems to be true only on my Chromium OS 4.19 kernel… (The kernel may have other power management patches). Sorry to make everyone confused.) (On Arch Linux kernel (both 5.4 and 4.19) + Surface patches, I need to re-enable L1 state before suspend to achieve S0ix even with the patch from sebanc.)

If you are going to disable L1 state when you use your device, re-enabling the L1 state before suspend may be the best idea to achieve S0ix again. And disable L1 state again after suspend.

How do i re-enable it again?

kitakar5525 commented 4 years ago

On the script which you used to disable ASPM (ASPM_SETTING=0 means you disabled both L0s and L1), set ASPM_SETTING=3 instead to enable L0s and L1.

You can check ASPM state:

# 11ab:2b38 is marvell 88W8897 wifi controller

# ASPM disabled for both L0s and L1
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+

# ASPM disabled for L1 (L0s still enabled)
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- CommClk+

# ASPM enabled for both L0s and L1
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
spokV commented 4 years ago

On the script which you used to disable ASPM (ASPM_SETTING=0 means you disabled both L0s and L1), set ASPM_SETTING=3 instead to enable L0s and L1.

You can check ASPM state:

# 11ab:2b38 is marvell 88W8897 wifi controller

# ASPM disabled for both L0s and L1
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+

# ASPM disabled for L1 (L0s still enabled)
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- CommClk+

# ASPM enabled for both L0s and L1
$ sudo lspci -nnvvv -d 11ab:2b38 | grep "LnkCtl:"
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+

just need to call it from the sleep script?

kitakar5525 commented 4 years ago

Yes. I think calling from the sleep script is the simplest way.

spokV commented 4 years ago

Yes. I think calling from the sleep script is the simplest way.

I added the script for both pre and post cases in the sleep script. For the first time for a full half day i didn't have any wifi dropouts or crashes. But, i do fill the battery drain is faster in non suspend state. is it logical or i'm imagining things?

spokV commented 4 years ago

After a night in suspend i lost about 14% of battery and got no wifi. here is the 'dmesg' output: [24860.195812] PM: suspend entry (s2idle) [24860.199554] Filesystems sync: 0.003 seconds [24860.199832] Freezing user space processes ... (elapsed 0.001 seconds) done. [24860.201432] OOM killer disabled. [24860.201433] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [24860.202585] printk: Suspending console(s) (use no_console_suspend to debug) [24860.316810] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: mei_cldev_send() error 0x4:-19 [24860.316812] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: mei_cldev_send() error 0x7:-19 [24860.317963] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: error in reading m2h msg [24860.317971] IPTS removed [24882.435826] pci 0000:01:00.0: Refused to change power state, currently in D3 [24886.811333] pci 0000:01:00.0: Refused to change power state, currently in D3 [24886.833612] acpi device:47: Cannot transition to power state D0 for parent in D3hot [24886.835076] [drm] HuC: Loaded firmware i915/kbl_huc_ver02_00_1810.bin (version 2.0) [24886.839261] [drm] GuC: Loaded firmware i915/kbl_guc_ver9_39.bin (version 9.39) [24886.839454] i915 0000:00:02.0: GuC firmware version 9.39 [24886.839460] i915 0000:00:02.0: GuC submission enabled [24886.839465] i915 0000:00:02.0: HuC enabled [24887.964804] probing Intel Precise Touch & Stylus [24887.964806] IPTS using DMA_BIT_MASK(64) [24887.964832] OOM killer enabled. [24887.964833] Restarting tasks ... done. [24887.981139] usb 1-7: USB disconnect, device number 15 [24887.988924] PM: suspend exit [24887.997206] input: ipts 045E:001F as /devices/pci0000:00/0000:00:16.4/0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04/0044:045E:001F.001B/input/input243 [24888.099481] input: ipts 045E:001F as /devices/pci0000:00/0000:00:16.4/0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04/0044:045E:001F.001B/input/input244 [24888.159579] input: ipts 045E:001F UNKNOWN as /devices/pci0000:00/0000:00:16.4/0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04/0044:045E:001F.001B/input/input248 [24888.159663] hid-multitouch 0044:045E:001F.001B: input,hidraw0: HID v3ad00.00 Mouse [ipts 045E:001F] on heci3 [24888.253808] debugfs: File 'le_min_key_size' in directory 'hci0' already present! [24888.253811] debugfs: File 'le_max_key_size' in directory 'hci0' already present! [24888.253812] debugfs: File 'force_bredr_smp' in directory 'hci0' already present! [24888.281681] cfg80211: Loading compiled-in X.509 certificates for regulatory database [24888.281909] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7' [24888.303954] mwifiex_pcie: PCI memory map Virt0: ffffb930c3100000 PCI memory map Virt2: ffffb930c4100000 [24888.305873] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld [24888.305876] mwifiex_pcie 0000:01:00.0: WLAN FW is active [24889.423581] usb 1-7: new full-speed USB device number 16 using xhci_hcd [24889.572787] usb 1-7: New USB device found, idVendor=045e, idProduct=09c0, bcdDevice= 0.07 [24889.572793] usb 1-7: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [24889.572796] usb 1-7: Product: Surface Type Cover [24889.572799] usb 1-7: Manufacturer: Microsoft [24889.581344] input: Microsoft Surface Type Cover Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:09C0.001C/input/input249 [24889.640435] input: Microsoft Surface Type Cover Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:09C0.001C/input/input251 [24889.640814] input: Microsoft Surface Type Cover Touchpad as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:09C0.001C/input/input253 [24889.641479] hid-multitouch 0003:045E:09C0.001C: input,hiddev0,hidraw1: USB HID v1.11 Keyboard [Microsoft Surface Type Cover] on usb-0000:00:14.0-7/input0 [24891.487742] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS. [24898.359744] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xfa, act = 0x2000 [24898.359788] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 [24898.359794] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 [24898.359800] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 [24898.359805] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 [24898.359810] mwifiex_pcie 0000:01:00.0: last_cmd_index = 1 [24898.359816] mwifiex_pcie 0000:01:00.0: last_cmd_id: 00 00 fa 00 00 00 00 00 00 00 [24898.359821] mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 00 20 00 00 00 00 00 00 [24898.359827] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 [24898.359832] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 00 00 00 00 00 00 00 00 00 00 [24898.359837] mwifiex_pcie 0000:01:00.0: last_event_index = 0 [24898.359846] mwifiex_pcie 0000:01:00.0: last_event: 00 00 00 00 00 00 00 00 00 00 [24898.359853] mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1 [24898.359863] mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 [24898.360062] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device

kitakar5525 commented 4 years ago

mwifiex crash while idle or even using

But, i do fill the battery drain is faster in non suspend state. is it logical or i'm imagining things?

Your feeling about battery consumption increase while non-suspend state may be true. I'm not sure at least on my Surface 3, though. You may want to disable only L1 state (ASPM_SETTING=1)

It's just a workaround and we should now really fix this issue with ASPM L1 enabled. But sadly, it may be difficult for us non-marvell personnel…

Interestingly, I use Surface Book 1 and Surface 3 for daily usage but I have this issue only on Surface 3 with the same wifi controller (11ab:2b38). This fact may be a hint.

mwifiex crash after suspend

After a night in suspend i lost about 14% of battery and got no wifi. here is the 'dmesg' output:

If this issue will occur even with the latest sleep script, try the patch from sebanc.

I can reproduce this issue on SB1 after second suspend with non-patched distribution kernels and not using sleep script. I cannot reproduce this issue on Surface 3.

Recovery after mwifiex crash is broken

Those two issues might have a different cause. But at least the result is the same; after the output mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device, workqueue mwifiex_pcie_work will hang (INFO: task kworker/0:0:24857 blocked for more than 122 seconds.) and result in some commands to freeze (e.g. lspci, modprobe not responding and even we cannot poweroff PC cleanly).

At least we may be able to fix this workqueue hang after mwifiex crash.

Old journal log on Surface 3

Old journal log on Surface 3 with non-patched Arch Linux 5.3.11-arch1-1 kernel and ASPM L1 state enabled. (mwifiex crash log after overnight idle, then mwifiex_pcie_work hung)

Click to expand

```bash # Boot Nov 14 20:00:57 surface3 kernel: Linux version 5.3.11-arch1-1 (linux@archlinux) (gcc version 9.2.0 (GCC)) #1 SMP PREEMPT Tue, 12 Nov 2019 22:19:48 +0000 # mwifiex crashed Nov 16 03:22:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xa4, act = 0x0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_index = 1 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 a4 00 16 00 a4 00 7f 00 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 00 00 00 00 00 00 00 00 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 7f 80 16 80 a4 80 7f 80 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_event_index = 1 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_event: 58 00 58 00 58 00 58 00 58 00 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: data_sent=1 cmd_sent=1 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump start=== Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: PCIE register dump start Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: pcie scratch register: Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00 reg:0xcf8, value=0xca410a reg:0xcfc, value=0x2cfcf00 Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: PCIE register dump end Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end=== Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start == Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:28 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:33 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:33 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:33 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:33 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:36 surface3 kernel: mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state Nov 16 03:22:36 surface3 kernel: mwifiex_pcie 0000:01:00.0: scan failed: -14 Nov 16 03:22:39 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:39 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:39 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:39 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:45 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:45 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:45 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:45 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:51 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:51 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:51 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:51 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:57 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:57 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:22:57 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:22:57 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:03 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:03 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:03 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:03 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:09 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:09 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:09 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:09 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:15 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:15 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:15 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:15 surface3 kernel: mwifiex_pcie 0000:01:00.0: failed to get signal information Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: Firmware dump Finished! Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end == Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump start Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump end Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: deleting the crypto keys Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: rx_pending=0, tx_pending=1, cmd_pending=0 Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:16 surface3 kernel: mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed Nov 16 03:23:17 surface3 kernel: mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld Nov 16 03:23:17 surface3 kernel: mwifiex_pcie 0000:01:00.0: WLAN FW is active Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xfa, act = 0xa000 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_index = 2 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 a4 00 fa 00 a4 00 7f 00 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 00 00 00 a0 00 00 00 00 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 7f 80 16 80 a4 80 7f 80 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_event_index = 1 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: last_event: 58 00 58 00 58 00 58 00 58 00 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 Nov 16 03:23:27 surface3 kernel: mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device # mwifiex_pcie_work hung Nov 16 03:27:04 surface3 kernel: INFO: task kworker/0:0:24857 blocked for more than 122 seconds. Nov 16 03:27:04 surface3 kernel: Tainted: G W OE 5.3.11-arch1-1 #1 Nov 16 03:27:04 surface3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 16 03:27:04 surface3 kernel: kworker/0:0 D 0 24857 2 0x80004000 Nov 16 03:27:04 surface3 kernel: Workqueue: events mwifiex_pcie_work [mwifiex_pcie] Nov 16 03:27:04 surface3 kernel: Call Trace: Nov 16 03:27:04 surface3 kernel: ? __schedule+0x27f/0x6d0 Nov 16 03:27:04 surface3 kernel: schedule+0x43/0xd0 Nov 16 03:27:04 surface3 kernel: schedule_timeout+0x299/0x3d0 Nov 16 03:27:04 surface3 kernel: ? __switch_to_asm+0x40/0x70 Nov 16 03:27:04 surface3 kernel: wait_for_common+0xeb/0x190 Nov 16 03:27:04 surface3 kernel: ? wake_up_q+0x60/0x60 Nov 16 03:27:04 surface3 kernel: __flush_work+0x130/0x1e0 Nov 16 03:27:04 surface3 kernel: ? flush_workqueue_prep_pwqs+0x130/0x130 Nov 16 03:27:04 surface3 kernel: __cancel_work_timer+0x123/0x1b0 Nov 16 03:27:04 surface3 kernel: mwifiex_cleanup_pcie+0x28/0xd0 [mwifiex_pcie] Nov 16 03:27:04 surface3 kernel: mwifiex_free_adapter+0x24/0xe0 [mwifiex] Nov 16 03:27:04 surface3 kernel: _mwifiex_fw_dpc+0x28d/0x520 [mwifiex] Nov 16 03:27:04 surface3 kernel: mwifiex_reinit_sw+0x15d/0x2c0 [mwifiex] Nov 16 03:27:04 surface3 kernel: mwifiex_pcie_reset_done+0x50/0x80 [mwifiex_pcie] Nov 16 03:27:04 surface3 kernel: pci_try_reset_function+0x38/0x70 Nov 16 03:27:04 surface3 kernel: process_one_work+0x1d1/0x3a0 Nov 16 03:27:04 surface3 kernel: worker_thread+0x4a/0x3d0 Nov 16 03:27:04 surface3 kernel: kthread+0xfb/0x130 Nov 16 03:27:04 surface3 kernel: ? process_one_work+0x3a0/0x3a0 Nov 16 03:27:04 surface3 kernel: ? kthread_park+0x80/0x80 Nov 16 03:27:04 surface3 kernel: ret_from_fork+0x35/0x40 ```

spokV commented 4 years ago

I'm a bit confused, what is a reasonable battery drain in suspend using the latest qzed 5.3.15 kernel? (original patches)

kitakar5525 commented 4 years ago

For battery drain while suspend, have a look at my issue: https://github.com/jakeday/linux-surface/issues/554

Especially, on the issue, a SP5 owner reported around 1.5% per hour drain while achieving S0ix states (https://github.com/jakeday/linux-surface/issues/554#issuecomment-537835794). It still sounds too high for S0ix. But I think this is not necessarily Surface devices issue but all Intel Core CPU series devices issue with S0ix on Linux.

For reference, on my Surface 3 (Intel Atom Cherry Trail), it drains around 0.15 % per hour while suspend with S0ix.

spokV commented 4 years ago

I added to the updated sleep script:

pre) # unload the modules before going to sleep /lib/systemd/system-sleep/set_wireless_aspm.sh 3

post) # re-load modules after resume /lib/systemd/system-sleep/set_wireless_aspm.sh 1

I use the set_wireless_aspm script (same as the disable_wireless_aspm.sh with single change of getting state as argument). enabled L0s while not in suspend. till now had no issues with wifi dropouts while working. will continue testing battery drain too.

qzed commented 4 years ago

I've decided to revert the current WiFi patch and include the one from @sebanc instead (on 5.3 and 5.4 for now, I think 4.19 should follow if we don't notice any significant drawbacks, being an LTS kernel and all). This seems a lot more stable to me than the sleep script (which for me on the SB2 had the occasional hick-up). I've only tested the patch on 5.4 on my SB2 so please let me know if there are any issues. (Note that this has not made its way into the pre-built releases yet, I'll update them sometime soon though.)

kitakar5525 commented 4 years ago

With the patch from @sebanc for several days, wifi will still work after several suspend and I don't notice any critical side effects on the following combinations:

But currently, I also still apply current wifi patch, too. I haven't yet well tested the patch from sebanc without applying the current wifi patch. I expect no problem, though.

I'll try the patch from sebanc with reverting current wifi patch, then I'll see what will happen today.

qzed commented 4 years ago

@kitakar5525 Thanks!

kitakar5525 commented 4 years ago

Forgot to mention, I'm using the patch from sebanc with reverting patch to sta_cmd.c because it seems to be not necessary (https://github.com/kitakar5525/linux-surface-patches/tree/master/patch-5.4/0003-mwifiex-Wi-Fi-drops-after-suspend-%23420-s0ix-test)

and with some small modifications (adding __maybe_unused to old mwifiex_pcie_suspend/mwifiex_pcie_resume, moving mwifiex_pcie_suspend2 definition above mwifiex_pcie_resume2)

spokV commented 4 years ago

I've decided to revert the current WiFi patch and include the one from @sebanc instead (on 5.3 and 5.4 for now, I think 4.19 should follow if we don't notice any significant drawbacks, being an LTS kernel and all). This seems a lot more stable to me than the sleep script (which for me on the SB2 had the occasional hick-up). I've only tested the patch on 5.4 on my SB2 so please let me know if there are any issues. (Note that this has not made its way into the pre-built releases yet, I'll update them sometime soon though.)

So does it mean that we need to remove the sleep script and the current wifi.patch and apply @sebanc patch instead?

kitakar5525 commented 4 years ago

@qzed After trying the patch from sebanc with reverting current wifi patch today on 5.4 and 4.19:

# https://github.com/qzed/linux-surface/blob/9e132617b877f6dbc8c042e493c2d6ed5ddf45ea/patches/5.4/0007-wifi.patch#L35-L50
diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index d89684168500..1545bae9d6cf 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -437,7 +437,10 @@ mwifiex_cfg80211_set_power_mgmt(struct wiphy *wiphy,
         mwifiex_dbg(priv->adapter, INFO,
                 "info: ignore timeout value for IEEE Power Save\n");

-    ps_mode = enabled;
+    //ps_mode = enabled;
+
+    mwifiex_dbg(priv->adapter, INFO, "overriding ps_mode to false\n");
+    ps_mode = 0;

     return mwifiex_drv_set_power(priv, &ps_mode);
 }
# https://github.com/qzed/linux-surface/blob/9e132617b877f6dbc8c042e493c2d6ed5ddf45ea/patches/5.4/0007-wifi.patch#L67-L75
diff --git a/drivers/net/wireless/marvell/mwifiex/cmdevt.c b/drivers/net/wireless/marvell/mwifiex/cmdevt.c
index e8788c35a453..82d25b3ca914 100644
--- a/drivers/net/wireless/marvell/mwifiex/cmdevt.c
+++ b/drivers/net/wireless/marvell/mwifiex/cmdevt.c
@@ -1011,11 +1012,11 @@ mwifiex_cmd_timeout_func(struct timer_list *t)
         return;
     }

-    if (adapter->if_ops.device_dump)
-        adapter->if_ops.device_dump(adapter);
+    //if (adapter->if_ops.device_dump)
+    //    adapter->if_ops.device_dump(adapter);

-    if (adapter->if_ops.card_reset)
-        adapter->if_ops.card_reset(adapter);
+    //if (adapter->if_ops.card_reset)
+    //    adapter->if_ops.card_reset(adapter);
 }

So, reverting all the current wifi patches might not cause any critical issues, at least for me.

kitakar5525 commented 4 years ago

@spokV You may still need bluetooth part to achieve s0ix.

spokV commented 4 years ago

@spokV You may still need bluetooth part to achieve s0ix.

Ok. So need to leave sleep script with Bluetooth on/off....

spokV commented 4 years ago

I’m running 5.4.5 with @sebanc patch only and no issues with wifi/suspend. Will resume sleep script for bt and check power consumption in suspend

kitakar5525 commented 4 years ago

I found interesting commits on chromiumos kernels. I don't have much time available this weekend. If anyone feels it's interesting, please try them.

card reset after wifi crash

Chromebook Kevin (Samsung Chromebook Plus, aarch64 RK3399 [1]) uses marvell-pcie8997. (Our card is pcie8897, but maybe similar)

and for the pcie8997 card, to reset properly after crash, they added double-reset quirk:

wifi and bluetooth coexisting

Chromebook Acer Oak (Acer Chromebook R13, aarch64 MT8173 [1]) uses marvell_sd8787 [2] (not pcie, but might be helpful)

and for the Chromebook, timeshare_coex is set to default for A2DP audio stability:

timeshare_coex mode is already available after normal Linux 4.4 and we can switch to the mode (not enabled by default):

Go back to spacial coexistence mode echo 0 | sudo tee /sys/kernel/debug/mwifiex/mlan0/timeshare_coex



Changing to timeshare_coex mode may help people who have wifi speed issue? not sure but worth trying.

References:
- [1] https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices
- [2] https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master/baseboard-oak/chromeos-base/chromeos-bsp-baseboard-oak/chromeos-bsp-baseboard-oak-0.0.1.ebuild#20
kitakar5525 commented 4 years ago

You can reproduce the card reset failure by

echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

and watch dmesg output.

kern  :info  : [37097.529244] mwifiex_pcie 0000:03:00.0: Resetting per request
kern  :info  : [37097.539387] mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]: reason code 3
kern  :info  : [37097.539429] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539432] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539433] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539434] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539435] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539436] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539437] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539438] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539439] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539440] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539441] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.539442] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37097.539479] mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex...
kern  :info  : [37097.541675] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37097.541681] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
kern  :info  : [37098.813217] mwifiex_pcie 0000:03:00.0: WLAN FW already running! Skip FW dnld
kern  :info  : [37098.813220] mwifiex_pcie 0000:03:00.0: WLAN FW is active
kern  :info  : [37098.858964] mwifiex_pcie 0000:03:00.0: Unknown api_id: 4
kern  :info  : [37098.879858] mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) 
kern  :info  : [37098.879863] mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
kern  :info  : [37105.111037] mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID]
kern  :info  : [37105.230939] mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully
kern  :info  : [37127.611403] mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
kern  :info  : [37137.375511] mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]: reason code 15
kern  :info  : [37139.771169] mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
kern  :info  : [37139.771178] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37151.930965] mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
kern  :info  : [37151.930973] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
kern  :info  : [37164.090743] mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
kern  :info  : [37164.090753] mwifiex_pcie 0000:03:00.0: deleting the crypto keys
[...]

This is also related to lockup I mentioned before, but I feel it will not cause lockup now. Not sure yet.

spokV commented 4 years ago

Wow! Left my surface pro 2017 for 6 hours in suspend mode. Got only 5% battery reduction. I’m using @sebanc wifi patch and sleep script with only BT toggling.

spokV commented 4 years ago

After few days i got this crush when exiting suspend: wifi_crush.txt

qzed commented 4 years ago

@kitakar5525 Thanks for your notes. It's awesome to have that all written down. I'd like to keep the power-save option in the config file for now, but it's good to know that we have an alternative.

Also thanks for your links, they look quite interesting. I agree that the cards are probably pretty similar, so I'll definitely try to play around with the double-reset and timeshare_coex when I encounter those issues again.

kitakar5525 commented 4 years ago

When reverting the current wifi patch, need to inform users who don't use NetworkManager that you may need to disable wifi power_save manually:

# run command like this to disable wifi powersave after startup/suspend
sudo iw dev mlan0 set power_save off

or create udev rule (create a file like /etc/udev/rules.d/81-marvell-wifi-powersave-off.rules with following contents):

# https://wiki.archlinux.org/index.php/Power_management#Network_interfaces
ACTION=="add", SUBSYSTEM=="net", KERNEL=="mlan*", RUN+="/usr/bin/iw dev $name set power_save off"

(Assuming marvell wifi device name is mlan0)

kitakar5525 commented 4 years ago

Regarding double-reset quirk, it did not work for SB1 and Surface 3 just as is… just leaving what I did here.

kitakar5525 commented 4 years ago

I tried a quirk that adds power-cycling wifi and its root port (D3cold/D0) before pcie_flr().

The reset function is working now for SB1. I think this quirk will work for SP4/SB1 generation or later. But unfortunately, not working for Surface 3. Maybe not working also for SP3 or former.

Just leaving what I did here. I really hope to fix this issue on mwifiex driver side and for Surface 3.

0001-PCI-marvell_powercycle_then_flr-quirk.patch

```diff From a5526bbfffe146d7297fd9019c964995a3e64d79 Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Tue, 24 Dec 2019 22:19:27 +0900 Subject: [PATCH] PCI: marvell_powercycle_then_flr quirk --- drivers/pci/quirks.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 308f74439..3ae947ec6 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3895,6 +3895,30 @@ static int delay_250ms_after_flr(struct pci_dev *dev, int probe) return 0; } +#define PCIE_DEVICE_ID_MARVELL_88W8897 0x2b38 + +static int marvell_powercycle_then_flr(struct pci_dev *dev, int probe) +{ + struct pci_dev *parent_dev = pci_upstream_bridge(dev); + + if (probe) + return 0; + + /* Power cycle */ + pci_alert(dev, "DEBUG: putting into D3cold...\n"); + pci_set_power_state(dev, PCI_D3cold); + pci_alert(parent_dev, "DEBUG: putting into D3cold...\n"); + pci_set_power_state(parent_dev, PCI_D3cold); + pci_alert(parent_dev, "DEBUG: putting into D0...\n"); + pci_set_power_state(parent_dev, PCI_D0); + pci_alert(dev, "DEBUG: putting into D0...\n"); + pci_set_power_state(dev, PCI_D0); + + pcie_flr(dev); + + return 0; +} + static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, reset_intel_82599_sfp_virtfn }, @@ -3906,6 +3930,10 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { { PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr }, { PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID, reset_chelsio_generic_dev }, + { PCI_VENDOR_ID_MARVELL, PCIE_DEVICE_ID_MARVELL_88W8897, + marvell_powercycle_then_flr }, + { PCI_VENDOR_ID_MARVELL_EXT, PCIE_DEVICE_ID_MARVELL_88W8897, + marvell_powercycle_then_flr }, { 0 } }; -- 2.24.1 ```

log

```bash Log on Surface Book 1. This quirk works for SB1 but not work for Surface 3. kern :notice: [ 0.000000] Linux version 4.19.90 (ubuntu@ip-172-31-25-89) (Chromium OS 10.0_pre370808_p20191029-r6 clang version 10.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm-project 1bea97c971d60f261f1bdfaa7b6d9cb30a6962fd) (based on LLVM 10.0.0svn)) #13 SMP PREEMPT Tue Dec 24 22:21:39 JST 2019 kern :info : [ 0.000000] Command line: BOOT_IMAGE=/syslinux/vmlinuz.A init=/sbin/init boot=local rootwait ro noresume noswap loglevel=7 noinitrd console= i915.modeset=1 cros_efi cros_debug root=PARTUUID=580596c5-d3fe-4fe4-98b5-3bc8df601d10 nvme_core.default_ps_max_latency_us=200000 [...] # # Performing reset: # echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset # reset can be done several times and wifi is working after several reset # kern :info : [ 484.558364] mwifiex_pcie 0000:03:00.0: Resetting per request kern :info : [ 484.560497] mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]: reason code 3 kern :info : [ 484.560590] mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex... kern :info : [ 484.568945] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.568955] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573897] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573900] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :info : [ 484.573902] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573903] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :info : [ 484.573904] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573905] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :info : [ 484.573906] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573908] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :info : [ 484.573909] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573910] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :info : [ 484.573911] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 484.573912] mwifiex_pcie 0000:03:00.0: deleting the crypto keys kern :alert : [ 484.587074] mwifiex_pcie 0000:03:00.0: DEBUG: putting into D3cold... kern :info : [ 484.608531] usb 1-6: USB disconnect, device number 6 kern :alert : [ 484.865787] pcieport 0000:00:1d.3: DEBUG: putting into D3cold... kern :info : [ 484.913199] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :info : [ 484.917229] device veth_mlan0 left promiscuous mode kern :info : [ 484.917234] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :alert : [ 484.934806] pcieport 0000:00:1d.3: DEBUG: putting into D0... kern :alert : [ 485.149591] mwifiex_pcie 0000:03:00.0: DEBUG: putting into D0... # # FW is re-downloaded here # kern :info : [ 486.610539] mwifiex_pcie 0000:03:00.0: info: FW download over, size 723540 bytes kern :info : [ 487.316584] usb 1-6: new high-speed USB device number 7 using xhci_hcd kern :warn : [ 487.443019] usb 1-6: config 1 interface 1 altsetting 0 endpoint 0x85 has wMaxPacketSize 0, skipping kern :warn : [ 487.443031] usb 1-6: config 1 interface 1 altsetting 0 endpoint 0x5 has wMaxPacketSize 0, skipping kern :info : [ 487.443572] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01 kern :info : [ 487.443577] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3 kern :info : [ 487.443580] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device kern :info : [ 487.443583] usb 1-6: Manufacturer: Marvell kern :info : [ 487.443585] usb 1-6: SerialNumber: 0000000000000000 kern :debug : [ 487.445827] usb 1-6: GPIO lookup for consumer reset kern :debug : [ 487.445829] usb 1-6: using ACPI for GPIO lookup kern :debug : [ 487.445832] acpi device:24: GPIO: looking up reset-gpios kern :debug : [ 487.445835] acpi device:24: GPIO: looking up reset-gpio kern :debug : [ 487.445836] usb 1-6: using lookup tables for GPIO lookup kern :debug : [ 487.445838] usb 1-6: No GPIO consumer reset found kern :err : [ 487.446487] Bluetooth: hci_event.c:hci_cmd_status_evt() hci0: unexpected event for opcode 0x0000 kern :err : [ 487.446735] Bluetooth: hci_event.c:hci_cmd_status_evt() hci0: unexpected event for opcode 0x0000 kern :info : [ 487.649361] mwifiex_pcie 0000:03:00.0: WLAN FW is active kern :info : [ 487.714425] mwifiex_pcie 0000:03:00.0: Unknown api_id: 4 kern :info : [ 487.745324] mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) kern :info : [ 487.745327] mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) kern :info : [ 487.798365] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready kern :info : [ 487.885646] IPv6: ADDRCONF(NETDEV_UP): veth_mlan0: link is not ready kern :info : [ 487.896916] arc_mlan0: port 1(veth_mlan0) entered blocking state kern :info : [ 487.896919] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :info : [ 487.897142] device veth_mlan0 entered promiscuous mode kern :info : [ 487.897720] arc_mlan0: port 1(veth_mlan0) entered blocking state kern :info : [ 487.897722] arc_mlan0: port 1(veth_mlan0) entered forwarding state kern :info : [ 487.914311] mlan0: renamed from peer_mlan0 kern :info : [ 487.921307] IPv6: ADDRCONF(NETDEV_CHANGE): veth_mlan0: link becomes ready kern :info : [ 490.928848] mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID] kern :info : [ 491.055054] mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully kern :info : [ 491.120328] IPv6: ADDRCONF(NETDEV_CHANGE): mlan0: link becomes ready this quirk has side effect (`sudo lspci -nnvvv` for wifi device): Before: L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ after: L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- Can be fixed: echo powersave | sudo tee /sys/module/pcie_aspm/parameters/policy echo powersupersave | sudo tee /sys/module/pcie_aspm/parameters/policy ```

spokV commented 4 years ago

I tried a quirk that adds power-cycling wifi and its root port (D3cold/D0) before pcie_flr().

The reset function is working now for SB1. I think this quirk will work for SP4/SB1 generation or later. But unfortunately, not working for Surface 3. Maybe not working also for SP3 or former.

Just leaving what I did here. I really hope to fix this issue on mwifiex driver side and for Surface 3.

0001-PCI-marvell_powercycle_then_flr-quirk.patch

From a5526bbfffe146d7297fd9019c964995a3e64d79 Mon Sep 17 00:00:00 2001
From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com>
Date: Tue, 24 Dec 2019 22:19:27 +0900
Subject: [PATCH] PCI: marvell_powercycle_then_flr quirk

---
 drivers/pci/quirks.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 308f74439..3ae947ec6 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3895,6 +3895,30 @@ static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
  return 0;
 }

+#define PCIE_DEVICE_ID_MARVELL_88W8897     0x2b38
+
+static int marvell_powercycle_then_flr(struct pci_dev *dev, int probe)
+{
+ struct pci_dev *parent_dev = pci_upstream_bridge(dev);
+
+ if (probe)
+     return 0;
+
+ /* Power cycle */
+ pci_alert(dev, "DEBUG: putting into D3cold...\n");
+ pci_set_power_state(dev, PCI_D3cold);
+ pci_alert(parent_dev, "DEBUG: putting into D3cold...\n");
+ pci_set_power_state(parent_dev, PCI_D3cold);
+ pci_alert(parent_dev, "DEBUG: putting into D0...\n");
+ pci_set_power_state(parent_dev, PCI_D0);
+ pci_alert(dev, "DEBUG: putting into D0...\n");
+ pci_set_power_state(dev, PCI_D0);
+
+ pcie_flr(dev);
+
+ return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
  { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
       reset_intel_82599_sfp_virtfn },
@@ -3906,6 +3930,10 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
  { PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
  { PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
      reset_chelsio_generic_dev },
+ { PCI_VENDOR_ID_MARVELL, PCIE_DEVICE_ID_MARVELL_88W8897,
+     marvell_powercycle_then_flr },
+ { PCI_VENDOR_ID_MARVELL_EXT, PCIE_DEVICE_ID_MARVELL_88W8897,
+     marvell_powercycle_then_flr },
  { 0 }
 };

-- 
2.24.1

log

Hi @kitakar5525, Can you give the complete patches+scipts needed on top "clean" kernel for pro 2017 so i could build myself?

kitakar5525 commented 4 years ago

Here is the (complete) patch!: 0001-v2-PCI-marvell_powercycle_then_flr-quirk.patch.gz

Changes in v2

It should apply cleanly on normal Linux 4.19/5.4.

and just run the following command to test the reset function (no script needed to test):

echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

To be honest, I have not yet tested if the wifi can actually recover from bad state (e.g. "wifi crash after idle or using" with ASPM L1 enabled, I can't test the case because I cannot reproduce the issue on SB1). So, I'm glad if you can test the case.

By the way, according to pci spec, we can go into D3cold only from D3hot. So, depending on the power state the reset happens, it may not be compliant to the spec. But the patch just works for SB1 (but not working for Surface 3).

EDIT: tried to recover from "wifi crash after suspend" on SB1

and actually wifi recovered from bad state several times. Anyway I'm not very interested in this approach (power-cycling quirk); I want to fix the reset function for all Surface devices including Surface 3.

log

```bash Log on Surface Book 1 on chromeos-4.19 4.19.90 (should work also on normal linux kernel) with - old wifi.patch reverted to perform card reset (the patch disables card reset) - patch from sebanc reverted to reproduce wifi crash after suspend - v2-PCI-marvell_powercycle_then_flr-quirk.patch applied to try to recover from wifi crash after suspend - not reloading any modules from suspend script Lines starting with "/usr/bin/powerd_suspend" are output from sleep script by me, showing suspend stats kern :notice: [ 0.000000] Linux version 4.19.90 (ubuntu@ip-172-31-25-89) (Chromium OS 10.0_pre370808_p20191029-r6 clang version 10.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm-project 1bea97c971d60f261f1bdfaa7b6d9cb30a6962fd) (based on LLVM 10.0.0svn)) #15 SMP PREEMPT Fri Dec 27 20:34:47 JST 2019 kern :info : [ 0.000000] Command line: BOOT_IMAGE=/syslinux/vmlinuz.A init=/sbin/init boot=local rootwait ro noresume noswap loglevel=7 noinitrd console= i915.modeset=1 cros_efi cros_debug root=PARTUUID=580596c5-d3fe-4fe4-98b5-3bc8df601d10 nvme_core.default_ps_max_latency_us=200000 [...] # # Suspend start # kern :info : [ 238.432308] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: notify gfx status : 0 kern :info : [ 238.443756] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: cmd : 0x80000005, status 9 user :warn : [ 238.606806] /usr/bin/powerd_suspend: before_suspend_hook called user :warn : [ 238.715464] /usr/bin/powerd_suspend: battery: percent_sysfs_before_suspend: 100.00000000000000000000 % user :warn : [ 238.731930] /usr/bin/powerd_suspend: battery: percent_psi_before_suspend: 100 % kern :info : [ 238.745588] PM: suspend entry (s2idle) kern :info : [ 238.745593] PM: Syncing filesystems ... done. kern :debug : [ 238.822484] PM: Preparing system for sleep (s2idle) kern :info : [ 238.824668] Freezing user space processes ... (elapsed 0.002 seconds) done. kern :info : [ 238.827645] OOM killer disabled. kern :info : [ 238.827649] Freezing remaining freezable tasks ... (elapsed 0.024 seconds) done. kern :debug : [ 238.851957] PM: Suspending system (s2idle) kern :err : [ 238.853347] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: mei_cldev_send() error 0x4:-19 kern :err : [ 238.853356] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: mei_cldev_send() error 0x7:-19 kern :info : [ 238.855277] mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]: reason code 3 [...] kern :info : [ 238.858820] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: !! end event loop !! kern :info : [ 238.858919] IPTS removed kern :info : [ 238.861940] mwifiex_pcie 0000:03:00.0: None of the WOWLAN triggers enabled kern :debug : [ 239.400980] PM: suspend of devices complete after 548.003 msecs kern :debug : [ 239.417104] PM: late suspend of devices complete after 16.120 msecs kern :debug : [ 239.479509] PM: noirq suspend of devices complete after 61.319 msecs kern :debug : [ 239.479512] PM: suspend-to-idle # # Resume from suspend # kern :debug : [ 343.523291] PM: Timekeeping suspended for 104.320 seconds kern :debug : [ 343.525015] PM: Wakeup pending, aborting suspend kern :debug : [ 343.525021] PM: last active wakeup source: MSHW0040:00 kern :debug : [ 343.525023] PM: resume from suspend-to-idle kern :info : [ 343.666074] mwifiex_pcie 0000:03:00.0: Refused to change power state, currently in D3 kern :debug : [ 343.695743] PM: noirq resume of devices complete after 170.717 msecs kern :debug : [ 343.752120] PM: early resume of devices complete after 3.986 msecs kern :info : [ 343.755920] [drm] HuC: Loaded firmware i915/skl_huc_ver01_07_1398.bin (version 1.7) kern :info : [ 343.760918] [drm] GuC: Loaded firmware i915/skl_guc_ver9_33.bin (version 9.33) kern :info : [ 343.761048] i915 0000:00:02.0: GuC firmware version 9.33 kern :info : [ 343.761050] i915 0000:00:02.0: GuC submission enabled kern :info : [ 343.761052] i915 0000:00:02.0: HuC enabled kern :debug : [ 343.830619] PM: resume of devices complete after 78.494 msecs kern :debug : [ 343.831041] PM: Finishing wakeup. kern :info : [ 343.831053] probing Intel Precise Touch & Stylus kern :info : [ 343.831055] IPTS using DMA_BIT_MASK(64) kern :info : [ 343.831063] OOM killer enabled. kern :info : [ 343.831065] Restarting tasks ... done. kern :info : [ 343.859767] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: ipts_hid_parse() start [...] kern :err : [ 343.891249] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: touch enabled 4 kern :info : [ 344.094297] PM: suspend exit user :warn : [ 344.096719] /usr/bin/powerd_suspend: after_suspend_hook called user :warn : [ 344.175628] /usr/bin/powerd_suspend: suspend time: 106 sec (.02944444444444444444 hour) user :warn : [ 344.176797] /usr/bin/powerd_suspend: ========== battery stats from sysfs ========== /usr/bin/powerd_suspend: battery: energy_rate_sysfs_micro_watt: 0 uW (0 W) (assuming unit is micro) /usr/bin/powerd_suspend: battery: percent_sysfs_after_suspend: 100.00000000000000000000 % /usr/bin/powerd_suspend: battery: percent_sysfs_diff: 0 % /usr/bin/powerd_suspend: battery: percent_per_hour_sysfs: 0 % per hour user :warn : [ 344.181355] /usr/bin/powerd_suspend: ========== battery stats from power_supply_info ========== /usr/bin/powerd_suspend: battery: energy_rate_psi_watt: 0 W /usr/bin/powerd_suspend: battery: percent_psi_after_suspend: 100 % /usr/bin/powerd_suspend: battery: percent_psi_diff: 0 % /usr/bin/powerd_suspend: battery: percent_per_hour_psi: 0 % per hour user :warn : [ 344.182430] /usr/bin/powerd_suspend: ========== sleep residency stats ========== /usr/bin/powerd_suspend: lpi_cpu_us_diff: 103944783 usec (103.94478300000000000000 sec) /usr/bin/powerd_suspend: lpi_system_us_diff: usec ( sec) /usr/bin/powerd_suspend: s0ix_residency_us_diff: 0 usec (0 sec) kern :info : [ 344.480922] ipts 0000:00:16.4-3e8d0870-271a-4208-8eb5-9acb9402ae04: notify gfx status : 1 kern :err : [ 348.882462] i2c_hid i2c-MSHW0030:00: failed to retrieve report from device. # # wifi crash happened # kern :info : [ 353.852847] mwifiex_pcie 0000:03:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0x10, act = 0x1 kern :info : [ 353.852857] mwifiex_pcie 0000:03:00.0: num_data_h2c_failure = 0 kern :info : [ 353.852863] mwifiex_pcie 0000:03:00.0: num_cmd_h2c_failure = 0 kern :info : [ 353.852867] mwifiex_pcie 0000:03:00.0: is_cmd_timedout = 1 kern :info : [ 353.852872] mwifiex_pcie 0000:03:00.0: num_tx_timeout = 0 kern :info : [ 353.852877] mwifiex_pcie 0000:03:00.0: last_cmd_index = 3 kern :info : [ 353.852883] mwifiex_pcie 0000:03:00.0: last_cmd_id: 28 00 28 00 e5 00 10 00 28 00 kern :info : [ 353.852888] mwifiex_pcie 0000:03:00.0: last_cmd_act: 13 00 13 00 01 00 01 00 13 00 kern :info : [ 353.852893] mwifiex_pcie 0000:03:00.0: last_cmd_resp_index = 4 kern :info : [ 353.852897] mwifiex_pcie 0000:03:00.0: last_cmd_resp_id: 28 80 28 80 28 80 28 80 e5 80 kern :info : [ 353.852902] mwifiex_pcie 0000:03:00.0: last_event_index = 2 kern :info : [ 353.852907] mwifiex_pcie 0000:03:00.0: last_event: 0b 00 0a 00 0a 00 0a 00 0a 00 kern :info : [ 353.852912] mwifiex_pcie 0000:03:00.0: data_sent=0 cmd_sent=1 kern :info : [ 353.852916] mwifiex_pcie 0000:03:00.0: ps_mode=0 ps_state=0 kern :info : [ 353.854185] mwifiex_pcie 0000:03:00.0: ===mwifiex driverinfo dump start=== kern :info : [ 353.854194] mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) kern :info : [ 353.854209] mwifiex_pcie 0000:03:00.0: PCIE register dump start kern :info : [ 353.854378] mwifiex_pcie 0000:03:00.0: failed to read firmware status kern :info : [ 353.854424] mwifiex_pcie 0000:03:00.0: ===mwifiex driverinfo dump end=== kern :info : [ 353.854429] mwifiex_pcie 0000:03:00.0: == mwifiex firmware dump start == kern :info : [ 353.854598] mwifiex_pcie 0000:03:00.0: == mwifiex dump information to /sys/class/devcoredump start kern :info : [ 353.854768] mwifiex_pcie 0000:03:00.0: == mwifiex dump information to /sys/class/devcoredump end kern :info : [ 353.856935] mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex... kern :info : [ 353.881469] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed kern :info : [ 353.881482] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed # # Perofrming wifi reset with power-cycling quirk # kern :info : [ 353.906945] mwifiex_pcie 0000:03:00.0: Performing power-cycling (D3cold -> D0) for wifi and its root port before FLR kern :info : [ 353.906948] mwifiex_pcie 0000:03:00.0: DEBUG: putting into D3cold... kern :info : [ 353.919544] usb 1-6: USB disconnect, device number 7 kern :info : [ 353.996203] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :info : [ 354.001209] device veth_mlan0 left promiscuous mode kern :info : [ 354.001215] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :info : [ 354.179998] pcieport 0000:00:1d.3: DEBUG: putting into D3cold... kern :info : [ 354.248326] pcieport 0000:00:1d.3: DEBUG: putting into D0... kern :info : [ 354.462919] mwifiex_pcie 0000:03:00.0: DEBUG: putting into D0... # # wifi firmware re-downloaded # kern :info : [ 355.880569] mwifiex_pcie 0000:03:00.0: info: FW download over, size 723540 bytes kern :err : [ 356.367558] i2c_hid i2c-MSHW0030:00: failed to retrieve report from device. kern :info : [ 356.581939] usb 1-6: new high-speed USB device number 8 using xhci_hcd kern :warn : [ 356.708490] usb 1-6: config 1 interface 1 altsetting 0 endpoint 0x85 has wMaxPacketSize 0, skipping kern :warn : [ 356.708498] usb 1-6: config 1 interface 1 altsetting 0 endpoint 0x5 has wMaxPacketSize 0, skipping kern :info : [ 356.709520] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01 kern :info : [ 356.709526] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3 kern :info : [ 356.709530] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device kern :info : [ 356.709534] usb 1-6: Manufacturer: Marvell kern :info : [ 356.709537] usb 1-6: SerialNumber: 0000000000000000 kern :debug : [ 356.712861] usb 1-6: GPIO lookup for consumer reset kern :debug : [ 356.712864] usb 1-6: using ACPI for GPIO lookup kern :debug : [ 356.712869] acpi device:24: GPIO: looking up reset-gpios kern :debug : [ 356.712872] acpi device:24: GPIO: looking up reset-gpio kern :debug : [ 356.712875] usb 1-6: using lookup tables for GPIO lookup kern :debug : [ 356.712878] usb 1-6: No GPIO consumer reset found kern :err : [ 356.713658] Bluetooth: hci_event.c:hci_cmd_status_evt() hci0: unexpected event for opcode 0x0000 kern :err : [ 356.713894] Bluetooth: hci_event.c:hci_cmd_status_evt() hci0: unexpected event for opcode 0x0000 kern :info : [ 356.923852] mwifiex_pcie 0000:03:00.0: WLAN FW is active kern :info : [ 356.980149] mwifiex_pcie 0000:03:00.0: Unknown api_id: 4 kern :info : [ 357.014439] mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) kern :info : [ 357.014443] mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) kern :info : [ 357.045680] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready kern :info : [ 357.155434] IPv6: ADDRCONF(NETDEV_UP): veth_mlan0: link is not ready kern :info : [ 357.158608] arc_mlan0: port 1(veth_mlan0) entered blocking state kern :info : [ 357.158611] arc_mlan0: port 1(veth_mlan0) entered disabled state kern :info : [ 357.158801] device veth_mlan0 entered promiscuous mode kern :info : [ 357.159034] arc_mlan0: port 1(veth_mlan0) entered blocking state kern :info : [ 357.159036] arc_mlan0: port 1(veth_mlan0) entered forwarding state kern :info : [ 357.169958] mlan0: renamed from peer_mlan0 kern :info : [ 357.175204] IPv6: ADDRCONF(NETDEV_CHANGE): veth_mlan0: link becomes ready kern :info : [ 360.173661] mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID] kern :info : [ 360.301621] mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully kern :info : [ 360.348800] IPv6: ADDRCONF(NETDEV_CHANGE): mlan0: link becomes ready # # wifi recovered from bad state and usable after multitple suspend # ```

kitakar5525 commented 4 years ago

Ah, make sure to revert old wifi.patch (not the patch from sebanc). It disables card reset after firmware went into bad state (because card reset caused another problem at that time).

qzed commented 4 years ago

@kitakar5525

When reverting the current wifi patch, need to inform users who don't use NetworkManager that you may need to disable wifi power_save manually

Right, we should probably put that on the wiki. https://github.com/qzed/linux-surface/wiki should be editable for everyone, if you want to have a look at that (I should probably add a "Known Issues" section).

The reset via PCI D3cold looks interesting, but I agree that there should be a better way.

chrismin13 commented 4 years ago

Hey, just wanted to let you know that the latest 5.3 release has fixed the wifi issues for me! I no longer need to run the disable aspm script after booting and even suspend seems to be working well so far. Thank you for the improvements, this has been a perfect release so far!

qzed commented 4 years ago

Thanks!

kitakar5525 commented 4 years ago

@qzed

Right, we should probably put that on the wiki. https://github.com/qzed/linux-surface/wiki should be editable for everyone, if you want to have a look at that (I should probably add a "Known Issues" section).

OK. Done at https://github.com/qzed/linux-surface/wiki/Known-Issues#mwifiex-you-may-need-to-disable-wifi-power_save-manually-if-you-dont-use-networkmanager

qzed commented 4 years ago

@kitakar5525 Awesome, thanks!

spokV commented 4 years ago

For me latest build 5.4.6-1 and all other build and fixes till now still suffer from wifi drops. Usually when using 5G with traffic load or even on 2.4 with load. Any suggestions to what else can i try?

spokV commented 4 years ago

What is the mrvl updated driver to use? there is one used in jackaday install repo....

qzed commented 4 years ago

@spokV

For me latest build 5.4.6-1 and all other build and fixes till now still suffer from wifi drops. Usually when using 5G with traffic load or even on 2.4 with load. Any suggestions to what else can i try?

Apart from @kitakar5525's suggestions in https://github.com/linux-surface/kernel/issues/20#issuecomment-568135035, I don't have anything.

What is the mrvl updated driver to use? there is one used in jackaday install repo....

Do you mean the mwlwifi driver? We dropped that because that didn't fix the issue and introduced other problems. I think it was originally included to provide an alternative to mwifiex but that never really worked out.

orychalk commented 4 years ago

Hello,

Just for information, on my Surface 3(non pro) : 1/ With kernel 5.1.15 and the script I don't have any wifi crash 2/ With Qzed kernel 5.3.18-1 and 5.4.6-1 and with the same script, I always have wifi crash

The script is disabling really ASPM L1 on my S3 :
LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- CommClk+

qzed commented 4 years ago

@orychalk Have you disabled power-save on the wifi card (see wiki entry)?

orychalk commented 4 years ago

Effectively, with power-save disabled and the script ... it's stable :) Many Thanks :)

kitakar5525 commented 4 years ago

"mwifiex crash after suspend"

The commit (linux-surface/kernel@8dbbfa6) hopefully fixed this issue on all devices that use 88W8897 chip.

"mwifiex crash on idle/using"

I heard this issue only from Surface 3 and SP5 owners.

I tried using Surface 3 with 5.4 kernel and ASPM L1 enabled. It crashed after 2.5 days (2 overnights and 3 daytime usage without suspend or reboot), when I was downloading Arch Linux package updates.

What the commit (linux-surface/kernel@8dbbfa6) does are

and those changes didn't fix this issue. L1 still needs to be disabled at least on Surface 3.

crash log

```bash mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xa4, act = 0x0 mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 mwifiex_pcie 0000:01:00.0: last_cmd_index = 1 mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 a4 00 16 00 a4 00 7f 00 mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 00 00 00 00 00 00 00 00 mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 7f 80 16 80 a4 80 7f 80 mwifiex_pcie 0000:01:00.0: last_event_index = 4 mwifiex_pcie 0000:01:00.0: last_event: 37 00 37 00 37 00 37 00 37 00 mwifiex_pcie 0000:01:00.0: data_sent=1 cmd_sent=1 mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 mwifiex_pcie 0000:01:00.0: failed to get signal information mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump start=== mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) mwifiex_pcie 0000:01:00.0: PCIE register dump start mwifiex_pcie 0000:01:00.0: pcie scratch register: mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00 reg:0xcf8, value=0xa64182 reg:0xcfc, value=0x2575700 mwifiex_pcie 0000:01:00.0: PCIE register dump end mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end=== mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start == mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: failed to get signal information mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Ignore scan. Card removed or firmware in bad state mwifiex_pcie 0000:01:00.0: scan failed: -14 mwifiex_pcie 0000:01:00.0: Firmware dump Finished! mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end == mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump start mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump end mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: deleting the crypto keys mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... mwifiex_pcie 0000:01:00.0: rx_pending=0, tx_pending=1, cmd_pending=0 mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed # # Trying to reset itself, but failed # mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld mwifiex_pcie 0000:01:00.0: WLAN FW is active mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xfa, act = 0xc000 mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 mwifiex_pcie 0000:01:00.0: last_cmd_index = 2 mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 a4 00 fa 00 a4 00 7f 00 mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 00 00 00 c0 00 00 00 00 mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 7f 80 16 80 a4 80 7f 80 mwifiex_pcie 0000:01:00.0: last_event_index = 4 mwifiex_pcie 0000:01:00.0: last_event: 37 00 37 00 37 00 37 00 37 00 mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1 mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device ```

"mwifiex reset feature broken"

No changes after the commit (linux-surface/kernel@8dbbfa6) as expected.

spokV commented 4 years ago

"mwifiex crash after suspend"

The commit (8dbbfa6) hopefully fixed this issue on all devices that use 88W8897 chip.

"mwifiex crash on idle/using"

I heard this issue only from Surface 3 and SP5 owners.

I tried using Surface 3 with 5.4 kernel and ASPM L1 enabled. It crashed after 2.5 days (2 overnights and 3 daytime usage without suspend or reboot), when I was downloading Arch Linux package updates.

What the commit (8dbbfa6) does are

* modify `mwifiex_pcie_suspend`/`mwifiex_pcie_resume` functions to fix crash after suspend

* disable `bridge_d3` to fix crash after suspend

* disable "auto deep sleep"

and those changes didn't fix this issue. L1 still needs to be disabled at least on Surface 3. crash log

"mwifiex reset feature broken"

No changes after the commit (8dbbfa6) as expected.

What release includes the fix (8dbbfa6)?

spokV commented 4 years ago

Second Qs...is there a way to recover from wifi crash except from hard reset?

kitakar5525 commented 4 years ago

What release includes the fix (8dbbfa6)?

The fix (8dbbfa6) is included in the latest 5.3/5.4 releases:

Second Qs...is there a way to recover from wifi crash except from hard reset?

Does the crash occur even under the following conditions?