kaloz / mwlwifi

mac80211 driver for the Marvell 88W8864 802.11ac chip

Excessive softirq #43

Closed northbound1 closed 8 years ago

northbound1 commented 9 years ago

Since no one else is going to start this issue :)

root@OpenWrt:~# grep mwl /proc/interrupts ; sleep 1; grep mwl /proc/interrupts
 87:    1282535          0  armada_370_xp_irq  59  mwlwifi
 88:  124531322          0  armada_370_xp_irq  60  mwlwifi
 87:    1284531          0  armada_370_xp_irq  59  mwlwifi
 88:  124533305          0  armada_370_xp_irq  60  mwlwifi

htop also shows 37% on core0 with no load, and 60%+ when both radios are enabled. Could the checks be cut in half, or to 25%?
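The per-second interrupt rate implied by two snapshots like the ones above can be worked out with a short script. This is a minimal Python sketch, not part of the driver or of OpenWrt: the function names are made up for illustration, it assumes the leading all-digit columns after the IRQ number are per-CPU counters, and it treats the `sleep 1` gap as exactly one second, which is only approximately true.

```python
def parse_interrupts(text):
    """Map IRQ number -> total count summed over all per-CPU columns."""
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split()
        irq = fields[0].rstrip(":")
        total = 0
        # Per-CPU counters are the leading all-digit fields after the IRQ number.
        for field in fields[1:]:
            if not field.isdigit():
                break
            total += int(field)
        counts[irq] = total
    return counts


def interrupt_rates(before, after, interval_s=1.0):
    """Interrupts per second for each IRQ present in the first snapshot."""
    first, second = parse_interrupts(before), parse_interrupts(after)
    return {irq: (second[irq] - first[irq]) / interval_s for irq in first}


# The two snapshots from the report above.
BEFORE = """\
 87:    1282535          0  armada_370_xp_irq  59  mwlwifi
 88:  124531322          0  armada_370_xp_irq  60  mwlwifi
"""
AFTER = """\
 87:    1284531          0  armada_370_xp_irq  59  mwlwifi
 88:  124533305          0  armada_370_xp_irq  60  mwlwifi
"""

rates = interrupt_rates(BEFORE, AFTER)
for irq, rate in rates.items():
    print(f"IRQ {irq}: {rate:.0f} interrupts/s")
```

For these numbers the script reports 1996 interrupts/s on IRQ 87 and 1983 interrupts/s on IRQ 88, so roughly 2000 per second per radio at idle.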

davidc502 commented 9 years ago

Agreed. Monitoring shows 60% utilization on core0 even though the router is idle. If I turn off both wifi radios, CPU returns to normal.

This doesn't seem to affect performance.

notnyt commented 9 years ago

I see the same, openwrt trunk and 10.3.0.12

yuhhaurlin commented 9 years ago

Thanks. I will use a timer task in place of the queue empty interrupt to flush AMSDU packets.

northbound1 commented 9 years ago

@yuhhaurlin Thanks for the work. I can almost see the light at the end of the tunnel. :) And thanks to ALL of the others who helped bring it together!

gufus commented 9 years ago

@yuhhaurlin

Thanks for everything. :+1:

johnnysl commented 9 years ago

Initial tests show that the new driver has a much higher sirq load during heavy wifi usage.

Device: WRT1900AC-V2

notnyt commented 9 years ago

Confirmed.

CPU0:  0.0% usr  0.0% sys  0.0% nic 11.3% idle  0.0% io  0.0% irq 88.6% sirq
CPU1:  0.1% usr  0.3% sys  0.0% nic 60.1% idle  0.0% io  0.0% irq 39.2% sirq

notnyt commented 9 years ago

I think this might be a little tight.

johnnysl commented 9 years ago

Every clock tick?

notnyt commented 9 years ago

What load were you seeing with the other driver while performing a transfer? I didn't test beforehand.

johnnysl commented 9 years ago

at max: 20%, now over 80% for the same amount of wifi traffic

yuhhaurlin commented 9 years ago

I think it is correct to use the queue empty interrupt to flush AMSDU packets. Although it generates a lot of interrupts when the system is idle, it does not generate any interrupts when wifi traffic is heavy. I will change it back and release 10.3.0.14.

johnnysl commented 9 years ago

@yuhhaurlin: the whole discussion comes from the fact that the driver seems to create excessive load on a V1 WRT1900AC when idle. This is blamed on the high number of interrupts generated every second. Yet on the V2 there is no CPU load when idle (although a lot of interrupts still show up in /proc/interrupts).

So ideally you want a driver that creates no CPU load when idle and no excessive load when moving data over wifi. Personally I have no clue whether the huge number of interrupts is related to the high CPU load on the V1, or whether there is another bug in the driver, the kernel, or somewhere else.

yuhhaurlin commented 9 years ago

If wifi is idle, the flush function should not have much to do, even though more interrupts are generated.

notnyt commented 9 years ago

Please update the issue after you commit and I will test again right away. Thanks

gufus commented 9 years ago

1900ac v1

wifi driver 10.3.0.12

Both mwlwifi 2g and 5g on CPU1

CPU0:  0.1% usr  0.5% sys  0.0% nic 99.2% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  0.9% usr  1.5% sys  0.0% nic 36.6% idle  0.0% io  0.0% irq 60.7% sirq

TireMeat commented 9 years ago

@yuhhaurlin ... just a thought -- in mwl_tx_done and other functions the lock/unlocks are very far apart (code-wise, lots of things happening), judging by the high soft IRQ count, (Rescheduling and Single Func), could it be possible that there is too much processing happening in between the locks so that the kernel defers calls (hence creating the Rescheduling interrupts)?

What would happen if we tighten the lock/unlock to just the commands needed in the lock?
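To make the question concrete, here is a generic illustration of what "tightening" a lock means. It is a Python sketch using `threading.Lock`, not mwlwifi code: `expensive`, `drain_wide`, `drain_tight` and `run` are hypothetical names, and the kernel driver uses its own locking primitives rather than anything shown here. The sketch only demonstrates the pattern of holding the lock for shared-state access alone. It says nothing about whether doing so in `mwl_tx_done` would reduce the softirq load.

```python
import threading


def expensive(item):
    # Placeholder for per-item work that does not touch shared state.
    return item * item


def drain_wide(lock, pending, done):
    """Hold the lock across the whole loop, including the expensive work."""
    with lock:
        while pending:
            done.append(expensive(pending.pop(0)))


def drain_tight(lock, pending, done):
    """Hold the lock only while touching the shared lists."""
    while True:
        with lock:
            if not pending:
                return
            item = pending.pop(0)
        result = expensive(item)  # runs without the lock held
        with lock:
            done.append(result)


def run(drain, workers=2):
    """Drain eight items with the given strategy and return sorted results."""
    lock, pending, done = threading.Lock(), list(range(8)), []
    threads = [
        threading.Thread(target=drain, args=(lock, pending, done))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(done)


print(run(drain_wide) == run(drain_tight))
```

Both variants produce the same results. The difference is that in the tight variant other threads can take the lock while one thread is busy in `expensive`, at the cost of acquiring the lock more often.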

notnyt commented 9 years ago

.14 is worse under load than .13. I think you had it better with the timer.

Mem: 51352K used, 203708K free, 124K shrd, 4628K buff, 11736K cached
CPU0:  0.0% usr  0.0% sys  0.0% nic  7.2% idle  0.0% io  0.0% irq 92.7% sirq
CPU1:  0.0% usr  1.4% sys  0.0% nic 24.6% idle  0.0% io  0.0% irq 73.9% sirq

notnyt commented 9 years ago

I actually tested these both under load, and they have similar sirq usage. However, .13 is MUCH better at idle. I'd suggest sticking with the timer as opposed to the queue empty interrupt.

Thoughts?

yuhhaurlin commented 9 years ago

I think Marvell will keep the queue empty interrupt to flush AMSDU packets.

yuhhaurlin commented 9 years ago

Please test with 10.3.0.14 to see whether you still encounter the issues listed here. I will check the open issues later and try to close those that no longer occur on 10.3.0.14. Thanks.

johnnysl commented 9 years ago

I'll check .14 on my v2 later today. What I don't understand is why the behavior is different on the different hardware versions.

johnnysl commented 9 years ago

@yuhhaurlin 10.3.0.14 on a WRT1900AC-V2 lowers the cpu temperature by about 15 degrees, and no more high load during heavy wifi transfers. CPU load is ~0 when idling as well, so this looks good for me.

yuhhaurlin commented 9 years ago

Thanks for your information.

notnyt commented 9 years ago

At idle

root@ZOMGWTFBBQWIFI:/# dmesg | grep 10.3.0
[   18.862598] <<Marvell 802.11ac Wireless Network Driver version 10.3.0.14>>
root@ZOMGWTFBBQWIFI:/# uptime
 09:50:45 up 17 min,  load average: 0.00, 0.01, 0.04
root@ZOMGWTFBBQWIFI:/# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +68.2°C

tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1:        +54.3°C
temp2:        +55.0°C

CPU0:  0.0% usr  3.1% sys  0.0% nic 39.0% idle  0.0% io  0.0% irq 57.8% sirq
CPU1:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq

notnyt commented 9 years ago

With a 2.4 GHz WAN-to-wifi transfer and a 5 GHz LAN-to-wifi transfer running. Is this within safe operating limits?

root@ZOMGWTFBBQWIFI:/# uptime
 10:20:23 up 47 min,  load average: 0.43, 0.38, 0.24

root@ZOMGWTFBBQWIFI:/# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +87.7°C

tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1:        +66.9°C
temp2:        +78.1°C

CPU0:  0.0% usr  0.0% sys  0.0% nic  7.8% idle  0.0% io  0.0% irq 92.1% sirq
CPU1:  0.8% usr  1.7% sys  0.0% nic 34.7% idle  0.0% io  0.0% irq 62.6% sirq

johnnysl commented 9 years ago

@notnyt I just have no idea why a V1 shows such different behavior. Was this always the case, so also with 10.3.0.3, or was it introduced in some later version? I'm getting the idea that the load issue you see might be caused by some other bug.

This is my temp during a load test:

/# sensors
tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1:        +44.4°C
temp2:        +46.2°C

armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +72.7°C

/# top (during idle)
Mem: 103284K used, 411740K free, 1804K shrd, 38620K buff, 18584K cached
CPU:   0% usr   0% sys   0% nic  98% idle   0% io   0% irq   1% sirq
Load average: 0.03 0.04 0.05 2/82 29752

notnyt commented 9 years ago

I wasn't monitoring this closely with 10.3.0.3, but I do not believe this was the case. I believe this began with 10.3.0.8, but I could not run it due to BA issues. Which kernel are you using?

What are you doing for a load test? My test was wan to wifi on 2.4ghz at 120mbps and lan to wifi on 5ghz at about 200mbps.

As for idle, it seems the v1 has always shown 50-60% sirq since then.

johnnysl commented 9 years ago

Load test: run iperf3 with an N-300 client on 5 GHz against an iperf3 server on the LAN. I test traffic both up and down, and the speed is around 160 Mbit/s.

/# uname -a
Linux MYROUTR 3.18.23 #3 SMP Fri Nov 6 11:50:11 CET 2015 armv7l n

I'm running DD bleeding edge r47389.

notnyt commented 9 years ago

@yuhhaurlin any idea why v2 isn't showing this problem but v1 is?

northbound1 commented 9 years ago

I am just a peon, but why does the CPU load increase with sirq when hardware acceleration is enabled (since .8)? I thought the work would be offloaded. Or am I out in left field?

yuhhaurlin commented 9 years ago

Top relies on the timer tick to collect its information, which means it samples at a rate of 100 times per second. For a regular periodic event, top may not report correct information.
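One way to read this claim is as a sampling effect: if CPU time is attributed by looking at what is running at each timer tick, work that recurs in step with the tick can be heavily over- or under-counted. The Python sketch below is a toy model under that assumption only. The 100 Hz tick comes from the comment above, while the function names, periods and busy times are invented for illustration, and the model does not describe how any particular kernel or top build actually accounts CPU time.

```python
def true_utilization(period_us, busy_us):
    """Actual fraction of time the periodic work keeps the CPU busy."""
    return busy_us / period_us


def sampled_utilization(period_us, busy_us, phase_us, tick_us=10_000, ticks=1000):
    """Fraction of ticks that land while the periodic work is running.

    The work starts at phase_us and then every period_us, running for
    busy_us each time. A tick counts as busy if it falls inside a run.
    """
    hits = 0
    for n in range(ticks):
        t = n * tick_us
        if (t - phase_us) % period_us < busy_us:
            hits += 1
    return hits / ticks


# Work locked to the 100 Hz tick: 1 ms of work every 10 ms (10% real load).
print(true_utilization(10_000, 1_000))            # 0.1
print(sampled_utilization(10_000, 1_000, 0))      # 1.0, every tick hits the work
print(sampled_utilization(10_000, 1_000, 5_000))  # 0.0, every tick misses it

# Work not locked to the tick: 1 ms every 7 ms (about 14.3% real load).
print(sampled_utilization(7_000, 1_000, 0))       # 0.143
```

In this model the reported figure for tick-locked work depends entirely on the phase between the work and the tick, while work with an unrelated period is sampled close to its true share.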

yuhhaurlin commented 9 years ago

I think you only need to pay attention to what top reports when wifi traffic is heavy.

northbound1 commented 9 years ago

Unless it is told to show both cores, top shows an average of both, at least for me. Htop, on the other hand, shows both. To me it is the heat that tells the story: with 10.3.0.3, idle load was close to nil and the fan never ran. Since 10.3.0.8 it has heated up to 60°C and cycled the fan every 10 to 15 minutes. This is at idle.

northbound1 commented 9 years ago

I guess what I am asking is: why is it the CPU getting hot rather than the wifi chips, since the load should be going there?

TireMeat commented 9 years ago

Does anyone know how much overhead kernel debugging (debugfs?) adds?

(BTW, 10.3.0.10 has the same system resource performance profile = ~4K+ int/sec and ~250 context switch/sec)

root@WRT1900ac:/# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 163972  11296  24084    0    0     0     0 2075  103  1 30 69  0
 0  0      0 163972  11296  24084    0    0     0     0 4245  258  0 27 73  0
 0  0      0 164028  11296  24084    0    0     0     0 4261  225  1 29 70  0
 0  0      0 163976  11296  24084    0    0     0     0 4185  252  0 24 76  0
 0  0      0 164032  11296  24084    0    0     0     0 3103  241  0 31 69  0
 0  0      0 163972  11296  24084    0    0     0     0 4235  207  0 28 72  0

northbound1 commented 9 years ago

This is what I see running guface's fan script; you can see the fan cycles. This is at idle. https://onedrive.live.com/redir?resid=E71014BBAA19358A!8052&authkey=!ALdIv8UJZS26DW0&v=3&ithint=photo%2cJPG

notnyt commented 9 years ago

Trying to blame top for anything is absurd. The busybox version in openwrt sleeps for 5 seconds at a time then reads the data from /proc. You can trigger an update by pressing enter. Pressing 1 shows individual cpu stats.
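For anyone who wants to check the numbers without top, per-CPU softirq share can be derived from two reads of `/proc/stat`, whose per-CPU lines list cumulative counters in the order user, nice, system, idle, iowait, irq, softirq. The Python sketch below is an illustration only: the function names are invented, the sample counters are made up rather than taken from any router in this thread, and later columns such as steal are ignored for simplicity.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_stat(text):
    """Map 'cpuN' -> dict of the first seven counters on that line."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and every non-CPU line.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cpus[parts[0]] = dict(zip(FIELDS, map(int, parts[1:8])))
    return cpus


def softirq_percent(before, after):
    """Percentage of elapsed CPU time spent in softirq, per CPU."""
    first, second = parse_stat(before), parse_stat(after)
    result = {}
    for cpu, now in second.items():
        delta = {key: now[key] - first[cpu][key] for key in FIELDS}
        total = sum(delta.values())
        result[cpu] = 100.0 * delta["softirq"] / total if total else 0.0
    return result


# Made-up counters for illustration only.
BEFORE = """\
cpu  1000 0 2000 90000 0 0 7000
cpu0 500 0 1000 40000 0 0 7000
cpu1 500 0 1000 50000 0 0 0
"""
AFTER = """\
cpu  1010 0 2020 90670 0 0 7300
cpu0 505 0 1010 40185 0 0 7300
cpu1 505 0 1010 50485 0 0 0
"""

shares = softirq_percent(BEFORE, AFTER)
for cpu, share in shares.items():
    print(f"{cpu}: {share:.1f}% sirq")
```

On a router, the two snapshots would come from reading `/proc/stat` twice with a delay in between. A longer interval averages over more ticks.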

driver 10.3.0.13 shows no sirq at idle, yet 10.3.0.14 shows thousands per second. CPU temperature at idle is approximately 70C

root@ZOMGWTFBBQWIFI:~# grep mwlwifi /proc/interrupts ;sleep 1;grep mwlwifi /proc/interrupts
 87:  164636645          0  armada_370_xp_irq  59  mwlwifi
 88:  160197843          0  armada_370_xp_irq  60  mwlwifi
 87:  164638651          0  armada_370_xp_irq  59  mwlwifi
 88:  160199865          0  armada_370_xp_irq  60  mwlwifi

root@ZOMGWTFBBQWIFI:~# uptime
 09:01:34 up 23:28,  load average: 0.11, 0.07, 0.06
root@ZOMGWTFBBQWIFI:~# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +70.4°C

tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1:        +55.2°C
temp2:        +56.1°C

northbound1 commented 9 years ago

10.3.0.13 is an improvement. I can now run both radios and stay cool.

root@OpenWrt:~# grep mwlwifi /proc/interrupts ;sleep 1;grep mwlwifi /proc/interrupts
 87:      29265          0  armada_370_xp_irq  59  mwlwifi
 88:    1043990          0  armada_370_xp_irq  60  mwlwifi
 87:      29318          0  armada_370_xp_irq  59  mwlwifi
 88:    1044030          0  armada_370_xp_irq  60  mwlwifi
root@OpenWrt:~# sensors
tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1:        +45.1°C
temp2:        +48.1°C

armada_thermal-virtual-0
Adapter: Virtual device
temp1:        +53.7°C

johnnysl commented 9 years ago

@northbound1: well, 10.3.0.13 does exactly the reverse on a V2; it's 10.3.0.14 that runs quite well on that platform. So I'm not convinced that the interrupts logged in /proc/interrupts are the cause of the high CPU load. It might also be some other change in the driver.

lantis1008 commented 9 years ago

.14 running on a V1 with low cpu temp and high idle sirq

10.3.0.14

root@Test:~# grep mwlwifi /proc/interrupts; sleep 1; grep mwlwifi /proc/interrupts
 87:   32586446          0  armada_370_xp_irq  59  mwlwifi
 88:   32560650          0  armada_370_xp_irq  60  mwlwifi
 87:   32587931          0  armada_370_xp_irq  59  mwlwifi
 88:   32562128          0  armada_370_xp_irq  60  mwlwifi

root@Test:~# uptime
 22:25:59 up 4:51, load average: 0.00, 0.01, 0.04

Cpu: 55°C Ram: 42°C Wifi: 45°C

Using modified fan script based on gufus implementation.

northbound1 commented 9 years ago

Back to 10.3.0.13. .14 is tolerable with one radio, but with both up the fan cycles. There is no real difference in iperf and no difference in USB3 transfers. I can't see the point of idling at 25% with no load. That makes no sense to me.

gufus commented 9 years ago

1900ac v1

wifi driver 10.3.0.12

Both mwlwifi 2g and 5g on CPU1

Runs like this MOST of the time.

CPU0:  0.7% usr  0.3% sys  0.0% nic 98.8% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  0.5% usr  0.9% sys  0.0% nic 34.8% idle  0.0% io  0.0% irq 63.5% sirq

Then once-in-awhile...

CPU0:  0.1% usr  0.3% sys  0.0% nic 99.4% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  0.7% usr  0.5% sys  0.0% nic 80.0% idle  0.0% io  0.0% irq 18.5% sirq

queue empty interrupt to flush AMSDU packets WORKS

I'm sure wifi driver 10.3.0.14 will be the SAME, it's the same code as wifi driver 10.3.0.12

northbound1 commented 9 years ago

@gufus have you tried .13?

gufus commented 9 years ago

Nope.

northbound1 commented 9 years ago

You should, since you also have a v1, just to see whether you notice any real difference in iperf or anything else. I see no real difference, except for the CPU doing what it should at idle, which is next to nothing.

northbound1 commented 9 years ago

This is what you should see in collectd when idle with both radios up (wrt1900ac v1, 10.3.0.13).

[image: 10.3.0.13 idle cpu]
[image: 10.3.0.13 idle sensors]

gufus commented 9 years ago

1900ac v1

Using username "root".
DD-WRT v3.0-r28112 std (c) 2015 NewMedia-NET GmbH
Release: 11/10/15
Authenticating with public key "rsa-key-20120810"

BusyBox v1.24.1 (2015-11-10 00:30:38 CET) built-in shell (ash)

root@AC-DD-WRT:~# strings /lib/ath9k/mwlwifi.ko | grep 10.3
10.3.0.14
10.3.0.14
root@AC-DD-WRT:~#

2 clients on 2.4ghz
1 client on 5ghz

CPU0:  1.5% usr  0.9% sys  0.0% nic 56.5% idle  0.0% io  0.0% irq 40.8% sirq
CPU1:  0.1% usr  0.5% sys  0.0% nic 83.2% idle  0.0% io  0.0% irq 15.9% sirq

CPU 67.4 °C / WL0 50.7 °C / WL1 52.5 °C

NO performance tweaks

gufus commented 8 years ago

1900ac v1

wifi driver 10.3.0.14

Both mwlwifi 2g and 5g on CPU1

queue empty interrupt to flush AMSDU packets WORKS

wifi driver 10.3.0.14 is the SAME

MOST of the time

CPU0:  0.9% usr  0.7% sys  0.0% nic 98.2% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  0.1% usr  0.3% sys  0.0% nic 39.7% idle  0.0% io  0.0% irq 59.6% sirq

once-in-awhile

CPU0:  0.5% usr  0.3% sys  0.0% nic 99.0% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  1.3% usr  1.3% sys  0.0% nic 78.6% idle  0.0% io  0.0% irq 18.5% sirq

northbound1 commented 8 years ago

@Calvin Finch If you get it to work on a v2, I am game to try it on a v1. It would be great to see something that works properly on both. Edit: your post is no longer here, but I got the e-mail, so I am replying here.