openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688
744 stars 343 forks source link

Huge ping value increase after running 5g wifi for some time #152

Closed updatede closed 6 years ago

updatede commented 6 years ago

device: xiaomi router 3g target: ramips/mt7621 system: OpenWrt SNAPSHOT r6150-dc7a1e8 signal strength: -60dbm I use just only one client to connect router: a ipad air2. When wlan just start: img_1880 After play game for some time, network get stucked, at the same place, close all programs on pad, ping value increase huge, won't decrease untill restart wlan on router, after restart ping value decrease to normal. img_1879

iancd commented 6 years ago

This should be expected behaviour from iOS. It would not be possible for an iPad to achieve 10 hours of battery life if it behaved like a desktop computer.

When the device is idle the wifi hardware goes into a powersave mode because it isn't critical to get data quickly. When network data is needed the wifi hardware is taken out of power save mode until network traffic is reduced to an idle state again so that power save mode is kicked back in. Source link

Maybe streaming audio from the internet in the background of your game will prevent the iPad powersaving.

updatede commented 6 years ago

no, the device is not idle and hardware does not go to powersave mode. because it's never happen when using wndr3800 with this same ipad doing same thing at same place. and this "idle" can't wake up whatever you do to ipad, unless you do restart on mt76 router. so it's mt76 driver not ios.

mastum commented 6 years ago

@updatede I can't reproduce it with my iPad, what is for you "for some time" ? Mine is connected (5 GHz) from 5 days without problems. Xiaomi 3G OpenWrt SNAPSHOT r6022

updatede commented 6 years ago

@MaFeSa for some time is a duration of time , not fixed, 3mins,5mins,10mins or just connected.

updatede commented 6 years ago

@MaFeSa what's your ping value from route to client ?

mastum commented 6 years ago

192.168.0.98 is your router IP? I will test r6150

updatede commented 6 years ago

@MaFeSa I don't know what's your client ip. you need ping from the router to client. 192.168.0.89 is my ipad's ip.

mastum commented 6 years ago

OK I understand, I was doing the opposite from iPad to router with this app.

mastum commented 6 years ago

Tested now OpenWrt SNAPSHOT r6152-f4e5880 and ping router -> iPad 5GHz is constant with ~50 ms

updatede commented 6 years ago

@nbd168 because lack of debug info, I want to add some debug info in my copy of source code to find the reason why ping value get so high to 1000 or 2000 ms, any advice for places to add ?

barhom commented 6 years ago

I can say that I see something similar on both mt7603 and mt7612. After running a while ping increases a lot. I still don't know the cause.

updatede commented 6 years ago

Still unfixed on latest snapshot r6302-0f54d96, 7% packet loss on this test

img_1884

Mushoz commented 6 years ago

Seeing similar behavior on my Dir-860L. 1000-2000ms pings, high packet loss and very slow speeds. Only rebooting the router solves the issue for me. Lede 17.01.4 is completely fine.

nbd168 commented 6 years ago

Please try to build the latest version from my staging tree at https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary

updatede commented 6 years ago

Because I rely on snapshot sdk to build, and this tree switch to kernel 4.14 before mac80211 patch, I don't know how to build it, I will test as soon as it commit to master tree. Thank very much @nbd168

moonwatcher commented 6 years ago

nbd, I have been experimenting with mir3g and the new 4.14 kernel. i have been seeing similar behavior. Would love to test patches. I am currently running 6c194078db6958e349114c8605304a7b87edae3e on 4.09, but I can build 4.14 with patches tomorrow if you want me to try stuff (I have 6 of those units...).

How do I build your staging, is that the entire build system or a package repo?

updatede commented 6 years ago

@nbd168 I have modified function ieee80211_amsdu_aggregate according to your patch, and compile mac80211.ko on my old snapshot sdk with 4.9.77 kernel. I replace mac80211.ko on my router which run a 4.9.86 kernel snapshot, after restart, it always popup error of wifi password not correct. I'll try with same snapshot version system and sdk later.

Mushoz commented 6 years ago

@nbd168 I am leaving for my holiday tomorrow. I will be back the 19th of March. If you still need testing to be done I will help after that. I'm sorry that I can't help at the moment.

updatede commented 6 years ago

@nbd168, I manually patched tx.c in mac80211 directory of backports-2017-11-01, and rebuild mac80211.ko file with latest snapshot sdk. After testing, the same problem still exist. But it take about 20mins to trigger, it usually take 3-5mins to trigger before this patch. There are other similar problems, move to weak signal place, client will stuck in high ping value and packet loss, move back to strong signal place won't get out of stuck. Turn off wifi or screen on client, wait about 5mins, turn on wifi or screen again, client will stuck in high ping value and packet loss either. The common problem of all above is, once high ping and packet loss appears, you can't let it disappear by any means unless restart wifi on router.

nbd168 commented 6 years ago

Any chance you could give me SSH access to your device while it's in the faulty state? Maybe I can find out a bit more about this bug that way.

updatede commented 6 years ago

@nbd168 of course, but I need private way to tell you how. is there private message at github ? it is 22:28 at my time, if you can do it tonight, I’ll tell you privately as soon as stuck appear.

nbd168 commented 6 years ago

You could send me an email: nbd@nbd.name

updatede commented 6 years ago

ok , wait my email

updatede commented 6 years ago

have you receive my email ? I have not receive any response.

nbd168 commented 6 years ago

I'm logged in and running some tests already. Thanks!

Mushoz commented 6 years ago

I am back from my skiing holiday. Do you still need testers, @nbd168 ?

updatede commented 6 years ago

@nbd168 The latest r6653-c0bbb97 still has this bug unfixed after issue opened for about two months. After analysing captured 80211 packets, I found that while my IPAD in normal state, AP often received sleep frame from my IPAD, after AP send TIM+ (TIM indicating buffered frame for IPAD) in beacon, AP received wakeup frame from my IPAD immediately. But when my IPAD enter wifi doze state(turn off screen for about 10 mins), then I turn on screen and ping my IPAD, AP can't receive my IPAD's wakeup frame right after it sent TIM+ , after sending many many TIM+, AP received wakeup frame from IPAD. No matter what you do (even reboot IPAD) , this state will not change unless restart wifi on AP. When repeat this doze then turn on process to my wndr3800 with openwrt 15.05.1, it can wake up correctly just like normal state. I don't know why. Maybe wifi doze STA let AP enter a false state and possibly:

  1. make AP can't transmit TIM+ to STA.( I can just see TIM+ in monitor interface by tcpdump, maybe these frames not really sent to air or something).
  2. make AP can't receive wakeup frame from STA.
  3. AP notify STA stay in wifi doze state forever (weird,because I can't get any ping response in doze state, but I can get ping response when turn on screen from doze state, just massive latency, don't know if there is mechanism in 80211).
Mushoz commented 6 years ago

Thank you very much for digging into this! @nbd if you need testers, please let me know.

Mushoz commented 6 years ago

I meant to tag @nbd168 :) sorry about that

nbd168 commented 6 years ago

@updatede can you make a packet capture of the exact behavior that you're describing? It would probably help with tracking down the source of that bug.

updatede commented 6 years ago

@nbd168 In fact, last time you should already get the false state packets captured on my router. You said in email that ipad takes very long to start receiving data, that's because AP receive wakeup null frame after many many TIM+ instead of one TIM+. So, If you want me capture again, what time you want me capture ? Both ping before turning off screen and after turning on screen, or just ping with massive latency after turning on screen just like last time you did ? or the whole 10 minutes process from turning off to turning on ? I use "tcpdump -i mon0 -w wifi.pcap" on AP to capture, although I can see echo replies from STA to AP on capture file, at least ping echo from AP to STA can't be captured on mon0, I don't know if any other packets missed on mon0. I don't know why this happen, maybe it's how mac80211 works. Another new discovery is that you don't need "wifi down" and "wifi" to get out false state and remove latency, you just type "ifconfig wlan1 down" and "ifconfig wlan1 up", then everything back to normal. Maybe this is a clue . Can I send capture file to your email address ? I have sent two emails to that email address a few weeks ago and get no reply.

updatede commented 6 years ago

@nbd168 Is it possible for some reason, AP defer the RX or process of packets from deep sleep STA, wakeup null frame from a deep sleep STA can't be processed on time, so AP keep TX TIM+, untill it process wakeup frame. This defering process should be stopped when a deep sleep STA is waking up or after first wakeup frame processed, so AP can TX only one TIM+ before next wakeup frame, the problem is it never stop defering. Or, deep sleep STA trigger unexpectable RX defering.

updatede commented 6 years ago

@nbd168 The same problem occur when you move to weak signal place, after you moving back to AP nearby, AP can't RX from STA as normal, AP still RX as if signal is weak, just like RX as if STA is deep sleeping when STA has waked up.

nbd168 commented 6 years ago

In the capture from the router I've seen the null data wakeup packet from the client much later than the TIM indicated client wakeup. What I didn't see is any indication that this was caused by delayed rx processing, so I wanted to see packet captures that prove that. How did you come to the conclusion that the AP is delaying rx processing?

updatede commented 6 years ago

@nbd168 Where did you find my conclusion? I just ask a question, "Is it possible" is a question sentence in english. I just point out one possibility, maybe other reasons, like weak signal , even the real reason is delayed rx processing, it can't be proved by packet captures, it can only be identified in source code. Delayed wakeup frame in packet captures is only appearance , I don't know the reason, it need to be figured out by you or someone specialized in wifi.

nbd168 commented 6 years ago

I was referring to this:

But when my IPAD enter wifi doze state(turn off screen for about 10 mins), then I turn on screen and ping my IPAD, AP can't receive my IPAD's wakeup frame right after it sent TIM+ , after sending many many TIM+, AP received wakeup frame from IPAD.

That's the reason why I asked for a capture file, since I was wondering if the iPad is really sending wakeup frames or just waking up much later.

updatede commented 6 years ago

@nbd168 Even iPad is really sending wakeup frames , there are air and many other layers between iPad and capture interface. So, whether iPad is really sending wakeup frames or just waking up much later can't be find out through packet capture on problem AP, it can only be proved by packet capture on iPad. Since it can't be done directly, it can be done indirectly by turning off screen, 10 mins later turning on screen, and capture packets on another AP. Through capturing on my wndr3800 running openwrt 15.05.1, proved iPad can sending wakeup frame right after first TIM+ when turn on screen after 10mins doze. But this only proved problem not on iPad, still not know how this happen, problem AP drop wakeup frames ? or problem AP let iPad send wakeup frames much later by very weak signal? This can only be find out by guessing, modify source code and test. I'm sure that problem is on AP. If problem is on iPad, latency will disappear eventually after some first packets and this will happen with other APs either. If problem is on iPad, turn off wifi on iPad or reboot iPad will reset everything and the latency will disappear. But with problem AP, when turning on screen after 10 mins doze, latency will exist forever, switch off and on wifi on iPad, reboot iPad, all does not work. Only reset wifi on AP will work.

nbd168 commented 6 years ago

Please try the latest version

updatede commented 6 years ago

@nbd168 It improves since commit 7a31731. Turn off screen for 10 mins seems have no latency any more. But for 15~20 mins, the same problem is back. I checked ath9k code , when sta is about sleeping, ath9k call ieee80211_sta_set_buffered with "true", and it has a ath_tx_aggr_sleep function that call ieee80211_sta_set_buffered either, after releasing buffer, it call ieee80211_sta_set_buffered with "false". But in mt76 code , there is only place call ieee80211_sta_set_buffered with "false", no "true" call by checking sleeping, is it the reason of issue ?

updatede commented 6 years ago

"If the driver buffers frames in the driver for aggregation in any way, it must use the ieee80211_sta_set_buffered() call when it is notified of the station going to sleep to inform mac80211 of any TIDs that have frames buffered." from Linux 802.11 Driver Developer’s Guide. Maybe need some aggregation-sleeping logic.

updatede commented 6 years ago

@nbd168 I add some printk() in mt76_txq_schedule_list(), the doze time need to trigger bug increase again. So, maybe it's newly added code increase the execution time of mt76_txq_schedule_list(), and the trigger of the bug need to call mt76_txq_schedule_list() a number of times.

updatede commented 6 years ago

@nbd168 I reverted mt76 driver to first commit 9e7049b "Initial import" and the same problem still exist. I made some modification to fit latest mac80211/cfg80211 because I couldn't get old mac80211 source code. After that I wonder if mir3g's hardware is not compatible with ios or ipad air2 hardware, so I restore mir3g to stock firmware, the issue disppeared. So I think bug is in initial driver code, or mac80211 code before/after "Initial import".

updatede commented 6 years ago

@nbd168 I disable 80211n/ac by setting mode to legacy to see if issue still exist. From captures I can see ping frame from AP to STA now because of no aggregation. When bug triggered I saw once a ping frame sent to STA before STA sent wakup null frame to AP. If it's not a monitor interface issue, I think maybe it's the reason of this bug. I've sent packet capture to your email address.

updatede commented 6 years ago

@nbd168 I have some new findings about this bug. I find new ways to reproduce the symptom:

  1. Switch off wifi on iPad air2 , restart wifi on AP, don't connect any STA to AP, let AP run wifi for 15~20 mins, then connect iPad air2 to AP, the symptom appears.
  2. Restart wifi on AP, connect iPad air2 to AP, keep pinging from AP to iPad air2 for 15-20 mins, the symptom appears. So, even we can see later wakeup null frame, but this bug is not about sleep, not about sleep time, it's only about running time of AP after restarting wifi. Old version ipad that only support 80211N seems not affected , not very sure. Android client that support 80211AC also affected, ping echo time after triggered is 500-600ms, it's less than iPad's but significantly larger than normal.
updatede commented 6 years ago

@nbd168 I don't know if you test after my last comment. But I found that changing "beacon_int" can change bug trigger time. For example, change "beacon_int" to 50 will trigger this bug 10 mins after restarting wifi, 100 is 20 mins, 200 is 40 mins, and so on. So I think this bug is in beacon or packet sending related code.

updatede commented 6 years ago

this project does not want to fix this bug and i switch to stock driver to abandon this driver.

nbd168 commented 6 years ago

I do want to fix this bug, I just don't have much time on my hands at the moment.

updatede commented 6 years ago

no, thanks

jsantala commented 6 years ago

Just keep the ticket open, please. There's other people who want to see this fixed. Thanks!

Mushoz commented 6 years ago

Agreed. Please leave this ticket open. There is valuable information in here and there are more people who would like to see this fixed.