openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7615e: frame drop related to adhoc interface #661

Open yogo1212 opened 2 years ago

yogo1212 commented 2 years ago

Brief description

I have a permanent mesh setup with two Cudy WR2100 routers (mt7621). My image was built from openwrt trunk (42626aef).

For longer periods of time (hours or days), the mesh connection is either broken or working. Most of the time, when both routers are rebooted or updated, they come back with the connection up. "Connection up" means that batman forwards traffic in both directions. When the connection doesn't come up, it might start working after a day or two.

Investigation so far

The IBSS connection stays up while the connection is down (batman-wise). Station dump yields sane data. No failed transmissions on either side. Ping works (pinging the IPv6 link-local addresses). Most of the time when the connection is down, one router doesn't list the other in batctl o (when it works, all originators are present on both sides).

One shows this:

[root@997 ~] $ batctl o
[B.A.T.M.A.N. adv 2022.0-openwrt-2, MainIF/MAC: mesh24_0/b4:4b:d6:26:ae:0c (bat0/62:ae:56:15:62:a1 BATMAN_IV)]
   Originator        last-seen (#/255) Nexthop           [outgoingIF]

The other one this:

[root@999 ~] $ batctl o
[B.A.T.M.A.N. adv 2022.0-openwrt-2, MainIF/MAC: mesh24_0/b4:4b:d6:26:ae:08 (bat0/26:40:e1:9d:90:e6 BATMAN_IV)]
   Originator        last-seen (#/255) Nexthop           [outgoingIF]
 * b4:4b:d6:26:ae:0c    2.510s   ( 19) b4:4b:d6:26:ae:0c [  mesh24_0]
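A minimal sketch for spotting the one-sided state from a script, assuming the `batctl o` output format shown above, where each selected route is a line starting with ` * `. The function name `count_originators` is hypothetical.

```shell
# Hypothetical helper: count originator entries in `batctl o` output read
# from stdin. Selected routes are the lines whose first field is "*".
count_originators() {
    awk '$1 == "*" { n++ } END { print n + 0 }'
}
```

Usage on a router would be `batctl o | count_originators`. With the two outputs above it prints 0 on the first router and 1 on the second.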

I added a monitoring interface using iw and created a dump while the connection was down. The OGM frames from batman were seen going out by the sender but they were not received by the other router. Again, other frames made it through. Only the OGM frames were missing - pings, TVLV made it through. The sender was happily receiving OGM frames from the other end. For that reason I think this is an issue with the driver rather than with batman.
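For anyone who wants to repeat the capture: the exact commands used are not recorded here, so the following is only a sketch of one common way to do it. The names `phy0`, `mon0` and `/tmp/mesh.pcap` are assumptions, as is `tcpdump` being installed. The hypothetical helper `capture_mesh` only prints the commands unless `RUN` is set to an empty string.

```shell
# Hypothetical helper: add a monitor interface and capture on it.
# Dry run by default: each command is echoed instead of executed.
# Set RUN="" to actually run the commands (needs root on the router).
capture_mesh() {
    phy=${1:-phy0}            # assumed phy name
    mon=${2:-mon0}            # assumed monitor interface name
    out=${3:-/tmp/mesh.pcap}  # assumed output path (/tmp is lost on reboot)
    run=${RUN-echo}

    $run iw phy "$phy" interface add "$mon" type monitor
    $run ip link set "$mon" up
    $run tcpdump -i "$mon" -w "$out"
}
```

Calling `capture_mesh` shows what would run; `RUN="" capture_mesh phy0 mon0 /tmp/mesh.pcap` executes it.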

I used this filter: !(batadv.batman.packet_type == 1 || batadv.batman.packet_type == 66) (to make it easier to illustrate that only OGM frames are affected). The openwrt forum doesn't accept pcap uploads and I lost the files from when I made the post there to a reboot (clearing /tmp). Please ping me if I should create another set of pcaps (EDIT 27.06.22: I can no longer. I mean.. I could.. but it would show only outgoing frames).

The wireless config is absolutely identical.

wifi config

```
config wifi-device 'radio24_0'
	option type 'mac80211'
	option path '1e140000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option band '2g'
	option country 'DE'
	option phyref 'phy0'
	option channel '6'
	option htmode 'HT40'
	option airtime_mode '2'

config wifi-iface 'w24_0_13ae'
	option mode 'ap'
	option ifname 'nw24_0-13ae'
	option device 'radio24_0'
	option network 'n_1'
	option encryption 'psk2'
	option isolate '0'
	option hidden '0'
	option key 'REDACTED'
	option ieee80211r '0'
	option ieee80211w '1'
	option ssid 'wir haben kein'\'' schnee lan'
	option airtime_bss_weight '100'

config wifi-iface 'mesh24_0'
	option ifname 'mesh24_0'
	option network 'mesh24_0'
	option encryption 'psk2+ccmp'
	option mesh_id 'mesh ssid'
	option bssid '02:BA:CC:22:13:37'
	option key 'REDACTED'
	option mode 'mesh'
	option mesh_fwding '0'
	option mesh_ttl '1'
	option airtime_bss_weight '60000'
	option device 'radio24_0'
```

The issue is reproducible with another pair of routers in a different network (different country, even).

howl commented 2 years ago

Perhaps it is related to #518.

yogo1212 commented 2 years ago

@howl interesting! quite the read.. i've created an access point ssid and connected my phone to it to see whether i could reproduce @ptpt52 's issue with the queues.

simultaneous speedtests, random disconnect of clients - but so far, everything appears to be normal.

i'll try with multiple mesh peers next

yogo1212 commented 2 years ago

the issue appeared again - this time with a mesh point network.

this is how queues look on both ends:

$ for f in /sys/kernel/debug/ieee80211/phy0/mt76/*queues; do echo "${f##*/}:"; cat "$f"; done
rx-queues:
     queue | hw-queued |      head |      tail |
         0 |       127 |        22 |        23 |
         1 |        63 |        56 |        57 |
xmit-queues:
     queue | hw-queued |      head |      tail |
         0 |         0 |       148 |       148 |
         1 |         0 |         2 |         2 |
         2 |         0 |        87 |        87 |
         3 |         0 |         0 |         0 |
         4 |         0 |         0 |         0 |
         5 |         0 |        28 |        28 |
         6 |         0 |         7 |         7 |
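A rough way to check such a dump automatically, assuming (this is an interpretation, not something the driver documents here) that an idle transmit queue shows hw-queued 0 and head equal to tail, as every xmit queue above does. The function name `pending_xmit_queues` is hypothetical, and it expects the section labels printed by the loop above.

```shell
# Hypothetical helper: read the queue dump from stdin and print the index
# of every xmit queue that still seems to hold frames
# (hw-queued != 0 or head != tail). rx queues are ignored.
pending_xmit_queues() {
    awk -F'|' '
        /queues:$/ { section = $0; next }   # "rx-queues:" or "xmit-queues:"
        section == "xmit-queues:" && $1 ~ /^ *[0-9]+ *$/ {
            if ($2 + 0 != 0 || $3 + 0 != $4 + 0) print $1 + 0
        }
    '
}
```

Piping the loop's output into `pending_xmit_queues` prints nothing for the dump above, which matches the observation that the queues look normal.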

i've noticed another strange thing: when in the error state, scanlist libiwinfo doesn't scan :thinking: iw->scanlist("phy0", buf, &len) returns != 0. (EDIT: me dumb. wrong router)

yogo1212 commented 2 years ago

ok. the problem definitely has to do with an adhoc interface existing. removing the adhoc interface resolved the issue with the mesh point network as well. also, iwinfo scans again.

this feels a lot like an issue i had with ipq40xx a while ago where traffic just halted when there was an access point and a mesh point on one radio at the same time. only here, it works with mesh point and it doesn't with adhoc.

i think i'll just leave it at "don't use adhoc with mt7621".

howl commented 2 years ago

@yogo1212 I think you should reopen it and change the issue title to reflect your research.

yogo1212 commented 2 years ago

@howl sorry, i was severely behind on my slacking schedule.

quick summary off the top of my head:

what makes this so spicy is that the issue isn't persistent. it can work for a few days and then just stop seemingly at random for another few days before it resumes.