openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7615: wifi restart issue using dbdc mode #483

Status: Open (opened by porentak 3 years ago)

porentak commented 3 years ago

Hello,

I'm investigating an issue where, after the WiFi interfaces are restarted, clients cannot connect to the APs.

CPU board: UniElec 7621-06
WiFi: BPI-7615 (running in DBDC mode)
OpenWrt: cfbda6627956af0cab380d03fd9275574e67921e (1.12.2020)
MT76 driver: 066cc441eb8fcec7a3aeb6a320f5f9e6c21790f1 (21.11.2020)

The configuration is very simple: one AP on each radio, each with a unique BSSID and a unique SSID, no encryption, and fixed channels (I can attach the configuration if it helps). After a reboot, both APs (2.4 GHz and 5 GHz) work just fine: beacons are visible over the air (checked with an external sniffer), and clients can connect to both APs.

Now the fun begins. After running the /sbin/wifi script, strange things start to happen.

In some rare cases (1 in 5) everything works just fine. In the other cases, WiFi clients cannot connect to either AP. The log shows:

    Wed Dec 2 13:07:08 2020 daemon.info hostapd: wlan1: STA 60:8b:0e:08:a9:99 IEEE 802.11: authenticated
    Wed Dec 2 13:07:08 2020 daemon.info hostapd: wlan1: STA 60:8b:0e:08:a9:99 IEEE 802.11: associated (aid 1)
    Wed Dec 2 13:07:08 2020 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 60:8b:0e:08:a9:99
    Wed Dec 2 13:07:08 2020 daemon.info hostapd: wlan1: STA 60:8b:0e:08:a9:99 RADIUS: starting accounting session 7BB47469AFF75D02
    Wed Dec 2 13:07:10 2020 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 60:8b:0e:08:a9:99
    Wed Dec 2 13:07:10 2020 daemon.info hostapd: wlan1: STA 60:8b:0e:08:a9:99 IEEE 802.11: disassociated
    Wed Dec 2 13:07:11 2020 daemon.info hostapd: wlan1: STA 60:8b:0e:08:a9:99 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)

With hostapd debug logging enabled (-d):

    Wed Dec 2 13:14:39 2020 daemon.debug hostapd: wlan1: Event RX_MGMT (18) received
    Wed Dec 2 13:14:39 2020 daemon.debug hostapd: mgmt::disassoc
    Wed Dec 2 13:14:39 2020 daemon.debug hostapd: disassocation: STA=60:8b:0e:08:a9:99 reason_code=8
    Wed Dec 2 13:14:39 2020 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 60:8b:0e:08:a9:99

So it is the client that sends the disassociation frame to the AP (reason code 8: the sending station is leaving the BSS). But why?

Investigating further, it turned out that when the issue occurs, my APs transmit no beacon frames at all (normally one every 100 ms). Both APs still respond to Probe Request frames, though.

As I understand it, the MT7615 transmits beacon frames without CPU intervention: the CPU prepares the beacon frame content and hands it to the MCU, which transmits it periodically. If this is correct, my additional debugging could help solve this. I have added debug output to the mt7615_mcu_add_beacon_offload function. After a reboot I see these values (every 6 seconds):

    req: omac_idx: 0x00, enable: 0x1, wlan_idx: 0x0, band_idx: 0x1
    req: omac_idx: 0x11, enable: 0x1, wlan_idx: 0x0, band_idx: 0x0

If the system is working fine, I get the same output. If not, I get this:

    req: omac_idx: 0x00, enable: 0x1, wlan_idx: 0x0, band_idx: 0x0
    req: omac_idx: 0x11, enable: 0x1, wlan_idx: 0x0, band_idx: 0x1

Can someone help me understand what happens in the non-working case? Is there an interface bring-up order issue?

Other things I have tried already:

I'm not even sure it is an mt76 issue.

Any suggestions, ideas, directions, debugging ideas are welcome.

Hurricos commented 3 years ago

Quoting: "I have added debugs to mt7615_mcu_add_beacon_offload function"

It might help to post a patch of the debugs, or a link to your patch.

You're right, it does sound like an interface bring-up order issue. From reading your comment, booting the system can result in one of two states: omac_idx 0x00 for band_idx 0x1 (good), or omac_idx 0x11 for band_idx 0x1 (bad). When WiFi is reset later, this causes the malfunction.

Commit cd795267 suggests that the mt7615 chip shares a global pool of 32 OMACs between its two wireless PHYs. That commit increased the total number of VIFs, but it may also have changed the interface bring-up order; I'm not sure exactly how.

Try building OpenWrt with cd79526 reverted -- I think it should be

Hurricos commented 3 years ago

I don't see an obvious way to git bisect a single package from the main OpenWrt repo, since mt76 commits are squashed into package updates under package/kernel/mt76. I would still try git-bisecting the whole OpenWrt tree. You should end up at one commit that squashes a set of mt76 commits (see git log -p package/kernel/mt76). At that point you might be able to quickly script a replay of those commits: revert that commit, then git format-patch each of the listed mt76 commits and drop the resulting patches into patches-5.4/.

Then you could bisect that.

porentak commented 3 years ago

@Hurricos

I reverted commit cd79526, but it does not help; the results are the same.

porentak commented 3 years ago

To get all the bugs and bug fixes, I have migrated to the latest OpenWrt commit (20d847d1338f716fc9f143f633b6f79ba6017b5c), which uses mt76 driver 4a90fdf61.

As @Hurricos suggested, here is my debug patch:

    --- a/mt7615/mcu.c
    +++ b/mt7615/mcu.c
    @@ -725,6 +725,9 @@ mt7615_mcu_add_beacon_offload(struct mt7615_dev *dev,
             info->hw_queue |= MT_TX_HW_QUEUE_EXT_PHY;
         }
     
    +    printk("req: omac_idx: 0x%02x, enable: 0x%02x, wlan_idx: 0x%02x, band_idx: 0x%02x\n\tvif->addr: %pM\n",
    +           req.omac_idx, req.enable, req.wlan_idx, req.band_idx, vif->addr);
    +
         mt7615_mac_write_txwi(dev, (__le32 *)(req.pkt), skb, wcid, NULL, 0,
                               NULL, true);
         memcpy(req.pkt + MT_TXD_SIZE, skb->data, skb->len);

To reproduce this issue, I found an even easier procedure. At boot time both APs are enabled, and there are two scenarios depending on which interface is enabled first.

  1. First wlan0, then wlan1:

    hostapd: Configuration file: /var/run/hostapd-phy0.conf (phy wlan0) --> new PHY
    [ 60.095868] req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x00
    [ 60.095868] vif->addr: 00:11:22:01:05:bc
    hostapd: Configuration file: /var/run/hostapd-phy1.conf (phy wlan1) --> new PHY
    [ 62.146363] req: omac_idx: 0x11, enable: 0x01, wlan_idx: 0x00, band_idx: 0x01
    [ 62.146363] vif->addr: 00:11:22:01:05:bd

Then run: wifi down

At this point, if I disable radio1 in UCI (option disabled '1') and restart WiFi (wifi), I get:

    req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x00
    vif->addr: 00:11:22:01:05:bc

and WiFi on radio0 works.

If I disable radio0 instead of radio1 and restart WiFi, I get:

    req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x01
    vif->addr: 00:11:22:01:05:bd

and WiFi on radio1 does not work.

  2. First wlan1, then wlan0. In this case, the result is the opposite of the first case:

    Configuration file: /var/run/hostapd-phy1.conf (phy wlan1) --> new PHY
    [ 23.500227] req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x01
    [ 23.500227] vif->addr: 00:11:22:01:05:bd
    Configuration file: /var/run/hostapd-phy0.conf (phy wlan0) --> new PHY
    [ 24.769704] req: omac_idx: 0x11, enable: 0x01, wlan_idx: 0x00, band_idx: 0x00
    [ 24.769704] vif->addr: 00:11:22:01:05:bc

Disabling radio1:

    req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x00
    vif->addr: 00:11:22:01:05:bc

WiFi on radio0 does not work.

Disabling radio0:

    req: omac_idx: 0x00, enable: 0x01, wlan_idx: 0x00, band_idx: 0x01
    vif->addr: 00:11:22:01:05:bd

WiFi on radio1 works.

From these debugs, it looks like the issue occurs when omac_idx has changed since startup (first init). Whether this is the root cause or just a consequence, I don't know.

Thanks for any additional tips/directions.

ryderlee1110 commented 3 years ago

My first guess is that we have to keep a main radio alive (the first wlan in your case). I can't remember the design details, so I'm not sure yet.

porentak commented 3 years ago

@ryderlee1110 If it helps...

I updated to the latest mt76 driver (abdd471e9f2d5c2287c095df58f32432dc0ceb00, Jan 5, 2021).

I booted the router with radio0 and one wifi-iface on it enabled, while radio1 and its wifi-iface were disabled. radio0 and its AP work fine. Then I ran wifi down, updated the configuration to disable radio0 and enable radio1, and ran wifi. The AP on radio1 does not work.

Trying to be helpful, though probably heading in the wrong direction, I tried to fix omac_idx per band:

ryderlee1110 commented 3 years ago

I meant that the firmware seems to have a strict order. Alternatively, you could try the in-house driver (if you can) to double-check it.

porentak commented 3 years ago

I don't think this is true.

If I enable only radio1 and reboot the router, it works just fine.

ryderlee1110 commented 3 years ago

What I'm saying is that it depends on the first interface you enable, regardless of radio, so radio1 would be the first interface after that reboot, right? I just suspect that the firmware treats the first wlan set up by the driver as the main radio.

porentak commented 3 years ago

I tried the in-house driver, and it works just fine. I tried enabling/disabling interfaces/radios in multiple orders.

While doing that, I think I found the difference between the two drivers: in the in-house driver, once both interfaces have been disabled and a new interface is then enabled, the MT7615 chip is reinitialized (firmware reload, ...).

porentak commented 3 years ago

@ryderlee1110 did you manage to find some time to dig into this issue?

ryderlee1110 commented 3 years ago

There's a strict interface order for mt7615. I don't think this is an issue.

ryderlee1110 commented 3 years ago

Hi, I recently tested DBDC on mt7915d. Can you check whether mt7615d works with "wifi reload", and whether ieee80211_start_ap() -> mt7915_bss_info_changed() are called?

porentak commented 3 years ago

@ryderlee1110 If this is a general question about mt7615d, then yes, it works: mt7615_bss_info_changed is called after "wifi reload" if any parameter in /etc/config/wireless has changed.

If the question is about this issue, then no: the result is the same as with "wifi down; wifi".

Tested with:
mt76: 8696919d9aae9b673f916bca41c5e1671eec5b0e (2021-01-27)
openwrt: 740af59b9c7ee879b6936dd03bf37d37a54dda47 (2021-02-02)

The step I used to reproduce this issue with "wifi reload":

    config wifi-iface 'ap_radio1'
        ...
        option disabled '1'

porentak commented 3 years ago

I found one way to work around this limitation. In the wireless configuration, add the following under both wifi-device sections:

    option serialize '1'

This instructs netifd to configure the wireless devices' interfaces one by one, so the interfaces are always set up in the same order.

With this trick, the wireless configuration (e.g. SSID, keys, ...) can be changed reliably at runtime.

ryderlee1110 commented 3 years ago

Great. Do you think we can close this ticket?

MeIsReallyBa commented 3 years ago

Wireless scan is broken. The router only finds 5G signals when scanning on the 2.4G interface, and scanning on the 5G interface returns no results. Did you run into this problem?

Azq2 commented 3 years ago

I faced the same problem.

After some debugging, I found that the reason is that iwinfo returns phy1 for radio0 and does not find radio1 at all.

After studying the iwconfig code a little, I realized that specifying phy instead of path should make it work.

For example:

config wifi-device 'radio0'
    option type 'mac80211'
    option phy 'phy0'
    option htmode 'HT40'
    option serialize '1'
    option country 'US'
    option cell_density '0'
    option hwmode '11g'
    option channel '1'

config wifi-device 'radio1'
    option type 'mac80211'
    option phy 'phy1'
    option serialize '1'
    option country 'US'
    option cell_density '0'
    option hwmode '11a'
    option htmode 'VHT80'
    option channel '36'
    option txpower '20'

Scanning and editing all settings in LuCI works fine.

// I'm using an OpenWrt snapshot from master

kar200 commented 3 years ago

Here to confirm that @Azq2's solution works for me (DIR-853-A3/MT7615DN with DBDC). Setting the three options (phy, serialize and the country code) fixes the issue of freezing after reboot with both cards enabled: https://github.com/openwrt/mt76/issues/448

MeIsReallyBa commented 3 years ago

https://github.com/MeIsReallyBa/openwrt/commit/a0ffe441f69b056ecf304d7b3b9c0b5311c03ba1

I edited mac80211.sh, and that also seems to work correctly.

kar200 commented 3 years ago

EDIT: OK never mind it's working for me now