gnuton / asuswrt-merlin.ng

Extends the support of Merlin firmware to more ASUS routers

XT8 nodes drop (blue lights flash) #347

Closed fawazsiddiqi closed 4 months ago

fawazsiddiqi commented 1 year ago

Router Model Affected XT8

Firmware Version Affected 388.1

Is this bug present in upstream Merlin releases too? Not sure, first time flashed with Gnuton

Describe the bug

All AiMesh nodes restart after the cfg_server 3972:notify_rc restart_acsd command shows up in the logs.

From what I understand, it is changing channels automatically on the backhaul band (5 GHz-2), so all nodes get disconnected while they adjust to the new channel.

Feb 18 11:37:08 cfg_server: current chansp(unit0) is 1003
Feb 18 11:37:08 cfg_server: current chansp(unit1) is e02a
Feb 18 11:37:08 cfg_server: current chansp(unit2) is d86e
Feb 18 11:37:08 cfg_server: dump exclchans:
Feb 18 11:37:08 cfg_server: old wl0_acs_excl_chans:0x100e,0x190c
Feb 18 11:37:08 cfg_server: new wl0_acs_excl_chans:0x100e,0x190c
Feb 18 11:37:08 cfg_server: old wl1_acs_excl_chans:0xd03c,0xd83e,0xe23a
Feb 18 11:37:08 cfg_server: new wl1_acs_excl_chans:
Feb 18 11:37:08 cfg_server: old wl2_acs_excl_chans:0xd068,0xd078,0xd088,0xd966,0xd976,0xd986,0xe16a,0xe17a,0xe972,0xed72
Feb 18 11:37:08 cfg_server: new wl2_acs_excl_chans:0xd088,0xd986
Feb 18 11:37:08 cfg_server:  wl_chanspec_changed_action: Need to restart acsd for AVBL update
Feb 18 11:37:08 rc_service: cfg_server 3972:notify_rc restart_acsd
Feb 18 15:37:08 acsd: eth4: Selecting 2g band ACS policy
Feb 18 15:37:12 acsd: eth4: selected channel spec: 0x100d (13)
Feb 18 15:37:12 acsd: eth4: Adjusted channel spec: 0x100d (13)
Feb 18 15:37:12 acsd: eth4: selected channel spec: 0x100d (13)
Feb 18 15:37:12 acsd: acs_set_chspec: 0x100d (13) for reason APCS_INIT
Feb 18 15:37:12 acsd: eth5: Selecting 5g band ACS policy
Feb 18 15:37:15 acsd: eth5: selected channel spec: 0xe23a (60/80)
Feb 18 15:37:15 acsd: eth5: Adjusted channel spec: 0xe23a (60/80)
Feb 18 15:37:15 acsd: eth5: selected channel spec: 0xe23a (60/80)
Feb 18 15:37:15 acsd: acs_set_chspec: 0xe23a (60/80) for reason APCS_INIT
Feb 18 15:37:15 acsd: eth6: Selecting 5g band ACS policy
Feb 18 15:37:18 acsd: eth6: selected channel spec: 0xe07a (116/80)
Feb 18 15:37:18 acsd: eth6: Adjusted channel spec: 0xe07a (116/80)
Feb 18 15:37:18 acsd: eth6: selected channel spec: 0xe07a (116/80)
Feb 18 15:37:18 acsd: acs_set_chspec: 0xe07a (116/80) for reason APCS_INIT

To Reproduce Happens intermittently at frequent intervals

Expected behavior It shouldn't happen, unsure where it is originating from.
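
For anyone trying to read the hex chanspec values in these logs, here is a minimal Python sketch of a decoder. The bit layout (band, bandwidth, sideband, center channel) is an assumption inferred from the acsd lines above, which print both the hex value and a readable form such as 0xe23a (60/80); it is not taken from Broadcom documentation. The name decode_chanspec is made up for this example.

```python
# Assumed bit layout, inferred from the acsd log lines in this issue
# (not from vendor documentation).
BANDS = {0x0000: "2g", 0xC000: "5g"}
BAND_MASK = 0xC000
BW_MASK = 0x3800
BW_MHZ = {0x1000: 20, 0x1800: 40, 0x2000: 80, 0x2800: 160}
SB_MASK = 0x0700
SB_SHIFT = 8
CHAN_MASK = 0x00FF


def decode_chanspec(value):
    """Return (band, control_channel, width_mhz) for a 16-bit chanspec.

    Raises KeyError for a bandwidth code this sketch does not know.
    """
    band = BANDS.get(value & BAND_MASK, "unknown")
    width = BW_MHZ[value & BW_MASK]
    center = value & CHAN_MASK
    sideband = (value & SB_MASK) >> SB_SHIFT
    # The lowest 20 MHz sub-channel sits (width/20 - 1) * 2 channel numbers
    # below the center; the sideband index then counts up in steps of 4.
    control = center - (width // 20 - 1) * 2 + 4 * sideband
    return band, control, width


if __name__ == "__main__":
    for spec in (0x100D, 0xE23A, 0xE07A, 0xD886):
        print(hex(spec), decode_chanspec(spec))
```

With this layout, 0xe23a decodes to control channel 60 at 80 MHz and 0xe07a to channel 116 at 80 MHz, which agrees with what acsd prints next to them.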

fawazsiddiqi commented 1 year ago

So I just downgraded to 386.07_2-gnuton1 and things seem to be a bit more stable

Feb 20 09:10:32 cfg_server:  event: wl_chanspec_changed_action
Feb 20 09:10:33 cfg_server: current chansp(unit0) is 1007
Feb 20 09:10:33 cfg_server: current chansp(unit1) is e02a
Feb 20 09:10:33 cfg_server: current chansp(unit2) is d86e
Feb 20 09:11:03 cfg_server:  event: wl_chanspec_changed_action
Feb 20 09:11:03 cfg_server: skip event due no chanlist update

lhayati commented 1 year ago

My main router is on v388 and the nodes are on official ASUS 3.0.0.4.388.22525 and they've been rock stable. Any firmware before this recently released ASUS one was dropping every 10 mins (blue light flashing). Try this combination and see if you have any luck.

fawazsiddiqi commented 1 year ago

@lhayati I went down to 386.07_2-gnuton1 because setting it all up (I have 5 nodes) took like 3 hours of my time...

lhayati commented 1 year ago

> @lhayati I went down to 386.07_2-gnuton1 because setting it all up (I have 5 nodes) took like 3 hours of my time...

Haha, fair enough! Are your nodes on stock firmware, or did you upgrade them to gnuton too? If you ever feel like tinkering again in the future, my experience is that there's no benefit in moving the nodes away from stock (now that the latest official firmware is stable).

fawazsiddiqi commented 1 year ago

@lhayati I moved everything to Gnuton, but I have not tried this mixture 😅

So your main router is on 388 and the latest stock Asus? Did you notice the errors which I have attached in the issue above?

lhayati commented 1 year ago

> @lhayati I moved everything to Gnuton, but I have not tried this mixture 😅
>
> So your main router is on 388 and the latest stock Asus? Did you notice the errors which I have attached in the issue above?

Yes, exactly that. The issue you are facing is why I tried gnuton in the first place. ASUS stock firmware was so unstable (until last week's new firmware) that my nodes would drop every 5-10 minutes; it was unusable. I've tried every possible combination, and gnuton 388 on the router with the nodes on the latest stock firmware is the only setup that just works.

fawazsiddiqi commented 1 year ago

Well maybe I can try this, will report back if I do @lhayati

ez12a commented 1 year ago

Just wanted to chime in that I was also having issues with my Wi-Fi-connected XT8 node dropping and randomly restarting. It seemed to happen at least once a day: when I SSHed in, the uptime was never longer than a day. The flashing-blue AiMesh syncing would also take a pretty long time to complete.

I just flashed the latest stock firmware, 3.0.0.4.388.22525, onto it and it's been stable so far. The XT8 node seems to acquire/configure AiMesh much faster as well. I'm running 388.1_0-gnuton1 on the router node.

fawazsiddiqi commented 1 year ago

@lhayati @ez12a I will follow your verdict now and update my nodes 😅

Putting the router (which is XT8) on 388 and then the stock ASUS firmware on the nodes.

Let's see how this goes.

Will report back later today

fawazsiddiqi commented 1 year ago

Just got some time to set up this configuration: Gnuton 388 on the router and the latest ASUS firmware on the nodes.

All models XT8

fawazsiddiqi commented 1 year ago

Nope, nodes keep dropping....

Mar 22 14:44:53 cfg_server:  event: wl_chanspec_changed_action of eid(40) of cfgs(3502)
Mar 22 14:44:53 cfg_server: current chansp(unit0) is 1008
Mar 22 14:44:53 cfg_server: current chansp(unit1) is e02a
Mar 22 14:44:53 cfg_server: current chansp(unit2) is d86e
Mar 22 14:44:53 cfg_server: dump exclchans:
Mar 22 14:44:53 cfg_server: old wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:44:53 cfg_server: new wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:44:53 cfg_server: old wl1_acs_excl_chans:0xd034,0xd836,0xe03a
Mar 22 14:44:53 cfg_server: new wl1_acs_excl_chans:
Mar 22 14:44:53 cfg_server: old wl2_acs_excl_chans:0xd068,0xd07c,0xd080,0xd088,0xd966,0xd87e,0xd97e,0xd986,0xe16a,0xe27a,0xe37a,0xe972,0xee72,0xef72
Mar 22 14:44:53 cfg_server: new wl2_acs_excl_chans:0xd07c,0xd080,0xd87e,0xd97e,0xe27a,0xe37a,0xee72,0xef72
Mar 22 14:44:53 cfg_server:  wl_chanspec_changed_action: Need to restart acsd for AVBL update
Mar 22 14:44:53 rc_service: cfg_server 3502:notify_rc restart_acsd
Mar 22 14:44:54 acsd: eth4: Selecting 2g band ACS policy
Mar 22 14:44:58 acsd: eth4: selected channel spec: 0x100d (13)
Mar 22 14:44:58 acsd: eth4: Adjusted channel spec: 0x100d (13)
Mar 22 14:44:58 acsd: eth4: selected channel spec: 0x100d (13)
Mar 22 14:44:58 acsd: acs_set_chspec: 0x100d (13) for reason APCS_INIT
Mar 22 14:44:58 acsd: eth5: Selecting 5g band ACS policy
Mar 22 14:45:00 acsd: eth5: selected channel spec: 0xe23a (60/80)
Mar 22 14:45:00 acsd: eth5: Adjusted channel spec: 0xe23a (60/80)
Mar 22 14:45:00 acsd: eth5: selected channel spec: 0xe23a (60/80)
Mar 22 14:45:00 acsd: acs_set_chspec: 0xe23a (60/80) for reason APCS_INIT
Mar 22 14:45:00 acsd: eth6: Selecting 5g band ACS policy
Mar 22 14:45:03 acsd: eth6: selected channel spec: 0xe06a (100/80)
Mar 22 14:45:03 acsd: eth6: Adjusted channel spec: 0xe06a (100/80)
Mar 22 14:45:03 acsd: eth6: selected channel spec: 0xe06a (100/80)
Mar 22 14:45:03 acsd: acs_set_chspec: 0xe06a (100/80) for reason APCS_INIT
Mar 22 14:45:08 wlceventd: wlceventd_proc_event(632): eth6: Radar detected
Mar 22 14:45:08 cfg_server:  event: wl_chanspec_changed_action of eid(41) of cfgs(3502)
Mar 22 14:45:08 cfg_server: current chansp(unit0) is 100d
Mar 22 14:45:08 cfg_server: current chansp(unit1) is e23a
Mar 22 14:45:08 cfg_server: current chansp(unit2) is d96e
Mar 22 14:45:08 cfg_server: dump exclchans:
Mar 22 14:45:08 cfg_server: old wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:45:08 cfg_server: new wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:45:08 cfg_server: old wl1_acs_excl_chans:
Mar 22 14:45:08 cfg_server: new wl1_acs_excl_chans:
Mar 22 14:45:08 cfg_server: old wl2_acs_excl_chans:0xd07c,0xd080,0xd87e,0xd97e,0xe27a,0xe37a,0xee72,0xef72
Mar 22 14:45:08 cfg_server: new wl2_acs_excl_chans:0xd068,0xd07c,0xd080,0xd966,0xd87e,0xd97e,0xe16a,0xe27a,0xe37a,0xe972,0xee72,0xef72
Mar 22 14:45:08 cfg_server:  wl_chanspec_changed_action: Need to restart acsd for AVBL update
Mar 22 14:45:08 rc_service: cfg_server 3502:notify_rc restart_acsd
Mar 22 14:45:08 acsd: eth4: Selecting 2g band ACS policy
Mar 22 14:45:13 acsd: eth4: selected channel spec: 0x100c (12)
Mar 22 14:45:13 acsd: eth4: Adjusted channel spec: 0x100c (12)
Mar 22 14:45:13 acsd: eth4: selected channel spec: 0x100c (12)
Mar 22 14:45:13 acsd: acs_set_chspec: 0x100c (12) for reason APCS_INIT
Mar 22 14:45:13 acsd: eth5: Selecting 5g band ACS policy
Mar 22 14:45:13 acsd: eth5: selected channel spec: 0xe23a (60/80)
Mar 22 14:45:13 acsd: eth5: Adjusted channel spec: 0xe23a (60/80)
Mar 22 14:45:13 acsd: eth5: selected channel spec: 0xe23a (60/80)
Mar 22 14:45:13 acsd: acs_set_chspec: 0xe23a (60/80) for reason APCS_INIT
Mar 22 14:45:13 acsd: eth6: Selecting 5g band ACS policy
Mar 22 14:45:15 acsd: eth6: selected channel spec: 0xd886 (132l)
Mar 22 14:45:15 acsd: eth6: Adjusted channel spec: 0xd886 (132l)
Mar 22 14:45:15 acsd: eth6: selected channel spec: 0xd886 (132l)
Mar 22 14:45:15 acsd: acs_set_chspec: 0xd886 (132l) for reason APCS_INIT
Mar 22 14:45:20 wlceventd: wlceventd_proc_event(632): eth6: Radar detected
Mar 22 14:45:20 cfg_server:  event: wl_chanspec_changed_action of eid(42) of cfgs(3502)
Mar 22 14:45:20 cfg_server: current chansp(unit0) is 100c
Mar 22 14:45:20 cfg_server: current chansp(unit1) is e23a
Mar 22 14:45:20 cfg_server: current chansp(unit2) is d96e
Mar 22 14:45:20 cfg_server: dump exclchans:
Mar 22 14:45:20 cfg_server: old wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:45:20 cfg_server: new wl0_acs_excl_chans:0x100e,0x190c
Mar 22 14:45:20 cfg_server: old wl1_acs_excl_chans:
Mar 22 14:45:20 cfg_server: new wl1_acs_excl_chans:
Mar 22 14:45:20 cfg_server: old wl2_acs_excl_chans:0xd068,0xd07c,0xd080,0xd966,0xd87e,0xd97e,0xe16a,0xe27a,0xe37a,0xe972,0xee72,0xef72
Mar 22 14:45:20 cfg_server: new wl2_acs_excl_chans:0xd068,0xd07c,0xd080,0xd088,0xd966,0xd87e,0xd97e,0xd986,0xe16a,0xe27a,0xe37a,0xe972,0xee72,0xef72
Mar 22 14:45:20 cfg_server:  wl_chanspec_changed_action: Need to restart acsd for AVBL update
Mar 22 14:45:20 rc_service: cfg_server 3502:notify_rc restart_acsd
Mar 22 14:45:20 acsd: eth4: Selecting 2g band ACS policy

fawazsiddiqi commented 1 year ago

Rolled back to 386.07_2-gnuton1
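
The old/new wl2_acs_excl_chans lines in the log above are hard to compare by eye. As a rough aid, the following Python sketch pairs each "old" line with the following "new" line for the same key and reports which chanspecs were added or removed. The function name and regular expression are hypothetical helpers written for this thread and only assume the line format shown in the excerpts.

```python
import re

# Matches e.g. "cfg_server: old wl2_acs_excl_chans:0xd07c,0xd080"
EXCL_RE = re.compile(r"cfg_server: (old|new) (wl\d+_acs_excl_chans):(.*)$")


def excl_chan_changes(lines):
    """Yield (key, added, removed) for each old/new pair that differs."""
    pending = {}
    for line in lines:
        match = EXCL_RE.search(line)
        if not match:
            continue
        kind, key, raw = match.groups()
        chans = {c for c in raw.strip().split(",") if c}
        if kind == "old":
            pending[key] = chans
        elif key in pending:
            old = pending.pop(key)
            if old != chans:
                yield key, sorted(chans - old), sorted(old - chans)


if __name__ == "__main__":
    sample = [
        "Mar 22 14:45:08 cfg_server: old wl1_acs_excl_chans:",
        "Mar 22 14:45:08 cfg_server: new wl1_acs_excl_chans:",
        "Mar 22 14:45:08 cfg_server: old wl2_acs_excl_chans:0xd07c,0xd080",
        "Mar 22 14:45:08 cfg_server: new wl2_acs_excl_chans:0xd068,0xd07c,0xd080",
    ]
    for key, added, removed in excl_chan_changes(sample):
        print(key, "added:", added, "removed:", removed)
```

Feeding it a saved copy of the system log (for example via open(path) as the lines argument) lists only the events where the exclusion list really changed, which are the ones followed by a restart_acsd in the excerpts here.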

Smokey613 commented 1 year ago

Are you running 386.07_2-gnuton1 on the primary only or on both primary and the node?

fawazsiddiqi commented 1 year ago

@Smokey613 all nodes

carloss66 commented 1 year ago

I am having the same problem with 388.1 or official ASUS 3.0.0.4.388.22525. This is not a new firmware issue; I have been struggling with this problem for almost 2 years. Factory resetting the nodes is not a good workaround, especially if you have a lot of IoT devices with dedicated IP addresses, parental controls enabled, QoS, etc. If you factory reset only the slave node, it gets back from the router whatever configuration is causing the slave to randomly reboot. Factory resetting both the router and the node fixes the problem temporarily, but it comes back a few months later.

After months with no issues, the problem started again this week. As usual, factory resetting the slave did not fix the problem, and I wanted to avoid at all costs another complete factory reset and reconfiguration of the router and slave. I disabled services one at a time and then monitored the slave's behavior. Disabling Traffic Analyzer, Apps Analysis and Web History (both under QoS) did nothing to fix the problem. Disabling bandwidth-limiting QoS helped reduce the reboots, but the node is still rebooting.

d3v3l15h commented 1 year ago

I never had a stable setup using a pair of XT8s with 388-branch firmware (either the Merlin/Gnuton builds or Asus stock). I recently tried 388.2_2_0-gnuton1 and quickly went back, as I had constant 5 GHz backhaul dropouts (after a few minutes, the node only connects on 2.4 GHz with a very weak signal) and Wi-Fi instability. 386.07_2-gnuton1 is still the best XT8 candidate for me. Hopefully at some point a release will magically fix the problem, as it seems nobody really knows what goes wrong.

fawazsiddiqi commented 1 year ago

@d3v3l15h I've given up...

d3v3l15h commented 1 year ago

@fawazsiddiqi Yeah, I understand... As for me, I just reinstalled the latest official stable release, 388.2_2_0-gnuton1, on both router (XT8) and node (XT8), and after a reset to factory defaults the problem is even worse: I have frequent disconnections of a lot of clients on all Wi-Fi bands... The family is mad at me! :-( I tried a lot of different settings (now unticking 160 MHz and DFS channels for the backhaul) and have found no solution so far. Roaming some devices does not change anything for them (they still suffer frequent disconnections). Another weird thing I see is that, although I checked the box to hide the wireless backhaul SSID, it's not hidden at all... I'll continue to test, but I'll soon have to revert if I don't want to be killed by my daughters.

d3v3l15h commented 1 year ago

So... after a complete afternoon "playing" with the XT8s, I still cannot get stable Wi-Fi connections. I reset both router and node to factory defaults maybe three times and tried both Router and AP mode. I changed all Wi-Fi SSIDs and keys (as a rule, I use very long keys with lots of special characters), tested with and without "Smart Connect", etc., and the Wi-Fi instability remains. The node is lost every 10 minutes or so on 5 GHz-2, and various devices lose their connection for a few seconds on 5 GHz-1 at roughly the same rate (not sure about 2.4 GHz). Side notes:

I'll continue my tests, maybe...

d3v3l15h commented 1 year ago

Ok, I'm fed up with this. I cannot even test properly: at some point I found that my GUI setting changes were no longer being applied, and I had to restart the httpd service.

Without any doubt, the connection drops occur at this point in the log:

Aug 21 13:05:09 cfg_server: cm_updateChanspec call wl_chanspec_changed_action
Aug 21 13:05:09 cfg_server:  event: wl_chanspec_changed_action_a101 of eid(10) of cfgs(2433)
Aug 21 13:05:09 cfg_server: current chansp(unit0) is 1008
Aug 21 13:05:09 cfg_server: current chansp(unit1) is e02a
Aug 21 13:05:09 cfg_server: current chansp(unit2) is e06a
Aug 21 13:05:09 cfg_server: dump exclchans:
Aug 21 13:05:09 cfg_server: old wl0_acs_excl_chans:0x100c,0x190a,0x100d,0x190b,0x100e,0x190c
Aug 21 13:05:09 cfg_server: new wl0_acs_excl_chans:0x100c,0x190a,0x100d,0x190b,0x100e,0x190c
Aug 21 13:05:09 cfg_server: old wl1_acs_excl_chans:
Aug 21 13:05:09 cfg_server: new wl1_acs_excl_chans:
Aug 21 13:05:09 cfg_server: old wl2_acs_excl_chans:0xd07c,0xd87e,0xe27a,0xee72
Aug 21 13:05:09 cfg_server: new wl2_acs_excl_chans:
Aug 21 13:05:09 cfg_server:  wl_chanspec_changed_action: Need to restart acsd for AVBL update
Aug 21 13:05:09 rc_service: cfg_server 2433:notify_rc restart_acsd

Side note: Unticking "Auto select channel including DFS channels" is not saved when I hit "Apply" so it's quite tricky to exclude DFS from the possible causes of the problems.
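
To put numbers on how often this point in the log is reached, a small Python sketch like the one below can pull the "Radar detected" and restart_acsd lines out of a saved system log and compute the gaps between restarts. It only assumes the syslog line format seen in the excerpts in this thread; the function names are made up, and the year is a placeholder because these timestamps do not include one.

```python
import re
from datetime import datetime

# "Mar 22 14:45:08 wlceventd: wlceventd_proc_event(632): eth6: Radar detected"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+?): (.*)$")


def acsd_events(lines, year=2024):
    """Return [(timestamp, kind)] with kind 'radar' or 'restart_acsd'.

    The year is an assumed placeholder; the log lines carry none.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp, _tag, message = match.groups()
        message = message.rstrip()
        if "Radar detected" in message:
            kind = "radar"
        elif message.endswith("notify_rc restart_acsd"):
            kind = "restart_acsd"
        else:
            continue
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, kind))
    return events


def restart_gaps(events):
    """Seconds between consecutive restart_acsd events."""
    times = [when for when, kind in events if kind == "restart_acsd"]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    sample = [
        "Mar 22 14:44:53 rc_service: cfg_server 3502:notify_rc restart_acsd",
        "Mar 22 14:45:08 wlceventd: wlceventd_proc_event(632): eth6: Radar detected",
        "Mar 22 14:45:08 rc_service: cfg_server 3502:notify_rc restart_acsd",
    ]
    events = acsd_events(sample)
    print(events)
    print(restart_gaps(events))
```

Comparing the radar timestamps with the restart timestamps is one way to check whether the restarts are driven by DFS events or also happen without any radar detection.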

fawazsiddiqi commented 1 year ago

@d3v3l15h these are the exact same things I've been noticing. Usually the connection breaks when it goes searching for a new channel; setting a fixed channel doesn't help either.

This applies to both Gnuton and the official ASUS firmware. To be honest, I think the underlying issue lies with Broadcom.

d3v3l15h commented 1 year ago

@fawazsiddiqi Do we know if Asus/Broadcom is investigating this? The Internet is full of pages about these "recent" XT8 Wi-Fi problems. It's a shame because, otherwise, this mesh system is quite a network war machine with Gnuton's firmware...

fawazsiddiqi commented 1 year ago

@d3v3l15h my bet is that they are not, which is the sh*t part. I'm stuck with these as well at this point.

It's been a year since I've had these routers and still no fix.

d3v3l15h commented 1 year ago

For info, I switched the node (only) to latest ASUS firmware 3.0.0.4.388.23285 and it seems to be a little bit more stable but not more usable (drops "every" 30 minutes instead of 10 minutes...). Messages in the logs are the same.

d3v3l15h commented 1 year ago

OK, final conclusion here: there is definitely a problem in the way Asus/Broadcom has handled DFS (Dynamic Frequency Selection) on the 5 GHz-1 and 5 GHz-2 bands for the last few firmware releases, and I think the reason we are not all affected by the bug is that local regulations differ. Let me explain:

(refer to https://en.wikipedia.org/wiki/List_of_WLAN_channels or graphics found at https://wlanprofessionals.com/updated-unlicensed-spectrum-charts/ for better understanding)

If you live in the US, for instance, then as I understand it you now have 160 MHz of bandwidth available without any DFS on 5 GHz-2 (channels 149 to 177), so you can (and you MUST if you want no Wi-Fi drops) untick the "Auto select channel including DFS channels" checkbox and leave the control channel on "Auto". This way you'll have an absolutely stable 160 MHz wireless backhaul on 5 GHz-2. On 5 GHz-1, for the very same reasons, untick the "Auto select channel including DFS channels" checkbox and leave the control channel on "Auto". The AP will then stay on the non-DFS channels 36 to 48 (and not the DFS channels 52 to 64, which are also available on 5 GHz-1), and you'll have a stable 80 MHz wireless link on 5 GHz-1. The perfect "US setup".

Now, if you live in Europe, for instance (or in a country with similar 5 GHz regulations), I'll begin with the 5 GHz-1 band: the situation is the same as in the US, so the same "channels 36-48" setup SHOULD BE rock stable. On 5 GHz-2, contrary to the US, you DO NOT HAVE 160 MHz of bandwidth available without DFS. This explains why you CANNOT untick "Auto select channel including DFS channels" on 5 GHz-2 here. So your best option for the most stable wireless backhaul is to play with the channel bandwidth: 160 MHz is almost a no-go because of weather radars (channel switching may occur every ten minutes...), while 80 MHz seems OK. In that case the "Auto" control channel also appears to be OK, or use channel "100" if you want a fixed one (check the very interesting "System log\Wireless log\Display low level details" screen to see how radar detection and channel selection are linked).
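
As a rough illustration of the reasoning above, this Python sketch checks whether a channel block of a given width touches any DFS channel. The DFS set used here (channels 52 to 64 and 100 to 144) is an assumption based on the ranges quoted in this post and the linked channel list; it does not model which channels a given region or firmware actually allows, and the function names are hypothetical.

```python
# Assumption: DFS applies to channels 52-64 and 100-144. Regional
# availability (e.g. channels 149-177) is NOT modeled here.
DFS_CHANNELS = set(range(52, 65, 4)) | set(range(100, 145, 4))


def block_channels(first_channel, width_mhz):
    """20 MHz channel numbers covered by a block starting at first_channel."""
    return [first_channel + 4 * i for i in range(width_mhz // 20)]


def block_needs_dfs(first_channel, width_mhz):
    """True if any 20 MHz channel in the block is in the assumed DFS set."""
    return any(ch in DFS_CHANNELS for ch in block_channels(first_channel, width_mhz))


if __name__ == "__main__":
    for first, width in [(36, 80), (36, 160), (100, 80), (100, 160), (149, 160)]:
        print(first, width, block_channels(first, width), block_needs_dfs(first, width))
```

Under that assumption, the 80 MHz block starting at 36 and the 160 MHz block starting at 149 avoid DFS, while every 80 or 160 MHz block starting at 100 does not, which matches the US versus Europe difference described here.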

So what's wrong with recent Asus firmware? The problem is that each DFS event on the 5 GHz-2 band (which you cannot avoid, since you cannot select any non-DFS channels outside of the US) causes a real storm on all 5 GHz Wi-Fi connections, including 5 GHz-1 connections, even if you choose not to use DFS channels on 5 GHz-1 by unticking the box! A DFS event on the wireless backhaul was completely transparent to connected devices with old firmware (you didn't notice the node switching channels after a radar detection); now it has become a real nightmare.

@fawazsiddiqi I don't have time to go back to 386.07_2-gnuton1 for now, but can you please tell me your exact setup and the behavior with this firmware regarding the "Channel bandwidth", "Control channel" and "Auto select DFS channels" settings for both 5 GHz-1 and 5 GHz-2? Did you tick "160 MHz" and, if so, do you see any specific "Temporarily Out of Service" messages for some 5 GHz-2 channels in the "System log\Wireless log\Display low level details" screen? There is a countdown attached to these messages. Are the devices disconnected from 5 GHz-1 when this countdown ends?

One last thing I'm thinking about: in Europe we have the non-DFS channels 149-165 (but limited to 25 mW), which were perhaps available and used on 5 GHz-2 with old firmware and are no longer available in new firmware?

fawazsiddiqi commented 6 months ago

@d3v3l15h Hi there! So basically, I think I got it figured out somehow: I moved all my XT8s to Ethernet, so only 1 node is still on wireless at this point, and the system seems to be stable. I also moved to the stock firmware (the latest one, released in 2024: 3.0.0.4.388_24609).

And I haven't noticed any drops

d3v3l15h commented 6 months ago

@fawazsiddiqi OK, I'm on gnuton's latest release now (3004.388.5_0-gnuton1) and it seems a little bit better on 5 GHz-1, but DFS is still crap.

fawazsiddiqi commented 6 months ago

@d3v3l15h what happens if you switch off DFS? I heard it's based on regional support

d3v3l15h commented 6 months ago

@fawazsiddiqi as I wrote in my long post: "As regards 5 GHz-2, contrary to the US, you DO NOT HAVE a 160 MHz bandwidth available without DFS on 5 GHz-2. This explains why you CANNOT untick "Auto select channel including DFS channels" on 5GHz-2 here." I can stop using 160 MHz bandwidth, but then what's the point... It just shows that Wi-Fi 6, 6E (and 7) are "useless" for now in Europe.

fawazsiddiqi commented 6 months ago

@d3v3l15h I just hope they fix this 🤣