libremesh / lime-packages

LibreMesh packages configuring OpenWrt for wireless mesh networking
https://libremesh.org/
GNU Affero General Public License v3.0

MESH-SAE-AUTH-FAILURE #837

Open rallep71 opened 3 years ago

rallep71 commented 3 years ago

I have three routers in my Lime mesh network, TP WDR 4300, TP Archer C50 V3 and V4.

All built the firmware according to the instructions. What is the error?

Thu Dec 24 11:54:04 2020 daemon.notice wpa_supplicant[2606]: wlan0-mesh: MESH-SAE-AUTH-FAILURE addr=b0:4e:26:45:63:ac
Thu Dec 24 11:54:23 2020 daemon.notice wpa_supplicant[2606]: wlan0-mesh: MESH-SAE-AUTH-FAILURE addr=b0:4e:26:45:63:ac
Thu Dec 24 11:54:38 2020 daemon.notice wpa_supplicant[2606]: wlan0-mesh: MESH-SAE-AUTH-FAILURE addr=b0:4e:26:45:63:ac
Thu Dec 24 11:54:38 2020 daemon.notice wpa_supplicant[2606]: wlan0-mesh: MESH-SAE-AUTH-BLOCKED addr=b0:4e:26:45:63:ac duration=300

ilario commented 3 years ago

Can you post more details? Like the /etc/config/lime-autogen content of all the three routers or the output of lime-report command from all the routers?

dangowrt commented 3 years ago

It'd also be very interesting to know which version of OpenWrt and which build variant of wpad-mesh you are using. wpad-mesh-wolfssl only works well on OpenWrt 19.07.5 and later (due to a performance bug in hostapd's usage of the WolfSSL API which leads to SAE failing due to timeout). I.e., what you are experiencing is symptomatically what we saw before https://github.com/openwrt/openwrt/commit/d8d1956a8087da2fd4465c4381d9e28b91cdc1e8

rallep71 commented 3 years ago

Hello everyone, I am posting the three reports:
https://drive.google.com/file/d/1fpvugPnB9jzcZVs5RSDhHqL-pOd0Eczq/view?usp=sharing
https://drive.google.com/file/d/1KF2GJId3gZeFXmjwmJHUO25SO9AJTM__/view?usp=sharing
https://drive.google.com/file/d/1lhEaMB6CUhxbbpzXMS17HvTtcWBTps2G/view?usp=sharing

there is also no wpad-mesh-wolfssl installed, there is wpad-mesh-openssl installed. Because when I select the profiles lime default and lime encrypt, wpad-mesh-openssl is automatically selected and I cannot change it to wpad-mesh-wolfssl.

rallep71 commented 3 years ago

@dangowrt hostapd is not installed; what is installed is hostapd-common 2019-08-08-ca8c2bd2-4. The firmware was built following https://libremesh.org/development.html from git clone -b v19.07.5 --single-branch https://git.openwrt.org/openwrt/openwrt.git

@ilario I have modified the network profile and the make settings from -wpad-mesh-openssl to wpad-mesh-wolfssl. I am now compiling new images for the three routers and will see what happens ;)

So, new images are ready and installed. Here are the reports of the three routers; MESH-SAE-AUTH-FAILURE is still there.
https://drive.google.com/file/d/17AANtEg7LnNxE_VsBgeBitwiqJ6XLTj0/view?usp=sharing
https://drive.google.com/file/d/1YJj-oPIJQqVUJ4mWPQhJXVW3yCfFFJz-/view?usp=sharing
https://drive.google.com/file/d/1O8zk_V7bgqijrgbuff-1ODYmknU8oc4r/view?usp=sharing

rallep71 commented 3 years ago

I have only had two routers in operation for about 20 hours now, TP Archer C50 v3 and v4, the WDR4300 router is switched off. I no longer have any messages (MESH-SAE-AUTH-FAILURE) in the C50 v3 and v4. very confusing.....

And now I started the WDR4300; this is the log on the root node:

Tue Dec 29 16:15:35 2020 daemon.notice wpa_supplicant[2492]: wlan1-mesh: new peer notification for 64:70:02:a2:fd:24
Tue Dec 29 16:15:53 2020 daemon.notice wpa_supplicant[2492]: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Tue Dec 29 16:15:55 2020 daemon.notice wpa_supplicant[2492]: wlan1-mesh: MESH-SAE-AUTH-FAILURE addr=64:70:02:a2:fd:24

ilario commented 3 years ago

there is also no wpad-mesh-wolfssl installed, there is wpad-mesh-openssl installed. Because when I select the profiles lime default and lime encrypt, wpad-mesh-openssl is automatically selected and I cannot change it to wpad-mesh-wolfssl.

Can you confirm that you now have wpad-mesh-wolfssl on the three routers? Try not selecting any network-profiles, as in your case they should not be needed (they are usually for communities willing to simplify the configuration process and selection of packages).

Are you sure that using SAE is a good idea? I don't have a clue about this, but if it creates problems we cannot solve (I surely cannot, maybe @dangowrt or @aparcar?) you could stick to psk2/aes, as suggested in the lime-example.

rallep71 commented 3 years ago

Hello Ilario, yes, I have wpad-mesh-wolfssl on all three routers. I have adjusted the profile.index and profile.mk locally so that wpad-mesh-wolfssl is automatically selected. I thought we wrote not too long ago about wolfssl being smaller in resource consumption than openssl. I don't know if SAE is a good idea, but I've seen it in the profiles of Freifunk, who also use it. I will now test everything again with the Lime sample, i.e. openssl, psk2/aes. Let's see if I still get the error.

ilario commented 3 years ago

Hello Ilario I will now test everything again with the Lime sample, i.e. openssl, psk2/aes. Let's see if I still get the error.

Please try simply keeping the image you have with wolfssl and just editing the /etc/config/lime-node to indicate psk2/aes

rallep71 commented 3 years ago

Hello Ilario, I have now tested this again with psk2 aes and wolfssl, with wdr4300 the error comes back....Good, the three nodes communicate with each other, I can move with the smartphone in the nodes without crashes, switching between the nodes is fast.

I will change the wdr4300 router again and replace it with an archer c50v3, let's see what happens then

rallep71 commented 3 years ago

Hello, problem solved. mesh compiled with psk2 + aes +openssl openwrt 19.07.6

Catfriend1 commented 3 years ago

I'm currently experiencing exactly the same MESH-SAE-AUTH-FAILURE and then MESH-SAE-AUTH-BLOCKED using wpad-mesh-wolfssl on OpenWrt 21.02.0-rc.1. But it only seems to be an issue if one mesh partner reboots (we have a maintenance reboot at night once a week): the MESH-SAE-AUTH-FAILURE messages then come up, while it worked perfectly before that reboot event.

ilario commented 3 years ago

@rallep71 did you finally use WolfSSL or OpenSSL? @Catfriend1 are you using also LibreMesh or directly OpenWrt? Can you try with the openssl version of wpad? Thanks!

Catfriend1 commented 3 years ago

@ilario Using Openwrt directly here. I can test it, will take time until I get to observe a week.

ilario commented 3 years ago

Anyway I'm not sure that in the LibreMesh community there are many people actually encrypting 802.11s. Did you ask also in OpenWrt forums?

ilario commented 3 years ago

In the Element chatroom (see here for the direction) @egon0 mentioned that the solution was to switch to OpenSSL.

dangowrt commented 3 years ago

Generally this seems to be performance/timing related. The bug in hostapd/wpa_supplicant/wpad which caused those symptoms when using WolfSSL previously was to use a too costly function to generate random numbers (generating random prime numbers instead of just arbitrary random numbers). Once this had been fixed, things cleared up here and from what I can tell, the bug is gone now, running OpenWrt 19.07.6 seems fairly stable with wpad-mesh-wolfssl.

root@stannebeinplatz-m5:~# opkg list-installed | grep wolf
libwolfssl24 - 4.6.0-stable-1
wpad-mesh-wolfssl - 2019-08-08-ca8c2bd2-4

root@rdntz-stannebeinplatz:~# uptime
 06:21:41 up 95 days, 15:02,  load average: 0.58, 0.49, 0.46

root@rdntz-stannebeinplatz:~# ifconfig
[...]
wlan0-mesh_13 Link encap:Ethernet  HWaddr F0:9F:C2:8C:81:7A  
          inet addr:169.254.129.122  Bcast:255.255.255.255  Mask:255.255.255.255
          inet6 addr: fd70:6bf5:5eab:b59d:40ed:cc17:5c70:1dd3/16 Scope:Global
          inet6 addr: fe80::f29f:c2ff:fe8c:817a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1206901449 errors:0 dropped:0 overruns:0 frame:0
          TX packets:681661127 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1552561276067 (1.4 TiB)  TX bytes:123728430764 (115.2 GiB)

root@rdntz-stannebeinplatz:~# iw dev wlan0-mesh station dump
Station 80:2a:a8:bc:16:bb (on wlan0-mesh)
    inactive time:  0 ms
    rx bytes:   184859508823
    rx packets: 161009141
    tx bytes:   25040698030
    tx packets: 79598525
    tx retries: 6421311
    tx failed:  133
    rx drop misc:   138975
    signal:     -65 [-73, -68] dBm
    signal avg: -64 [-72, -66] dBm
    Toffset:    18446739638920988636 us
    tx bitrate: 162.0 MBit/s MCS 12 40MHz
    rx bitrate: 120.0 MBit/s MCS 11 40MHz short GI
    rx duration:    0 us
    last ack signal:35 dBm
    expected throughput:    49.163Mbps
    mesh llid:  0
    mesh plid:  0
    mesh plink: ESTAB
    mesh local PS mode: ACTIVE
    mesh peer PS mode:  ACTIVE
    mesh non-peer PS mode:  ACTIVE
    authorized: yes
    authenticated:  yes
    associated: yes
    preamble:   long
    WMM/WME:    yes
    MFP:        yes
    TDLS peer:  no
    DTIM period:    2
    beacon interval:100
    connected time: 928198 seconds
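
(Illustrative sketch, not taken from hostapd or WolfSSL.) The cost difference described at the top of this comment, generating a random prime versus an arbitrary random number, can be illustrated in plain Python. The helper names `random_number` and `random_prime` and the 256-bit size are assumptions made for this example only. The point is just that finding a prime needs many candidate draws, each followed by a primality test, while an arbitrary number needs a single draw.

```python
import random


def is_probable_prime(n, rng, rounds=16):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_number(bits, rng):
    """Cheap case: one draw, no further work."""
    return rng.getrandbits(bits)


def random_prime(bits, rng):
    """Costly case: draw odd candidates until one passes the primality test."""
    attempts = 0
    while True:
        attempts += 1
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate, rng):
            return candidate, attempts


rng = random.Random()  # not cryptographically secure; fine for a cost demo
_, attempts = random_prime(256, rng)
print("candidates tested before a 256-bit prime was found:", attempts)
```

On a slow router CPU, every one of those candidate tests is a series of modular exponentiations, which is the kind of overhead that could plausibly push a handshake past its timeout.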

In order to address what hence seems to be a regression, it'd be great to know whether

Catfriend1 commented 3 years ago

@dangowrt Using TP-Link Archer C7 v2 and v5 with ath10k non-CT drivers here. It only occurs on 21.02-rc1 after running for days, maybe a week (immediate reboots or a Wi-Fi restart cannot reproduce the problem; the weekly reboot reproduces it 100% of the time).

iw dev wlan0 (my mesh interface) shows MTU 1532 (due to batman-adv usage and configuration).

iw dev wlan0 station dump results in

failed to parse nested attributes

egon0 commented 3 years ago

thx @dangowrt for these infos and clarification.

i will have a look into this, using vanilla openwrt on wdr3600 and 2x archer c7 v5. i will try to upgrade to 19.07.6 and give it a try.

goligo commented 3 years ago

I have experienced the same issue with 21.02-rc1, but using wpad-mesh-openssl, not wolfssl. In the log first I get a new peer notification, then five times MESH-SAE-AUTH-FAILURE, followed by MESH-SAE-AUTH-BLOCKED for 300 seconds.

Hardware is TP-Link WDR4300, running 802.11s encrypted mesh with BATMAN.

Catfriend1 commented 3 years ago

I'm currently trying Openwrt 21.02.0-rc2 which has another getrandom package version shipped if that matters.

mickeyreg commented 3 years ago

Archer C2 v1 OpenWrt 19.07-SNAPSHOT r11328-81266d9001 The same issue. MESH-SAE-AUTH-FAILURE followed by MESH-SAE-AUTH-BLOCKED for 300 seconds.

Without encryption mesh works properly.

Archer C7 v5, OpenWrt 19.07.7, r11306-c4a6851c72, encrypted mesh works properly on 5GHz and on 2.4GHz.

I don't know. Reading above I think that C2 and WDR4300 are not fast enough to generate messages for handshake?

Archer C50 + Archer C2 also MESH-SAE-AUTH-FAILURE followed by MESH-SAE-AUTH-BLOCKED.

Catfriend1 commented 3 years ago

@mickeyreg For me, everything was ok on Archer C7v2|5 until I upgraded beyond 19.07.7 (snapshot, 21.02 rcX)

mickeyreg commented 3 years ago

I'm new to mesh configuration. The first try was on Archer C7 v5. I could not get it working on 21.02, so I downgraded to 19.07 and successfully configured everything. Wireless did not work at all on my C7 with 21.02 :( As I can read above, I have a slightly older SNAPSHOT on the C7 than on the C2. I'll try to upgrade the C7 next week.

Catfriend1 commented 3 years ago

@mickeyreg please see https://forum.openwrt.org/t/state-of-tp-link-archer-c7v2-v5-in-2021/95787 My mesh works on 21.02 but sometimes gets these auth failures.

djStolen commented 3 years ago

Hello, problem solved. mesh compiled with psk2 + aes +openssl openwrt 19.07.6

Hi all, I tried to set this up with psk2+aes, but either I am doing something wrong or it is not allowing me to set psk2+aes as the encryption for mesh.

Could u pls share your /config/ file where u setup this?

djStolen commented 3 years ago

Hi guys,

does the Hardware have to set

sta->sae->state = SAE_ACCEPTED

because I cannot find it anywhere in the code?

If that's not the case, and my search is not wrong, it is to be expected that it fails every time on the check:

if (sta->sae->state != SAE_ACCEPTED)

in void mesh_auth_timer(void *eloop_ctx, void *user_data)
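
To make the question concrete, here is a toy model in Python (not the wpa_supplicant source) of the timer logic being asked about. It assumes only what is quoted above: a timer callback checks whether the SAE state has reached SAE_ACCEPTED and treats anything else as a failure. The retry limit `MAX_RETRIES = 3` and the function name `simulate` are assumptions for illustration; 3 was picked because it reproduces the pattern of four failures followed by a block that appears in a C7 log later in this thread.

```python
MAX_RETRIES = 3  # assumption for this sketch, not read from the real source


def simulate(accepted_by_expiry=None, max_retries=MAX_RETRIES):
    """Return the messages a peer would log across successive timer expiries.

    accepted_by_expiry: 1-based index of the timer expiry by which the SAE
    handshake has reached the accepted state, or None if it never does.
    """
    events = []
    retries = 0
    expiry = 0
    while True:
        expiry += 1
        accepted = accepted_by_expiry is not None and expiry >= accepted_by_expiry
        if accepted:
            events.append("ACCEPTED")
            return events
        # State is not "accepted" when the timer fires: log a failure.
        events.append("MESH-SAE-AUTH-FAILURE")
        if retries < max_retries:
            retries += 1  # start another authentication attempt
        else:
            events.append("MESH-SAE-AUTH-BLOCKED")
            return events


print(simulate(None))  # state never reaches accepted
print(simulate(2))     # second attempt completes in time
```

If nothing ever moved the state to accepted, every peer would end up blocked, which is not what the working links in this thread show; so in this model the open question is only where that state change happens and why it sometimes comes too late.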

Catfriend1 commented 3 years ago

@djStolen I don't know the implementation but your finding sounds reasonable.

mickeyreg commented 3 years ago

I have tested the C2 with 19.07.7, 19.07.6 and 19.07.5: none of them works with SAE. But all of these versions now have newer versions of the wolfssl library than mentioned above.

I have tested also C7 v5 with 19.07 SNAPSHOT - works with authentication without problems.

mickeyreg commented 3 years ago

C2 is slower than C7, but the setting is: #define MESH_AUTH_TIMEOUT 10 <- it is 10 seconds, so too long for real timeout...

It took less than 1 second on the C7:

Mon Jun  7 12:17:27 2021 daemon.notice wpa_supplicant[1936]: wlan0: new peer notification for 50:d4:f7:15:15:29
Mon Jun  7 12:17:27 2021 daemon.notice wpa_supplicant[1936]: wlan0: mesh plink with 50:d4:f7:15:15:29 established
Mon Jun  7 12:17:27 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-PEER-CONNECTED 50:d4:f7:15:15:29

I found the problem also in C7 logs:

Wed Jun  2 09:27:06 2021 daemon.notice wpa_supplicant[1936]: wlan0: new peer notification for 50:d4:f7:15:1f:0a
Wed Jun  2 09:27:16 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:27:31 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:27:49 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:28:07 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:28:07 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-BLOCKED addr=50:d4:f7:15:1f:0a duration=300

Message is (can be...) generated because of timeout, but also because of ... I don't know ... broken frames?
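
The spacing of these messages can be checked against the 10-second MESH_AUTH_TIMEOUT quoted above. A small Python sketch, using only the failing C7 log lines from this comment (the helper names `parse_ts` and `gaps_seconds` are made up for the example):

```python
from datetime import datetime

MESH_AUTH_TIMEOUT = 10  # seconds, the value quoted in this comment

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

LOG = """\
Wed Jun  2 09:27:06 2021 daemon.notice wpa_supplicant[1936]: wlan0: new peer notification for 50:d4:f7:15:1f:0a
Wed Jun  2 09:27:16 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:27:31 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:27:49 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
Wed Jun  2 09:28:07 2021 daemon.notice wpa_supplicant[1936]: wlan0: MESH-SAE-AUTH-FAILURE addr=50:d4:f7:15:1f:0a
"""


def parse_ts(line):
    """Parse the leading 'Wed Jun  2 09:27:06 2021' syslog timestamp."""
    _, mon, day, clock, year = line.split()[:5]
    hour, minute, second = map(int, clock.split(":"))
    return datetime(int(year), MONTHS[mon], int(day), hour, minute, second)


def gaps_seconds(log):
    """Seconds between consecutive peer-notification/failure messages."""
    stamps = [parse_ts(line) for line in log.splitlines()
              if "new peer notification" in line
              or "MESH-SAE-AUTH-FAILURE" in line]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


gaps = gaps_seconds(LOG)
print(gaps)
print("first failure matches timeout:", gaps[0] == MESH_AUTH_TIMEOUT)
```

For this excerpt the first failure comes exactly 10 seconds after the peer notification and the later ones follow at 15 to 18 second intervals. That would be consistent with a message driven by a timer expiry rather than an immediate rejection, though it does not prove it.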

djStolen commented 3 years ago

I have tested C2 with 19.07.7, also 19.07.6 and 19.07.5 does not work with SAE. But all versions have now newer versions of wolfssl library, than mentioned above.

I have tested also C7 v5 with 19.07 SNAPSHOT - works with authentication without problems.

So u took the latest repository stand ? Right?

mickeyreg commented 3 years ago

Yes. I don't know where to find older version of packages. The mesh version of wpad is required, so it has to be reinstalled.

djStolen commented 3 years ago

Yes. I don't know where to find older version of packages. The mesh version of wpad is required, so it has to be reinstalled.

What do u mean older version? Why would I need older version?

Mesh version of wpad at least, full version also supports mesh functionalities, as seen here

there is an update via opkg for wolfssl to the latest 4.7.0, so it should not make any difference if I update on a running system in comparison to compiling with the latest 4.7.0.

mickeyreg commented 3 years ago

You can see above in the comment of @dangowrt: https://github.com/libremesh/lime-packages/issues/837#issuecomment-842034728 that encrypted mesh works on OpenWrt 19.07.6 with libwolfssl24 - 4.6.0-stable-1 and wpad-mesh-wolfssl - 2019-08-08-ca8c2bd2-4. Now I can test with this older version of OpenWrt, but only with newer versions of the SSL libraries. If the problem is performance/timing related, then a newer version is not necessarily negligible.

mickeyreg commented 3 years ago

I switched the radio from AC to N mode (VHT40 to HT40) and:

Tue Jun  8 09:50:13 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-GROUP-STARTED ssid="mesh_5G" id=0
Tue Jun  8 09:50:13 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:14 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:14 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:14 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:15 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:15 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:15 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:15 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:15 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:24 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:26 2021 daemon.notice wpa_supplicant[1640]: wlan0: mesh plink with c4:6e:1f:40:9e:bc established
Tue Jun  8 09:50:26 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-PEER-CONNECTED c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:30 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-SAE-AUTH-FAILURE addr=30:b5:c2:96:1b:9b
Tue Jun  8 09:50:31 2021 daemon.notice wpa_supplicant[1640]: wlan0: new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:31 2021 daemon.notice wpa_supplicant[1640]: wlan0: mesh plink with 30:b5:c2:96:1b:9b established
Tue Jun  8 09:50:31 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-PEER-CONNECTED 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:39 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:50 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:51:01 2021 daemon.notice wpa_supplicant[1640]: wlan0: mesh plink with d4:6e:0e:c6:0e:e6 established
Tue Jun  8 09:51:01 2021 daemon.notice wpa_supplicant[1640]: wlan0: MESH-PEER-CONNECTED d4:6e:0e:c6:0e:e6

Not without problems but works...
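
A rough way to quantify "not without problems" is to count the failures per peer and measure how long each peer took from its first notification to MESH-PEER-CONNECTED. The Python sketch below does that for the log above (repeated "new peer notification" lines are left out of the sample for brevity; the function names are invented for this example):

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
MAC = r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
P = "daemon.notice wpa_supplicant[1640]: wlan0:"  # common prefix of every line

LOG = f"""\
Tue Jun  8 09:50:13 2021 {P} new peer notification for d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:14 2021 {P} new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:14 2021 {P} new peer notification for c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:24 2021 {P} MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:26 2021 {P} MESH-PEER-CONNECTED c4:6e:1f:40:9e:bc
Tue Jun  8 09:50:30 2021 {P} MESH-SAE-AUTH-FAILURE addr=30:b5:c2:96:1b:9b
Tue Jun  8 09:50:31 2021 {P} new peer notification for 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:31 2021 {P} MESH-PEER-CONNECTED 30:b5:c2:96:1b:9b
Tue Jun  8 09:50:39 2021 {P} MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:50:50 2021 {P} MESH-SAE-AUTH-FAILURE addr=d4:6e:0e:c6:0e:e6
Tue Jun  8 09:51:01 2021 {P} MESH-PEER-CONNECTED d4:6e:0e:c6:0e:e6
"""


def parse_ts(line):
    """Parse the leading 'Tue Jun  8 09:50:13 2021' syslog timestamp."""
    _, mon, day, clock, year = line.split()[:5]
    hour, minute, second = map(int, clock.split(":"))
    return datetime(int(year), MONTHS[mon], int(day), hour, minute, second)


def summarize(log):
    """Per peer: number of SAE failures and seconds until first connect."""
    first_seen, failures, delay = {}, {}, {}
    for line in log.splitlines():
        if not line.strip():
            continue
        ts = parse_ts(line)
        m = re.search("new peer notification for " + MAC, line)
        if m:
            first_seen.setdefault(m.group(1), ts)
        m = re.search("MESH-SAE-AUTH-FAILURE addr=" + MAC, line)
        if m:
            failures[m.group(1)] = failures.get(m.group(1), 0) + 1
        m = re.search("MESH-PEER-CONNECTED " + MAC, line)
        if m and m.group(1) in first_seen and m.group(1) not in delay:
            delay[m.group(1)] = int((ts - first_seen[m.group(1)]).total_seconds())
    return {addr: {"failures": failures.get(addr, 0),
                   "seconds_to_connect": delay.get(addr)}
            for addr in first_seen}


for addr, stats in summarize(LOG).items():
    print(addr, stats)
```

In this excerpt the peer with no failures connected after 12 seconds, the peer with one failure after 17 seconds, and the peer with three failures after 48 seconds.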

djStolen commented 3 years ago

I switched the radio from AC to N mode (VHT40 to HT40) and:

[...]

Not without problems but works...

Ok, good to know since my device has no AC mode.

djStolen commented 3 years ago

You can see above in the comment of @dangowrt: https://github.com/libremesh/lime-packages/issues/837#issuecomment-842034728 that encrypted mesh works on OpenWrt 19.07.6 with libwolfssl24 - 4.6.0-stable-1 and wpad-mesh-wolfssl - 2019-08-08-ca8c2bd2-4. Now I can test with this older version of OpenWrt, but only with newer versions of the SSL libraries. If the problem is performance/timing related, then a newer version is not necessarily negligible.

Ok, got it, what u meant. I already tried, and it behaves the same with me.

djStolen commented 3 years ago

I have another question.

will

#ifdef CONFIG_DEBUG_SYSLOG

work with default "logger"?

and is there a predefined place (in make menuconfig e.g.) for defining CONFIG_DEBUG_SYSLOG?

mickeyreg commented 3 years ago

I'm not a specialist and I can describe only my observations. It looks like mesh mode needs stable radio. I configured C7 v5 without any problems (I ignore the need to uninstall CT drivers). Next I tried to set mesh on C2 devices. It was configured like C7 before ... and did not work. I started searching...

First I changed channel from 149 to 48 and it gave me stable mesh ... but without encryption. I had MESH-SAE-AUTH-FAILURE and MESH-SAE-AUTH-BLOCKED when I turned on SAE.

I've tried WolfSSL and OpenSSL. I've tried newer and older OpenWrt. Finally I've had an idea to turn radio to slower mode. Slower modes are more immune to low SNR and other signal distortions. And it magically started to work :) Now it works (tested): with wpad-mesh-wolfssl, with wpad-mesh-openssl, and with different ones on mesh nodes.

Catfriend1 commented 3 years ago

I've tried WolfSSL and OpenSSL.

Meanwhile, I've also tried that, but it didn't help. Sometimes, about 24 hours after the "block loop" occurred, it heals itself, or it recovers if I go physically to the AP's location and POR it.

nemesifier commented 3 years ago

I have also observed this issue happening after power outages and I have not been able to explain why it happens. Last time it happened I had to reboot a couple of devices several times for this to go away.

I'm using a build from OpenWRT master of a couple months ago, but I am not using Libremesh, I don't think this is a Libremesh specific issue, it probably has to do either with the linux 802.11s implementation, the ssl library or something else. I'm posting this info here because I didn't have luck in finding other OpenWRT users in the OpenWRT forum who had the same issue. I tried increasing the log level on the radios to see if I can capture more detailed info but no luck so far. If anyone has suggestions on steps to follow to debug the issue in more detail please let us know!

PS: did anybody see these errors as well?

daemon.notice wpa_supplicant[7398]: nl80211: Failed to set interface into station mode
daemon.err wpa_supplicant[7398]: mesh1: mesh leave error=-134

goligo commented 3 years ago

A little more info regarding my setup and observations, in the hope it will help to better understand/narrow down the issue: I have three TL-WDR4300, one of them connected via cable for internet access. I am using 21.02 and have configured mesh like this:

config wifi-iface 'mesh0'
        option device 'radio0'
        option ifname 'mesh0'
        option network 'nwi_mesh0'
        option mode 'mesh'
        option mesh_fwding '0'
        option mesh_id 'mesh0'
        option key '<secret key>'
        option mesh_rssi_threshold '0'
        option encryption 'sae'

The connections from the main node to the two mesh nodes are at -70dBm (144MBit) and -80dBm (43MBit). This works stably for several weeks until, without any obvious cause or trigger, one or both mesh nodes get lost and can no longer be accessed. When looking into the logfiles I see MESH-SAE-AUTH-FAILURE followed by MESH-SAE-AUTH-BLOCKED over and over again.

If I just restart the lost mesh node, it will connect only for a few minutes, before getting lost again. Only when restarting all mesh nodes, I will get into the stable state, which lasts for several weeks again.

kpoman commented 3 years ago

Same issue here with 21.02 snapshot, 802.11s using wolf ssl on a TPLink Archer C6 v2

nemesifier commented 3 years ago

I observed this today again, I have two interfaces:

  • mesh0 on 2GHz
  • mesh1 on 5GHz

Mesh1 on 5GHz works, connection to the rest of the LAN works.

It looks to me that the mesh0 link gets disabled automatically for inactivity:

mesh0     ESSID: "*******"
          Access Point: ***********************
          Mode: Mesh Point  Channel: 11 (2.462 GHz)
          Center Channel 1: 11 2: unknown
          Tx-Power: 20 dBm  Link Quality: 46/70
          Signal: -64 dBm  Noise: unknown
          Bit Rate: 1.0 MBit/s
          Encryption: WPA3 SAE (CCMP)
          Type: nl80211  HW Mode(s): 802.11bgn
          Hardware: 14C3:7603 14C3:7603 [MediaTek MT7603E]
          TX power offset: none
          Frequency offset: none
          Supports VAPs: yes  PHY name: phy0

Look at Bit Rate: 1.0 MBit/s. I've looked around and it seems that's the result of the rate control algorithm of the mac80211 driver when it detects inactive wifi links.

So, could it be possible that in this case, the MESH-SAE-AUTH-FAILURE is just a result of the minstrel_ht rate control which tunes down this WiFi interface because it's not being used?

BTW, here's iw dev mesh0 station dump:

Station *********** (on mesh0)
    inactive time:  36 ms
    rx bytes:   111090
    rx packets: 1380
    tx bytes:   3584
    tx packets: 28
    tx retries: 8
    tx failed:  0
    rx drop misc:   173
    signal:     -64 [-64, -71] dBm
    signal avg: -63 [-63, -71] dBm
    Toffset:    29762706465 us
    tx bitrate: 1.0 MBit/s
    tx duration:    54336 us
    rx duration:    0 us
    airtime weight: 256
    mesh llid:  0
    mesh plid:  0
    mesh plink: BLOCKED
    mesh airtime link metric: -1
    mesh connected to gate: no
    mesh connected to auth server:  no
    mesh local PS mode: UNKNOWN
    mesh peer PS mode:  UNKNOWN
    mesh non-peer PS mode:  ACTIVE
    authorized: no
    authenticated:  no
    associated: no
    preamble:   long
    WMM/WME:    yes
    MFP:        yes
    TDLS peer:  no
    DTIM period:    2
    beacon interval:100
    connected time: 71 seconds
    associated at [boottime]:   0.000s
    associated at:  1625990421471 ms
    current time:   1626024600677 ms

Notice: mesh plink: BLOCKED.
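
For anyone who wants to watch for this state automatically, here is a minimal Python sketch that picks the blocked peers out of `iw dev <iface> station dump` output. It only relies on the "Station ..." and "mesh plink:" lines visible in the dumps in this thread. The sample text is shortened, and the address 02:00:00:00:00:01 is a made-up placeholder for the masked one above.

```python
SAMPLE = """\
Station 80:2a:a8:bc:16:bb (on wlan0-mesh)
    signal:     -65 [-73, -68] dBm
    mesh plink: ESTAB
    authorized: yes
Station 02:00:00:00:00:01 (on mesh0)
    signal:     -64 [-64, -71] dBm
    mesh plink: BLOCKED
    authorized: no
"""


def plink_states(dump):
    """Map each station address to its 'mesh plink' state."""
    states = {}
    current = None
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith("Station "):
            current = stripped.split()[1]
        elif stripped.startswith("mesh plink:") and current is not None:
            states[current] = stripped.split(":", 1)[1].strip()
    return states


def blocked_peers(dump):
    """Addresses of all stations whose plink state is BLOCKED."""
    return [mac for mac, state in plink_states(dump).items()
            if state == "BLOCKED"]


print(blocked_peers(SAMPLE))
```

On a router the input would come from the real command output rather than from `SAMPLE`, for example by piping it into a script that reads standard input.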

thiagokokada commented 3 years ago

Adding my 2c at the discussion.

I was also suffering from MESH-SAE-AUTH-FAILURE while using 802.11s with a Xiaomi Mi Router 3G and a TP-Link RE200v4. One thing I noted during debugging is that the TP-Link RE200v4 got really slow while doing SAE negotiation (even SSH got visibly slower while the RE200v4 was negotiating). top showed that the /usr/sbin/hostapd -s -g /var/run/hostapd/global process was using 90~100% of CPU time. The Xiaomi Mi Router 3G, on the other hand, seemed completely fine during SAE negotiation, probably because it has a much more powerful CPU (2 cores/4 threads).

So I decided to compile a custom OpenWrt image for RE200v4 (for now it is running 21.02.0-rc4, but I was running snapshot before with the same result), with -O3 optimization flag in place of the default -Os. You can change this on make menuconfig going in the Advanced configuration options (for developers) [enable it first] -> Target options [enable it first] -> Target optimizations. The result is something like this in the .config file:

CONFIG_DEFAULT_TARGET_OPTIMIZATION="-Os -pipe -mno-branch-likely -mips32r2 -mtune=24kc"
# ...
CONFIG_TARGET_OPTIONS=y
CONFIG_TARGET_OPTIMIZATION="-O3 -pipe -mno-branch-likely -mips32r2 -mtune=24kc"

After compilation, I got a much bigger image (before: approx. 5.3MB, after: approx. 6.3MB), because this makes basically all binaries bigger. However, CPU usage during SAE negotiation decreased to 60~70% of CPU time on the RE200v4, and my mesh network finally connected (I still got some MESH-SAE-AUTH-FAILURE errors, though, so this is probably only part of the issue).

So my hypothesis here is that part of the reason for MESH-SAE-AUTH-FAILURE errors is just because it takes so much CPU time to do the negotiation that maybe it fails with a timeout in some peers with weaker CPUs.

Unfortunately, I didn't find a way to compile only some packages with -O3, so I couldn't compile only wpad-wolfssl with it. If this were an option I could reduce the size increase in the image. However, at least for me, since I don't need many packages on the RE200v4, it works.

P.S.: before someone asks, I also tried wpad-openssl with the same trick. It also reduced MESH-SAE-AUTH-FAILURE errors, however once I put load on the Wi-Fi connection all clients connected to it dropped (the 802.11s mesh stayed connected, though, so I thought this was weird).

P.S.2: I also tried -O2 in place of -O3, but I got mixed results. It also seems to help reduce MESH-SAE-AUTH-FAILURE errors (and I got it to successfully connect to the mesh too), but the CPU usage was also near 90% during SAE negotiation. On the positive side, this increased the image by only approx. 0.5MB instead of 1MB.

thiagokokada commented 3 years ago

I decided to look at top again with both -O2 and -O3. Looking closely this time, I saw CPU usage peaks of 95% even with -O3. I also decided to disable ASLR [1], since this is one of the changes from OpenWrt 19.07 to 21.02 (and people seem to have more success with 19.07 than 21.02). CPU usage stayed about the same.

Maybe there is a better way to measure performance than top, but I can't think of a way to do it.

But well, for me it works now when it didn't before. For now I am using -O2 with ASLR disabled and it connects fairly reliably (but I still get some MESH-SAE-AUTH-FAILURE errors and even MESH-SAE-AUTH-BLOCKED sometimes).

[1]: keep in mind that this will probably decrease security. In my case I am using this only on an Access Point that isn't running any Internet exposed services. But on a device connected directly to the Internet, disabling ASLR is probably a bad idea.

nemesifier commented 3 years ago

I observed this today again, I have two interfaces:

  • mesh0 on 2GHz
  • mesh1 on 5GHz

[...]

An update on this: I upgraded to OpenWrt 21.02 RC4 and added option cell_density '1' to the radio configuration, which should avoid the problem of minstrel HT tuning down the radio so much that the mesh would not be able to connect.

ghost commented 3 years ago

@nemesisdesign Hi, I'm suffering the same problem since 21.02.0-rcX releases. It seems to me too that the WiFi is somehow "throttled in a way" the mesh 802.11s handshake cannot take place.

When the mesh link dropped with the "blocked 300s" message, the disconnected status lasted according to our monitoring tools for about 1 hour until it recovered by itself.

I've now set "cell_density '1'" (UI setting: normal) on all mesh APs, but the situation did not get better.

Devices rebooted, I saw the following on station A (where station B is expected to connect wirelessly to).

After a short waiting time, a "slow" WiFi link came up: no ping yet, station B not reachable.

Then ping started working and station B was reachable. A few minutes later, some ping packets got lost (I don't know why).

I will now observe, if the link stability is better with "cell_density '1'" than before. But my feeling says no - it did not improve.

At the moment, the mesh link is "useable".


ghost commented 3 years ago

@nemesisdesign Today, my mesh link failed again with MESH-SAE-AUTH-BLOCKED duration=300 for no apparent reason between two wirelessly meshed TP-Link Archer C7v2/5 units (I don't have any more). It recovered by itself after about an hour. Conclusion: "cell_density '1'" is NOT a solution to the problem. I've reverted the setting back to default (0).

Catfriend1 commented 3 years ago

@nemesisdesign I now left "cell_density" at "0" (default value) by omitting the option in the wireless config. Plus, I've now removed the "legacy_rates '0'" option I had in place before.

To sum up: With cell_density "0" and legacy_rates "1" (both are defaults in OpenWrt 21.02.0-rc.4) my mesh seems to run stable without interruption.

Thank you for pointing me into the right direction where to look at 👍 .

ghost commented 2 years ago

This is still an issue on OpenWrt 21.02.0-rc4, ref.: https://forum.openwrt.org/t/having-trouble-with-sure-mesh-links Rebooting one AP of a two-AP mesh in a test lab environment during a weekly planned maintenance window causes the "MESH-SAE-AUTH-BLOCKED, duration=300" error, which stays for a long time until it recovers by itself. I had a rock-solid, stable mesh connection before every planned reboot and didn't lose a single ping. Weird.