openNDS / mesh11sd

Mesh11sd is a dynamic parameter configuration daemon for 802.11s mesh networks.
GNU General Public License v2.0
check_gate and check_portal improvements #30

Closed chrdev closed 8 months ago

chrdev commented 10 months ago

I've recently noticed instabilities and slowdowns in a SOHO mesh environment. I then modified /usr/sbin/mesh11sd, and the whole mesh has run for a week with no glitches. So I am here to share what I've found.

All code is from 3.0.0beta; earlier versions may share the same problems.

check_gate() {
  is_gate=$(iw dev | grep -w "type AP")
...
}

Since in practice almost all mesh nodes are APs as well, this line of code seems to lack the best judgement to give value to is_gate. Maybe "ip route" is a better approach. See below.

check_portal() {
...
  proto=$(uci get network.lan.proto)
  default_gw=$(ip route | grep "default via")
  authoritative=$(uci get dhcp.@dnsmasq[0].authoritative 2>/dev/null | awk '{printf "%d", $1}')

  if [ -z "$default_gw" ]; then
    is_portal=""
  else
    gw_ip=$(echo "$default_gw" | awk -F" " '{printf "%s", $3}')
    wan_ip=$(echo "$default_gw" | awk -F" " '{printf "%s", $7}')
    is_portal=$(ip addr | grep "$wan_ip")
  fi
...
}

check_portal() has multiple problems.

  1. $proto and $gw_ip are never used. Maybe they are leftovers?

  2. On a dumb AP, $default_gw is never empty, thus $is_portal is never empty, but $is_portal should be empty on a dumb AP. On one of my dumb AP nodes:

    ip route | grep "default via"
    default via 192.168.1.1 dev br-lan
  3. Because $default_gw could be something like 'default via 192.168.1.1 dev br-lan', there is no field 7, so $wan_ip is empty. grep with an empty pattern matches every line, so $is_portal is the whole output of "ip addr", which is clearly not this code's intention.
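
The failure described in point 3 can be reproduced without a router. This is a minimal sketch that uses only the route line quoted above; the `sample_addr` text is a simplified, made-up stand-in for real "ip addr" output.

```shell
#!/bin/sh
# Route line as quoted above from a dumb AP.
default_gw='default via 192.168.1.1 dev br-lan'

# The line has only five fields, so field 7 is empty.
wan_ip=$(echo "$default_gw" | awk '{printf "%s", $7}')

# Hypothetical, simplified stand-in for the output of "ip addr".
sample_addr='1: lo: mtu 65536
    inet 127.0.0.1/8 scope host lo
2: br-lan: mtu 1500
    inet 192.168.1.2/24 scope global br-lan'

# An empty grep pattern matches every line, so nothing is filtered out.
is_portal=$(echo "$sample_addr" | grep "$wan_ip")

if [ "$is_portal" = "$sample_addr" ]; then
    echo "is_portal holds the entire input"
fi
```

Because `is_portal` is non-empty here, the `[ -z "$is_portal" ]` test in check_portal takes the "portal" branch.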

So, mesh11sd ver 3.0.0beta fails to detect gate and portal correctly. It sets all dumb AP nodes to gate and portal, which renders the whole mesh unstable and unusable.

Here is my temporary remedy:

#proto=$(uci get network.lan.proto)
#default_gw=$(ip route | grep "default via")
#authoritative=$(uci get dhcp.@dnsmasq[0].authoritative 2>/dev/null | awk '{printf "%d", $1}')
#
#if [ -z "$default_gw" ]; then
#  is_portal=""
#else
#  gw_ip=$(echo "$default_gw" | awk -F" " '{printf "%s", $3}')
#  wan_ip=$(echo "$default_gw" | awk -F" " '{printf "%s", $7}')
#  is_portal=$(ip addr | grep "$wan_ip")
#fi

is_portal=''
default_gw=$(ip route | grep -m 1 -F 'default via' | cut -d ' ' -f 5)
# $default_gw is "br-lan" or "pppoe-wan", etc
if [ "$default_gw" != 'br-lan' ]; then
    is_portal='1'
fi

if [ -z "$is_portal" ]; then
  # This IS NOT a layer 3 mesh portal
  uci set mesh11sd.mesh_params.mesh_connected_to_as='0'
  uci set mesh11sd.mesh_params.mesh_connected_to_gate='0'
...
else
  # This IS a layer 3 mesh portal
  uci set mesh11sd.mesh_params.mesh_connected_to_as='1'
  uci set mesh11sd.mesh_params.mesh_connected_to_gate='1'

...

#   check_gate
    sleep $checkinterval

I also removed all lines that re-configure network and dnsmasq. These are things that should not be tampered with lightly, because dnsmasq may not even be enabled and running on a dumb AP. We must assume it's un-configured and avoid starting it.

For ver 3.0.0beta, also set portal_detect to 1 in file /etc/config/mesh11sd.

Once we have modified /usr/sbin/mesh11sd, we stop it and start it again, then verify that all is good. On the router:

uci show mesh11sd

mesh11sd.mesh_params.mesh_connected_to_as='1'
mesh11sd.mesh_params.mesh_connected_to_gate='1'

On a mesh node, dumb AP:

uci show mesh11sd

mesh11sd.mesh_params.mesh_connected_to_as='0'
mesh11sd.mesh_params.mesh_connected_to_gate='0'

Now we have a stable mesh.

Suggestions for further improvements:

  1. Read mesh_connected_to_as and mesh_connected_to_gate from the config file. If the user set these parameters explicitly, respect them and skip auto-detection.

  2. Rewrite the code for $is_portal. Consider using my method or a more reliable one.

  3. If the user demands auto-config, only reconfigure dnsmasq if it's enabled. If it's disabled, only config wireless, but not dnsmasq.

  4. If portal_detect is set to 0, the default, in the config file, assume the node is non-portal instead of is-portal, because there are (many) more non-portal nodes than portal nodes. We can also assume a portal is well configured, with any necessary parameters set explicitly, whereas a non-portal node may rely more on defaults.
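
One way suggestion 2 could look is sketched below. It follows the same heuristic as the temporary remedy (the node is treated as a portal when its default route leaves through something other than the LAN bridge) but looks up the token after "dev" instead of relying on a fixed field position. The function names, the 10.0.0.1 address and the `br-lan` default are illustrative assumptions, not part of mesh11sd.

```shell
#!/bin/sh
# Print the interface named after "dev" on the first default route line.
# Reads "ip route" style text on stdin.
default_route_dev() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Hypothetical helper: succeed when a default route exists and it does not
# use the LAN bridge (first argument, assumed to be br-lan if omitted).
looks_like_portal() {
    dev=$(default_route_dev)
    [ -n "$dev" ] && [ "$dev" != "${1:-br-lan}" ]
}

# Demo on two sample route lines instead of live "ip route" output.
for line in 'default via 192.168.1.1 dev br-lan' \
            'default via 10.0.0.1 dev pppoe-wan'; do
    if echo "$line" | looks_like_portal; then
        echo "portal:     $line"
    else
        echo "not portal: $line"
    fi
done
```

On a node the input would come from `ip route | looks_like_portal`. A node with no default route at all is reported as non-portal.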

bluewavenet commented 10 months ago

@chrdev First of all, thank you for all your efforts, all feedback is much appreciated.

However, it must be remembered that a beta version is always a work in progress and not intended for serious use. The v3 beta on Github is in fact not far removed from v2.1.0beta which was superseded. Also much additional work on the project has been done and not yet pushed to github, such as channel tracking.

For context, the Mesh11sd project is an "open source" refactoring of a proprietary system and the refactoring goes in bursts of work fitted in around commercial development.

The purpose of portal detect is to allow the same build to be used on all mesh nodes. "Dumb APs" (that term is so very incorrect), meaning an access point without any layer 3 routing, should never be added to a mesh11sd network that is set to auto-configure. (https://openwrt.org/docs/guide-user/network/wifi/dumbap#wireless_access_point_aka_dumb_access_point)

I do not mean in any way to belittle your efforts, in fact your comments are valuable as it shows at the very least how important documentation is. Unfortunately you have somewhat missed the point of the new functionality provided by v3 onwards - but no matter, it is all valuable information.

I have had a quick look at your comments (I will look in more detail later) and will respond on some parts now to provide some clarification.

Gate and AS

  1. mesh_connected_to_as and mesh_connected_to_gate are only intended as settable parameters that match the actual configuration.
  2. "connected_to_gate" means the meshnode is also an AP
  3. "connected_to_as" means the meshnode has a directly connected upstream (Internet) feed. "as" stands for "authentication service", generally, but not necessarily, meaning a captive portal of some kind or "just an Internet feed".
  4. The gate and as parameters are auto set. Setting manually should never be required and in fact can break things.

Since in practice almost all mesh nodes are APs as well, this line of code seems to lack the best judgement to give value to is_gate.

Any mesh node that is also an AP is defined as a gate. So this line of code is all that is required.

Maybe "ip route" is a better approach.

From this and your later discussion, it seems you have missed the point of auto-configure, but we can discuss this later.

check_portal

check_portal is intended to determine if a mesh node has a directly connected upstream layer 3 link.

I also removed all lines that re-configure network and dnsmasq.

The whole purpose of check_portal is to determine if dnsmasq should be enabled or disabled.... (ie auto-configuring to something akin to a "dumb AP" with a mesh interconnect, as required).

Meshnode "Modes"

A summary of modes:

A mesh portal with an upstream non-mesh feed: mesh11sd.mesh_params.mesh_connected_to_as='1'

A peer meshnode - mesh11sd.mesh_params.mesh_connected_to_gate='0'

A gateway meshnode ie a peer meshnode with an AP: mesh11sd.mesh_params.mesh_connected_to_as='0' and mesh11sd.mesh_params.mesh_connected_to_gate='1'

A portal meshnode with an AP: mesh11sd.mesh_params.mesh_connected_to_as='1' and mesh11sd.mesh_params.mesh_connected_to_gate='1'
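
The summary above can be read as a two-bit lookup table. This is a minimal sketch: `mesh_mode` is a hypothetical helper, and the labels for the as=1/gate=0 and as=0/gate=0 combinations are inferred from the summary rather than spelled out in it.

```shell
#!/bin/sh
# Map (mesh_connected_to_as, mesh_connected_to_gate) to a mode name.
# Usage: mesh_mode AS_BIT GATE_BIT
mesh_mode() {
    case "$1$2" in
        11) echo "portal with AP" ;;
        10) echo "portal" ;;          # inferred: upstream feed, no AP
        01) echo "gate" ;;            # peer meshnode with an AP
        00) echo "peer" ;;            # inferred: neither AP nor upstream feed
        *)  echo "unknown"; return 1 ;;
    esac
}

# Print the whole table.
for as in 0 1; do
    for gate in 0 1; do
        echo "as=$as gate=$gate -> $(mesh_mode "$as" "$gate")"
    done
done
```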

We can go into more detail if you wish, but I would suggest you test v3 using the following configuration:

On each "router" to be configured, flash with the latest stable default OpenWrt image.

Install mesh11sd, on each router in turn, with an ethernet connection to your ISP router - disconnecting the ethernet when finished. Do NO OTHER configuration.

When all the routers are done, connect one back up to the ISP router. This will be the portal mesh node.

Power up all the other mesh nodes and wait a few minutes for the meshnodes to autoconfigure and the mesh to establish.

Test! The beta on Github should work.....

chrdev commented 10 months ago

Thank you so much for your great work, and for this swift reply.

Lack of documentation is definitely the culprit of many things.

Your mesh modes summary clarifies things a lot and should be well documented.

However, about mesh_connected_to_gate, I still don't understand two things.

Firstly, connected_to_X normally means that something itself is not X, but it connects to that X. However, here a node being connected_to_X means the node itself IS the X. Maybe it would be better named mesh_is_X, but I can accept it as it is.

Secondly, gate doesn't mean gateway which translates net addresses, but it means AP. I don't quite follow. Moreover, except for the node which is the router, gateway and DHCP server, I have now set mesh_connected_to_gate=0 on all my mesh nodes, so why does the mesh work smoothly? Does mesh_connected_to_gate=1 or 0 make no difference in practice?

About mesh_connected_to_as, I am deeper in the mist. The code determines whether the node is "AS" by only checking dnsmasq config of "authoritative".

check_portal() {
...
    authoritative=$(uci get dhcp.@dnsmasq[0].authoritative 2>/dev/null | awk '{printf "%d", $1}')
...

And dnsmasq.authoritative maps to --dhcp-authoritative; we can check /etc/init.d/dnsmasq to confirm it.

The dnsmasq man page reads:

--dhcp-authoritative
  Should be set when dnsmasq is definitely the only DHCP server on a network. For DHCPv4, it changes the behaviour from strict RFC compliance so that DHCP requests on unknown leases from unknown hosts are not ignored. This allows new hosts to get a lease without a tedious timeout under all circumstances. It also allows dnsmasq to rebuild its lease database without each client needing to reacquire a lease, if the database is lost. For DHCPv6 it sets the priority in replies to 255 (the maximum) instead of 0 (the minimum).

So it seems that mesh_connected_to_as=1 doesn't mean the node has an upstream non-mesh feed, but it means the node is a DHCP server. That doesn't seem quite right.

Finally, what if we install mesh11sd and do no other configuration?

      is_portal=$(ip addr | grep "$wan_ip")

Because $wan_ip is empty, $is_portal is the full output of "ip addr", which is never empty, so the node will always be a portal, be "AS", and because it's an AP, it will always be "gate".

if [ -z "$is_portal" ]; then
  ...
else
    # This IS a layer 3 mesh portal
    uci set mesh11sd.mesh_params.mesh_connected_to_as='1'

    if [ "$authoritative" -eq 0 ] || [ -z "$authoritative" ]; then
        debugtype="debug"
        syslogmessage="This meshnode is an upstream portal"
        write_to_syslog

        uci set dhcp.@dnsmasq[0].authoritative='1'
        uci set dhcp.lan.ignore='0'
        uci set network.lan.stp='1'
        uci set dhcp.@dnsmasq[0].quietdhcp='1'
        /etc/init.d/network restart
        /etc/init.d/dnsmasq restart
    fi

Now we come to the else section, but $authoritative is 1 by default; we can check /etc/config/dhcp to confirm. So the code that sets and restarts network and dnsmasq will not run, by accident. Now we have a working mesh, somewhat by accident.

If the user sets dnsmasq authoritative to 0 on the nodes, like I did, then even though dnsmasq was disabled and not running, the result is a bunch of authoritative DHCP servers running in the network, rendering it unstable.

For management purposes, we normally want static IPs for mesh nodes. So "Do NO OTHER configuration" may not be the best practice. Maybe it's better to assume the mesh node is also a dumb AP? That means the node has a static IP, is an AP, and has no dnsmasq running.

Again, thank you for your great work!

chrdev commented 10 months ago

I am baffled by the namings and definitions, I can't sleep! Haha. So I dug a little deeper.

linux/include/net/cfg80211.h

/*
* @dot11MeshConnectedToAuthServer: if set to true then this mesh STA
*   will advertise that it is connected to a authentication server
*   in the mesh formation field.
*
* @dot11MeshConnectedToMeshGate: if set to true, advertise that this STA is
*   connected to a mesh gate in mesh formation info.  If false, the
*   value in mesh formation is determined by the presence of root paths
*   in the mesh path table
*/

iw/nl80211.h

/*
* @NL80211_MESHCONF_CONNECTED_TO_GATE: If set to true then this mesh STA
*   will advertise that it is connected to a gate in the mesh formation
*   field.  If left unset then the mesh formation field will only
*   advertise such if there is an active root mesh path.
*
* @NL80211_MESHCONF_CONNECTED_TO_AS: If set to true then this mesh STA
*   will advertise that it is connected to a authentication server
*   in the mesh formation field.
*/

They tell more or less the same message, which still leaves me in the dark. Then I came across this good 3534-page 802.11-2016.pdf

Excerpt from 3. Definitions, acronyms, and abbreviations, 802.11-2016.pdf

access point (AP): An entity that contains one station (STA) and provides access to the distribution services, via the wireless medium (WM) for associated STAs. An AP comprises a STA and a distribution system access function (DSAF).

distribution system access function (DSAF): A function within an access point (AP) or mesh gate that uses the medium access control (MAC) service and distribution system service (DSS) to provide access between the distribution system (DS) and the wireless medium (WM).

mesh gate: Any entity that has a mesh station (STA) function and a distribution system access function (DSAF) to provide access to a single distribution system for the mesh basic service set (MBSS).

authentication: The service used to establish the identity of one station (STA) as a member of the set of STAs authorized to associate with another STA.

Authentication Server (AS): An entity that provides an authentication service to an Authenticator. This service determines, from the credentials provided by the Supplicant, whether the Supplicant is authorized to access the services provided by the Authenticator. (IEEE Std 802.1X-2010)

Based on this information, if I understand it right, in a typical SOHO environment which doesn't use radius, for optimal performance, every mesh node should advertise itself as gate and also AS.

My previous configuration told all nodes not to advertise themselves as gate or AS. This left the authentication work to the main router, and because there was an active root mesh path, it worked; not optimally, but it worked.

I'll try to rewrite /usr/sbin/mesh11sd to set all nodes mesh_connected_to_gate='1' and mesh_connected_to_as='1'.

However, again, I insist dnsmasq should be left well alone if not enabled. A mesh authentication server has nothing to do with dnsmasq --dhcp-authoritative switch. We don't want a bunch of authoritative DHCP servers running in our network, do we?
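
The "leave dnsmasq alone if not enabled" idea could be expressed as a small guard. This is only a sketch of the proposal, not code from mesh11sd: `run_if_enabled` is a hypothetical helper, and it assumes the OpenWrt init script convention that `<initscript> enabled` exits 0 when the service is enabled. The demo uses `true` and `false` as stand-ins for an enabled and a disabled init script.

```shell
#!/bin/sh
# Hypothetical guard: run the remaining arguments as a command only if the
# init script (first argument) exists and "<initscript> enabled" exits 0.
run_if_enabled() {
    initscript=$1
    shift
    if command -v "$initscript" >/dev/null 2>&1 && "$initscript" enabled; then
        "$@"
    else
        echo "skipping: $initscript is not enabled" >&2
    fi
}

# Demo with stand-ins: "true" behaves like an enabled service,
# "false" like a disabled one.
run_if_enabled true  echo "would reconfigure and restart dnsmasq"
run_if_enabled false echo "this line is never printed"
```

On a node the call would look like `run_if_enabled /etc/init.d/dnsmasq /etc/init.d/dnsmasq restart`.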

bluewavenet commented 10 months ago

@chrdev

Firstly, connected_to_X normally means that something itself is not X, but it connects to that X.

Correct. mesh_connected_to_as is a short form of "This meshnode is directly connected to an authentication server". The 802.11s standard was developed (back in the mid 2000s) with community type infrastructure in mind, requiring some sort of authentication.

That authentication was back then considered most likely to be radius or similar. These days it is most likely to be a captive portal (like openNDS), but could also be the trivial case of "unrestricted access".

The software module that does the "AS" part is a logical unit and can be running locally on the mesh portal or anywhere else, or not even exist in the unrestricted case.

We may agree, or not, with the historical naming convention used by the standard, but once you know what it means, it does make sense.

Secondly, gate doesn't mean gateway which translates net addresses

You are confusing layer 2 with layer 3+. In a layer 3 ip network, yes GATEWAY means the ip address to use to be routed elsewhere (the capitals are mine to emphasise special meaning)

In an 802.11s mesh, which is entirely layer 2, the term "gateway" is as in the English language ie it is an entrance/exit (note my lower case).

So the terms "mesh gate" or "mesh gateway" are synonymous with a "mesh peer node that also has an access point for non-mesh devices to connect to".

For management purposes, we normally want static IPs for mesh nodes. So "Do NO OTHER configuration" may not be the best practice

For management purposes, mesh11sd uses its own ipv6 network based on the mac addresses of the mesh nodes. See: https://github.com/openNDS/mesh11sd#7-command-line-interface This is new to v3 and as yet is only mentioned in the readme - (documentation needed).

$authoritative is 1 by default,

Yes, this is intentional. The default power up mode should be "mesh portal". Mesh11sd then checks if the meshnode has an upstream ip connection. If it does not, it reconfigures as a "mesh peer" or a "mesh gate" (translating into your terminology "mesh gate" means a dumb AP with a mesh link).

So:

  1. a "mesh portal" has dnsmasq automatically enabled.
  2. a "mesh gate" or a "mesh peer" has dnsmasq automatically disabled

In addition, version 3 has channel tracking. All mesh peers (and gates) will track the channel used by the "mesh portal" and adjust accordingly. If you change the channel on the mesh portal, this will ripple out to all other mesh nodes.

Based on this information, if I understand it right, in a typical SOHO environment which doesn't use radius, for optimal performance, every mesh node should advertise itself as gate and also AS.

I'll try to rewrite /usr/sbin/mesh11sd to set all nodes mesh_connected_to_gate='1' and mesh_connected_to_as='1'.

I insist dnsmasq should be left well alone if not enabled. A mesh authentication server has nothing to do with dnsmasq --dhcp-authoritative switch. We don't want a bunch of authoritative DHCP servers running in our network, do we?

Your insistence is born of a lack of understanding (due in turn, for the most part, to the current lack of documentation for the v3 beta).

Up to now you have not grasped the concept and are stuck in the world of "dumb APs" and ipv4 layer 3 networking. I hope this all helps to understand the concept that v3 is enabling.

chrdev commented 10 months ago

Thank you bluewavenet, for your kind explanation. Some of my mist has cleared. I think what I really need is that portal_detect=0 means non-portal.

Firstly, it's intuitive that if the user doesn't want to detect something, it means the user doesn't want that something. But ver 3.0.0beta does the opposite: portal_detect=0 means is_portal=1, which is counter-intuitive.

Secondly, this design breaks working configurations. In ver 2.0.0, which shipped with openwrt, if the user set portal_detect=0, mesh11sd doesn't do anything, but ver 3.0.0beta starts the dnsmasq service and sets the device as an authoritative DHCP server. This behavior may result in multiple authoritative DHCP servers in the network and break it.

So, please reconsider the design of portal_detect. Again, thank you very much.

Since I misunderstood some basic concepts about mesh, I leave the following, hopefully all-correct, notes for whoever may need them.

On Gate

Excerpt from 4.3.20.4 IEEE 802.11 components and mesh BSS, 802.11-2016.pdf

a mesh STA is not a member of an IBSS or an infrastructure BSS. Consequently, mesh STAs do not communicate with nonmesh STAs. ... However,... mesh STAs can communicate with nonmesh STAs. Therefore, a logical architectural component is introduced in order to integrate the MBSS with the DS — the mesh gate.

When an MBSS accesses the IEEE 802.11 DS through its mesh gate, the MBSS can be integrated with a non-IEEE-802.11 LAN. To integrate the IEEE 802.11 DS to which this MBSS connects, the DS needs to contain a portal. See 4.3.7. Consequently, mesh gate and portal are different entities. The portal integrates the IEEE 802.11 architecture with a non-IEEE-802.11 LAN (e.g., a traditional wired LAN), whereas the mesh gate integrates the MBSS with the IEEE 802.11 DS.

This explains clearly what a gate is, and what a portal is.

If the terms cause a headache, we can draw equal signs to help understanding; please expect less technical precision.
MBSS = mesh STAs = mesh communications = mesh IDs
IBSS = DS = IEEE 802.11 DS = non-mesh STAs = AP = WiFi SSIDs

Mesh and AP don't talk to each other by nature, but a gate allows them to talk to each other. So this code is correct:

check_gate() {
# "-m 1 -o" added for optimization
  is_gate=$(iw dev | grep -m 1 -o -w "type AP")

  if [ -z "$is_gate" ]; then
    uci set mesh11sd.mesh_params.mesh_connected_to_gate='0'
  else
    uci set mesh11sd.mesh_params.mesh_connected_to_gate='1'
  fi
}

If a device is a mesh station and also an AP, we would want to set mesh_connected_to_gate=1. This indicates that the station has a mesh path to a gate. But in some special cases we can also set mesh_connected_to_gate=0 if we know what we are doing. It's totally legit.
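
The detection in check_gate can be tried on sample text without touching uci or a radio. This is a sketch only: `has_ap_interface` is a hypothetical wrapper around the same grep, and both samples are simplified, made-up "iw dev" listings.

```shell
#!/bin/sh
# Succeed if "iw dev" style text on stdin lists at least one AP interface.
has_ap_interface() {
    grep -q -w "type AP"
}

# Hypothetical node with a mesh interface and an AP.
with_ap='phy#0
    Interface mesh0
        type mesh point
    Interface wlan0
        type AP'

# Hypothetical node with a mesh interface only.
mesh_only='phy#0
    Interface mesh0
        type mesh point'

for sample in "$with_ap" "$mesh_only"; do
    if echo "$sample" | has_ap_interface; then
        echo "mesh_connected_to_gate would be set to 1"
    else
        echo "mesh_connected_to_gate would be set to 0"
    fi
done
```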

On AS

Excerpt from 12.6.1.3.4 Security association in an MBSS, 802.11-2016.pdf

In order to create a secure peering, mesh STAs first authenticate each other and create a mesh PMKSA. This can be done using either SAE or IEEE Std 802.1X.

When...(using) sae, (details described, no AS mentioned)...

When... (using) ieee8021x... IEEE 802.1X authentication shall be performed between the two peers according to the following:

a) If only one mesh STA has the Connected to AS field set to 1, that STA shall act as the IEEE 802.1X Authenticator and the other STA shall act as the IEEE 802.1X Supplicant;

b) If both mesh STAs have the Connected to AS field set to 1, then the mesh STA with the higher MAC address shall act as the IEEE 802.1X Authenticator and the other mesh STA shall act as the IEEE 802.1X Supplicant (see 12.7.1 for MAC address comparison)...
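
Rules a) and b) amount to a small decision function, sketched here under stated assumptions. `pick_authenticator` is a hypothetical name. The excerpt defers to 12.7.1 for the MAC comparison; this sketch assumes the address is compared as an unsigned integer with the first octet most significant, which for equal-length, colon-separated lowercase hex is the same as a plain string comparison. The case where neither STA sets the field is elided in the excerpt, so it is reported as "not covered".

```shell
#!/bin/sh
# Usage: pick_authenticator AS_BIT_A MAC_A AS_BIT_B MAC_B
# Prints "A" or "B" for the peer acting as IEEE 802.1X Authenticator,
# or "not covered" when the quoted rules do not apply.
pick_authenticator() {
    as_a=$1
    mac_a=$(echo "$2" | tr 'A-F' 'a-f')
    as_b=$3
    mac_b=$(echo "$4" | tr 'A-F' 'a-f')
    case "$as_a$as_b" in
        10) echo "A" ;;   # rule a): only A has Connected to AS set
        01) echo "B" ;;   # rule a): only B has Connected to AS set
        11)
            # rule b): higher MAC wins. Appending "" forces awk to
            # compare the values as strings (assumed ordering, see above).
            if LC_ALL=C awk -v a="$mac_a" -v b="$mac_b" \
                'BEGIN { exit !((a "") > (b "")) }'; then
                echo "A"
            else
                echo "B"
            fi
            ;;
        *) echo "not covered" ;;
    esac
}

# Example MAC addresses are made up for illustration.
pick_authenticator 1 02:00:00:00:00:01 0 02:00:00:00:00:ff
pick_authenticator 1 02:00:00:00:00:01 1 02:00:00:00:00:FF
```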

"AS" means IEEE 802.1X authentication server. If we use SAE, mesh_connected_to_as is irrelevant; it's harmless to set though.

bluewavenet commented 10 months ago

@chrdev

portal_detect=0 means non-portal.

No, it does not. It means portal detection is disabled. This is 100% intuitive ie "1" means "yes" and "0" means "no".

Secondly, this design breaks working configurations.

It is not intended to break anything, but this is a beta version, under current testing so time will tell.

However it might break something used in a previous version. This is why the version was jumped to v3.x.x (see Semantic Versioning https://semver.org/ ) On release perhaps this should be made clear for the benefit of people who are unaware of Semantic versioning.

ver 3.0.0beta starts the dnsmasq service and sets the device as an authoritative DHCP server. This behavior may result in multiple authoritative DHCP servers in the network and break it.

If you set portal_detect=0, then yes it will break the network because a node will make no attempt to find out if it is a portal or not. The config option portal_detect is provided for users to cater for special edge cases and normally should not be changed from the default of "1" unless you know what you are doing with a very special use case.

So, please reconsider the design of portal_detect. Again, thank you very much.

Since I misunderstood some basic concepts about mesh.

I am afraid you are still misunderstanding even the basic concept. The portal_detect functionality is working as designed, thank you very much. /s

but in some special cases we can also set mesh_connected_to_gate=0 if we know what we are doing. It's totally legit.

Sure, but then:

  1. the mesh will not know there is an AP present
  2. the mesh node will not advertise the presence of the AP

if we know what we are doing

Well now, there is a statement. I am happy to help you get to the stage where you know what you are doing if you wish.

On AS

802.1X is a protocol that can be used for authentication. The document you quoted is concerned with early attempts to add authentication and encryption to the 802.11s mesh standards.

As far as I am aware it was never adopted, instead sae/aes authentication and encryption was implemented and this is what is now used in 802.11s networks as standard. See: https://en.wikipedia.org/wiki/Advanced_Encryption_Standard

and

https://en.wikipedia.org/wiki/Simultaneous_Authentication_of_Equals

AS means "Authentication Server". So a portal meshnode will be "connected to an authentication server" eg a captive portal or, in the simple case of a domestic (home) network, connected to an open Internet feed.

You are most likely confusing the autonomous and invisible meshnode peering authentication/encryption process and "Authentication" requirements for a USER to access the Internet.

mesh_connected_to_as is irrelevant

mesh_connected_to_as is the parameter used by a meshnode to advertise the fact it is a portal and is a vital parameter used for auto configuration.

I apologise for my sarcasm, but you do seem very insistent. I am very happy to help explain the inner workings and once you understand it, your feedback and suggestions will be very welcome.

I am well aware that there are many papers and articles that can be found that for the most part are "suggested enhancements", "proposed methods", or just plain outdated (I have probably read them all). These just add to the confusion of course as they have either never been relevant, or no longer are.