openNDS / mesh11sd

Mesh11sd is a dynamic parameter configuration daemon for 802.11s mesh networks.
GNU General Public License v2.0

check_portal always disables dhcp on standard routers #31

Closed staples1347 closed 1 month ago

staples1347 commented 5 months ago

I've just started using mesh11sd 2.0.0 on openwrt and it looks like check_portal never detects a portal (at least on normal routers), so dhcp is then disabled on the router. Since portal detection is the default setting, this can cause problems for clients on a default install. The master version of the script looks like it might have this problem too.

This is caused by the line:

wan_ip=$(echo "$default_gw" | awk -F" " '{printf "%s", $7}')

which doesn't seem to be correct. Here are two examples where the 7th field is not the wan ip, which then also causes is_portal=$(ip addr | grep "$wan_ip") to end up with an empty string:

default via <isp_gw_ip> dev pppoe-wanppp proto static
default dev qmimux0 proto static scope link src <wan_ip> metric 1

Changing awk to use the 9th parameter would fix the problem for the second default route example, but for the first default route example, a different method would need to be used to lookup the wan ip.
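
To illustrate why a fixed field position is fragile, here is a minimal sketch that takes the token following the src keyword instead of a numbered field. The function name src_from_route is hypothetical and not part of mesh11sd, and the sample routes use documentation addresses (192.0.2.x) in place of the redacted ones.

```shell
#!/bin/sh
# Hypothetical helper, not part of mesh11sd: print the address that follows
# the "src" keyword in a single route line, or nothing if "src" is absent.
src_from_route() {
    echo "$1" | awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "src") { printf "%s", $(i + 1); exit }
        }
    }'
}

# ip-full style line: the address is field 9 here, field 7 is "link".
src_from_route "default dev qmimux0 proto static scope link src 192.0.2.10 metric 1"
echo

# PPPoE style line: there is no "src" keyword, so the result is empty and
# the wan ip has to be looked up some other way.
src_from_route "default via 192.0.2.1 dev pppoe-wanppp proto static"
echo
```

This handles the second example regardless of how many fields precede src, but it still returns nothing for the first example, which matches the observation above that PPPoE needs a different lookup.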

Is this a bug or do captive portals configure things a bit differently?

bluewavenet commented 5 months ago

@staples1347 Thank you for your in depth look - always appreciated.

To look into this further we need to see some information about your setup, like the router model, version of openwrt and the basic config files.

> it looks like check_portal never detects a portal (at least on normal routers)

That is a bit of an all encompassing statement! Perhaps you mean it does not work on your particular router ;-)

The term "portal" is used to describe if a mesh node has a "wan" connection to an upstream router, so the intention is to switch to standard "peer" mode if the upstream is not detected, but I am sure you are aware of this.

To date, the only routers tested where this has been found to fail are those based on the mt7628 soc (eg gl-inet mt300n-v2). It occurs when wan ethernet is not connected but lan ethernet is. This has been fixed by a refactoring of the check_portal code, which has not yet been pushed to or merged into master on Github.

In your first example: default via <the_isp_gw> dev pppoe-wanppp proto static

This looks very odd. It could be a result of your basic config, or it could be something else - we need to see those configs ;-)

Have you tried setting option portal_detect '0' on this device?

staples1347 commented 5 months ago

I'm using OpenWRT 23.05.2. One of the routers (live config) is a TP-Link Archer C7 v5.0 (ath79-generic) with a PPPoE wan config to the ISP and dhcp running on br-lan. wanppp is just a custom name I gave to the interface, as it links to a secondary wan port, with the primary port using dhcp client for additional flexibility. Another router I tested is an Asus RT-AX53U (ramips-mt7621) with the gateway configured on br-lan and dhcp also running on br-lan.

Setting option portal_detect '0' would bypass this problem so would probably resolve it, but it isn't the default config of mesh11sd and also isn't mentioned as needed at https://openwrt.org/docs/guide-user/network/wifi/mesh/mesh11sd which is why I posted this bug report.

Are you able to post an example default route that would properly detect a captive portal?

Here is part of the network config of the TP-Link router:

config interface 'loopback'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'
        option device 'lo'
config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'eth0.1'
config interface 'lan'
        option proto 'static'
        option ipaddr '<lan_ip>'
        option netmask '255.255.255.0'
        option device 'br-lan'
config interface 'wan'
        option proto 'dhcp'
        option device 'eth0.2'
config interface 'wanppp'
        option proto 'pppoe'
        option username '<pppoe_username>'
        option password '<pppoe_password>'
        option keepalive '6'
        option defaultroute '1'
        option peerdns '1'
        option device 'eth0.3'
config device 'wan_eth0_2_dev'
        option name 'eth0.2'
        option macaddr '<mac_address>'
config interface 'wan6'
        option proto 'dhcpv6'
        option device 'eth0.2'
config switch
        option name 'switch0'
        option reset '1'
        option enable_vlan '1'
config switch_vlan
        option device 'switch0'
        option vlan '1'
        option ports '2 3 4 0t'
config switch_vlan
        option device 'switch0'
        option vlan '2'
        option ports '1 0t'
config switch_vlan
        option device 'switch0'
        option vlan '3'
        option ports '5 0t'

Here is the network config of the Asus router:

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'
config globals 'globals'
        option packet_steering '1'
config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'wan'
config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '<lan_ip>'
        option netmask '255.255.255.0'
        option gateway '<main_router_ip>'
        option dns '<main_router_ip>'
config interface 'wan'
        option proto 'dhcp'
config interface 'wan6'
        option proto 'dhcpv6'
        option reqaddress 'try'
        option reqprefix 'auto'

staples1347 commented 5 months ago

I've also just checked a TP-Link Archer C7 5.0 router being used for guest wifi running OpenWRT 21.02.3 that is using dhcp client on wan and it has the wan ip address at the 9th field, not the 7th: default via <gw_ip> dev eth0.2 proto static src <wan_ip>

Here is the network config for that router:

config interface 'loopback'
        option ifname 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'
config interface 'lan'
        option type 'bridge'
        option ifname 'eth0.1'
        option proto 'static'
        option ipaddr '<lan_ip>'
        option netmask '255.255.255.0'
config interface 'wan'
        option ifname 'eth0.2'
        option proto 'dhcp'
config device 'wan_eth0_2_dev'
        option name 'eth0.2'
        option macaddr '<mac_addr>'
config interface 'wan6'
        option ifname 'eth0.2'
        option proto 'dhcpv6'
config switch
        option name 'switch0'
        option reset '1'
        option enable_vlan '1'
config switch_vlan
        option device 'switch0'
        option vlan '1'
        option ports '2 3 4 5 0t'
config switch_vlan
        option device 'switch0'
        option vlan '2'
        option ports '1 0t'

staples1347 commented 5 months ago

Ah I just figured it out. I am doing custom OpenWRT builds using Image Builder and I add the non-busybox version of ip in those builds. Sorry for assuming it was a problem on standard routers.

If I run "busybox ip route | grep default" on the router that is using the dhcp client on wan, it returns the wan ip as the 7th field: default via <gw_ip> dev eth0.2 src <wan_ip>. The earlier qmimux0 example (on a Teltonika RUTX11 using Teltonika firmware) is also resolved by using busybox ip: default dev qmimux0 scope link src <wan_ip> metric 2
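
The difference between the two ip implementations can be reproduced without a router by feeding sample lines through the same kind of positional awk lookup. This is only a sketch: field_n is a hypothetical name, and the addresses are documentation values standing in for the redacted <gw_ip> and <wan_ip>.

```shell
#!/bin/sh
# Hypothetical helper: print field number $2 of the route line given in $1,
# mirroring the positional awk lookup quoted earlier in this thread.
field_n() {
    echo "$1" | awk -v n="$2" '{ printf "%s", $(n + 0) }'
}

# Same default route as printed by ip-full (with "proto static") and by
# busybox ip (without it).
ipfull_line="default via 192.0.2.1 dev eth0.2 proto static src 192.0.2.10"
busybox_line="default via 192.0.2.1 dev eth0.2 src 192.0.2.10"

echo "ip-full field 7: $(field_n "$ipfull_line" 7)"
echo "ip-full field 9: $(field_n "$ipfull_line" 9)"
echo "busybox field 7: $(field_n "$busybox_line" 7)"
```

The extra "proto static" pair in the ip-full output shifts the address by two positions, which is why field 7 holds the word static there and the wan ip only with busybox ip.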

This still doesn't help for routers using PPPoE on wan:

default via <gw_ip> dev pppoe-wanppp

or routers using a static wan config:

default via <gw_ip> dev br-lan

or a better static wan example, where it is connected to a business class isp (not using this package, just an example where it would fail the test):

default via <isp_gw> dev wan.101 metric 10

None of these routes has a src field, so some additional error checking might be needed in those cases. But adding busybox in front of ip should help in a lot of cases, if that works okay on normal OpenWRT builds.
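
For default routes without a src field, one possible fallback is to read the interface name after dev and then take the first IPv4 address configured on that interface. The sketch below is an assumption about how that could be done, not the mesh11sd implementation. Both function names are hypothetical, and the parsing is fed with sample text so it can be tried off-router.

```shell
#!/bin/sh
# Hypothetical helper: print the interface name that follows "dev" in a
# single route line.
dev_from_route() {
    echo "$1" | awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "dev") { printf "%s", $(i + 1); exit }
        }
    }'
}

# Hypothetical helper: read "ip addr show dev <name>" style text on stdin and
# print the first IPv4 address, with any /prefix length removed.
first_inet() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {
                a = $(i + 1)
                sub(/\/.*/, "", a)
                printf "%s", a
                exit
            }
        }
    }'
}

# On a router the two could be combined like this (not executed here):
#   dev=$(dev_from_route "$default_gw")
#   wan_ip=$(ip addr show dev "$dev" | first_inet)

dev_from_route "default via 192.0.2.1 dev pppoe-wanppp"
echo
printf '%s\n' "    inet 192.0.2.10 peer 192.0.2.1/32 scope global pppoe-wanppp" | first_inet
echo
```

Note that for the br-lan example this lookup would return the lan address, so whether such a node should count as a portal remains a separate decision.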

staples1347 commented 5 months ago

Actually, looking into it more, busybox ip addr | grep "" does return content, so it's only when the wan_ip variable has been filled with the word static, or some other string that is not an ip address, that is_portal doesn't get a result. So adding busybox in front of ip should fix most of the problems and, if it works on stock openwrt firmware, would be an easy fix for the 2.0.0 series. Some additional error checking in between is_wan=... and is_portal=... might help as well.
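
The kind of error checking mentioned here could be as simple as confirming that wan_ip has the shape of an IPv4 address before it is used as a grep pattern. The following is a sketch under that assumption. looks_like_ipv4 is a hypothetical name, and the check is deliberately loose (it does not reject octets above 255).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument has the shape of a
# dotted-quad IPv4 address. Shape check only, octet ranges are not verified.
looks_like_ipv4() {
    echo "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# An empty pattern matches every input line, which is why an empty wan_ip
# still produces a non-empty grep result.
printf 'line one\nline two\n' | grep -c ""

# A non-address value such as "static" is rejected before any grep is run.
wan_ip="static"
if looks_like_ipv4 "$wan_ip"; then
    echo "wan_ip '$wan_ip' looks like an address"
else
    echo "wan_ip '$wan_ip' is not an address, skipping the grep"
fi
```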

bluewavenet commented 5 months ago

@staples1347 Thanks for the investigations! So in conclusion, this is only a problem if you have both ip-full installed AND are using a PPPoE wan connection?

Can you show a comparison between ip route and busybox ip route with ip-full installed?

I also merged the latest beta into master on github. It might be worth checking it out to see how that works....

staples1347 commented 5 months ago

It is a problem for routers that have both ip-full installed and are either using a PPPoE wan connection or are using a static wan connection.

Here is ip route output with some ip addresses filtered out and vpn routing info:

default via <isp_gw_ip> dev pppoe-wanppp proto static
<remote_vpn_network_1>/24 via <remote_vpn_ip> dev wgvpn1 proto bird metric 32
<remote_vpn_network_2>/28 via <remote_vpn_ip> dev wgvpn1 proto bird metric 32
<lan_network>/24 dev br-lan proto kernel scope link src <lan_ip>
<lan_network>/24 dev br-lan proto bird scope link metric 32
<isp_gw_ip> dev pppoe-wanppp proto kernel scope link src <wan_ip>

Here is busybox ip route output with some ip addresses filtered out and vpn routing info:

default via <isp_gw_ip> dev pppoe-wanppp
<remote_vpn_network_1>/24 via <remote_vpn_ip> dev wgvpn1  metric 32
<remote_vpn_network_2>/28 via <remote_vpn_ip> dev wgvpn1  metric 32
<lan_network>/24 dev br-lan scope link  src <lan_ip>
<lan_network>/24 dev br-lan scope link  metric 32
<isp_gw_ip> dev pppoe-wanppp scope link  src <wan_ip>
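
In both listings the wan ip does appear, just not on the default line: it is the src of the connected route on the same device. A minimal sketch of using that, assuming the whole route table is available on stdin, is shown below. wan_ip_from_routes is a hypothetical name and this is not taken from the mesh11sd source. The sample addresses replace the redacted placeholders.

```shell
#!/bin/sh
# Hypothetical helper, not part of mesh11sd. Reads "ip route" style text on
# stdin and prints the source address associated with the default route.
wan_ip_from_routes() {
    awk '
        # Return the token that follows keyword kw on the current line.
        function after(kw,    i) {
            for (i = 1; i < NF; i++) {
                if ($i == kw) return $(i + 1)
            }
            return ""
        }
        {
            d = after("dev")
            s = after("src")
            if ($1 == "default") {
                # Remember only the first default route.
                if (gwdev == "") { gwdev = d; gwsrc = s }
            } else if (d != "" && s != "" && !(d in srcof)) {
                # Keep the first src seen on each device.
                srcof[d] = s
            }
        }
        END {
            if (gwsrc != "") print gwsrc
            else if (gwdev in srcof) print srcof[gwdev]
        }
    '
}

# Sample modelled on the busybox listing above.
wan_ip_from_routes <<'EOF'
default via 192.0.2.1 dev pppoe-wanppp
10.1.0.0/24 via 10.9.0.2 dev wgvpn1  metric 32
192.168.1.0/24 dev br-lan scope link  src 192.168.1.1
192.168.1.0/24 dev br-lan scope link  metric 32
192.0.2.1 dev pppoe-wanppp scope link  src 192.0.2.10
EOF
```

Because it keys on the dev and src keywords rather than field positions, the same function gives the same answer for the ip-full listing with its extra proto fields.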

bluewavenet commented 5 months ago

@staples1347

> Here is ip route output

That's useful thanks - I don't currently have a means of testing with pppoe - I can at least simulate this now. It looks, subject to more testing/simulating, that it is fixed in the latest v3-beta (not yet pushed to Github). I'll have time in a day or so to test/push/merge......

AcidSlide commented 2 months ago

I would like to confirm this is also happening on two of my routers (same brand/model) since December 2023 (I think), but only in the following scenario.

bluewavenet commented 2 months ago

@AcidSlide If a router is not connected to its upstream feed (eg an Internet connection present on its wan port), by default, mesh11sd will switch to mesh peer mode (this includes turning off the dhcp server). If you want to force running in mesh portal mode (the mode it would have if it had upstream wan), just configure option portal_detect to 0

AcidSlide commented 2 months ago

> If a router is not connected to its upstream feed (eg an Internet connection present on its wan port), by default, mesh11sd will switch to mesh peer mode (this includes turning off the dhcp server).

It is connected to an upstream with an internet connection. The issue here is that it sets the br-lan DHCP to 'ignore this interface' upon first boot under the conditions I've mentioned (after a sysupgrade, or when uploading a config archive and applying it).

There is also a problem with Kernel 6.6, as it switches the system to mesh node even before the router has time to connect to the upstream connection. This is actually the reason I checked the existing issues, which is when I saw this one. I'll post a separate issue about this when I have time to re-connect my UART to one of my test routers. At first I thought the test kernel 6.6 had an issue, but after flashing another router using qualcommax (which uses 6.6 as the default kernel), I noticed the same issue. See my comment here: https://github.com/openwrt/openwrt/commit/b7a900782840de767e2aa751fbaf1575dc6abda4#commitcomment-140685549

bluewavenet commented 2 months ago

@AcidSlide

> the issue here is it sets the br-lan DHCP to 'ignore this interface' upon first boot

If the wan has not come up yet then yes it will, but as soon as the wan does come up it will turn dhcp back on.
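
The boot-time race discussed in this exchange (detection running before the wan is up) could, as one option, be softened by retrying the upstream check for a short grace period before falling back to peer mode. The sketch below only illustrates that idea. wait_for_upstream is a hypothetical name, the retry counts are arbitrary, and nothing here is taken from mesh11sd.

```shell
#!/bin/sh
# Hypothetical helper: run a check command up to $1 times, sleeping $2
# seconds after each failed attempt. Succeeds as soon as the check succeeds,
# fails if every attempt fails.
wait_for_upstream() {
    wfu_tries=$1
    wfu_delay=$2
    shift 2
    wfu_n=0
    while [ "$wfu_n" -lt "$wfu_tries" ]; do
        if "$@"; then
            return 0
        fi
        wfu_n=$((wfu_n + 1))
        sleep "$wfu_delay"
    done
    return 1
}

# On a router the check could be a small function such as (not executed here):
#   has_default_route() { ip route | grep -q '^default'; }
#   wait_for_upstream 30 2 has_default_route || echo "no upstream, peer mode"
```

The check command is passed in as arguments so the same loop can wrap whatever test is considered reliable on a given device.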