libremesh / lime-packages

LibreMesh packages configuring OpenWrt for wireless mesh networking
https://libremesh.org/
GNU Affero General Public License v3.0

Problems with ethernet connections between routers in default configuration #1121

Open pony1k opened 3 months ago

pony1k commented 3 months ago

Continuing #1118 here because I'm not sure if trying to detect switches between routers is the right way to go forward.

Let's discuss the issue described by @ilario in this mail: https://lists.autistici.org/message/20240714.140352.58fe57b2.en.html

In the default configuration, ethernet interfaces are added to the br-lan bridge while also being configured as batadv hard interfaces. In some setups, this leads to error messages appearing in the kernel log at a high rate, and to network instability.

It was suspected that the error appears iff there is a switch between the two routers. I tried to reproduce the issue with a dumb switch, but without success (everything working fine, no errors in kernel log), as described in this mail: https://lists.autistici.org/message/20240726.150840.dcc0e028.en.html

I then tried to reproduce it by replacing the switch with an OpenWrt router (without DSA), basically acting as a manageable switch, with no success either.

Then, when I connected the two LibreMesh routers directly, I could surprisingly observe the issue: the error messages appeared in the kernel logs and batadv didn't mesh over ethernet. On the mr70x-v1, batctl n did not list the fritz4040 as a neighbour on the lan interfaces, and batctl bbt showed no routers in the backbone table on either router. batctl tcpdump lan1_29 could see batman OGMs appearing on lan1_29, so I'm not sure why the interface was not showing up in the neighbour table. After a while, the wifi connection between my laptop and the routers became quite unusable. When I ran tcpdump on the mesh interfaces, I found a lot of broadcast traffic, and some frames were duplicated many times (I saw ICMPv6 messages with the same id and seq-no many times over long time periods). Plus, on my laptop, I saw the same echo request being received over and over at a high rate. So there was a loop, and that clogged the wifi interface.
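The duplicate check described above (the same ICMPv6 id and seq-no showing up many times) can be sketched as a small shell filter. This is only an illustration: count_duplicate_echoes is a hypothetical helper name, and the input format (one "id seq" pair per line, extracted from a capture by whatever means) is an assumption, not something tcpdump prints directly.

```shell
# Hypothetical helper: read one "id seq" pair per line on stdin and print
# every pair that occurs more than once, as "id seq count".
# Assumption: the pairs were already extracted from a packet capture.
count_duplicate_echoes() {
    # sort groups identical pairs, uniq -c counts them,
    # awk keeps only pairs seen more than once and moves the count last.
    sort | uniq -c | awk '$1 > 1 { print $2, $3, $1 }'
}

# Example with made-up values: pair "4711 3" appears three times.
printf '4711 3\n4711 4\n4711 3\n4711 3\n' | count_duplicate_echoes
# prints: 4711 3 3
```

An occasional repeated pair can be a harmless retransmission or a second path; a count that keeps growing for the same pair over a long time is what points at a loop.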

It is not the kind of loop I described in #1032.

Later I booted the routers again to investigate the issue further. Annoyingly, everything is working fine now: no kernel log messages, meshing over ethernet works, no frames looping around. I'm not able to replicate the issue again. I also tried resetting the configuration to the firstboot state, but to no avail. So, unfortunately, it is currently not possible for me to find out when exactly this happens and why.

I find it strange that batman is also configured on eth0 on DSA-enabled devices. I don't think we are supposed to use that interface directly. Next time someone observes this issue, maybe they could add

config net
    option linux_name 'eth0'
    list protocols 'manual'

to the lime-node file and see if it helps.
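For anyone who wants to try this, the suggestion above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: mark_eth0_manual and LIME_NODE_FILE are hypothetical names, the file defaults to a scratch path so nothing is changed by accident, and on a router the file is assumed to be /etc/config/lime-node.

```shell
# Hypothetical helper: append the stanza suggested above to a lime-node file.
# LIME_NODE_FILE defaults to a scratch path; on a router it is assumed
# to be /etc/config/lime-node.
LIME_NODE_FILE="${LIME_NODE_FILE:-/tmp/lime-node.example}"

mark_eth0_manual() {
    # Do nothing if a stanza for eth0 already exists, to avoid duplicates.
    if [ -f "$LIME_NODE_FILE" ] && grep -q "option linux_name 'eth0'" "$LIME_NODE_FILE"; then
        return 0
    fi
    cat >> "$LIME_NODE_FILE" <<'EOF'

config net
    option linux_name 'eth0'
    list protocols 'manual'
EOF
}

# Usage (on a router, as an assumption about the usual workflow):
#   LIME_NODE_FILE=/etc/config/lime-node
#   mark_eth0_manual
#   lime-config    # regenerate the configuration, then reboot
```

The helper only edits the file; whether the change helps still has to be checked on a setup where the problem actually shows up.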

ilario commented 3 months ago

Maybe some race condition is happening here? I remember that some time ago there was a problem with the wrong interface being added to the bridge as the first one, setting its MAC address to some harmful value. I also think this was solved, maybe by adding the dummy0 interface, I cannot remember (and I found a message of mine mentioning that we could remove dummy0 altogether, as it should not be needed anymore: https://github.com/libremesh/lime-packages/issues/189#issuecomment-445493854).

pony1k commented 3 months ago

Good idea! I think the race condition has been solved by no longer using dummy0 and instead changing the MAC address of all hardifs, so that the main MAC address can never be the same as the one of br-lan (which I think was the problem here). See https://github.com/libremesh/lime-packages/pull/726/commits/4ed70e565e2e7ce92ae38d24c34ca23e480ca97f. But maybe there is another race condition with the bridge fdb that somehow has to do with the fact that the MAC address of the other router is seen through two interfaces, both bat0 and the ethernet interface. I will try to figure this out in two weeks or so when I have time (if no one else has figured it out by then).

ilario commented 3 months ago

Amazing, you have a great memory :D

ilario commented 6 days ago

I believe that adding some documentation about how to configure the routers in this scenario can mitigate this issue, do you agree?

So, for the upcoming release, we should either tackle this properly or (simpler) add comments on the website indicating how to manage this (in lime-docs' /docs/lime-example.txt, @pony1k already added documentation in https://github.com/libremesh/lime-packages/pull/1085).

ilario commented 6 days ago

Also, this interface-specific configuration should be exposed via lime-app, as it is veeery common for users to connect two LibreMesh devices via ethernet (I believe). Opinions? @selankon @javierbrk @G10h4ck

ilario commented 6 days ago

Issue #1008 is like a child of this one. I hope that when this one is fixed, #1008 will really be fixed too.