Open pony1k opened 3 months ago
Maybe some race condition is happening here? I remember that some time ago there was a problem where the wrong interface was added to the bridge as the first one, setting its MAC address to some harmful value. I also think this was solved, maybe by adding the dummy0 interface, but I cannot remember (and I found a message of mine mentioning that we could remove dummy0 altogether, as it should not be needed anymore: https://github.com/libremesh/lime-packages/issues/189#issuecomment-445493854).
Good idea! I think the race condition has been solved by no longer using dummy0 and instead changing the MAC address of all hardifs, so that the main MAC address can never be the same as the one of br-lan (which I think was the problem here). See https://github.com/libremesh/lime-packages/pull/726/commits/4ed70e565e2e7ce92ae38d24c34ca23e480ca97f. But maybe there is another race condition with the bridge fdb that somehow has to do with the fact that the MAC address of the other router is seen through two interfaces, both bat0 and the ethernet interface. I will try to figure this out in two weeks or so, when I have time (if no one else has figured it out by then).
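To check whether any interface still shares its MAC address with br-lan on an affected router, something like the following could be used. This is only a rough sketch: the helper names are made up for illustration, and it assumes the standard Linux sysfs file `/sys/class/net/<iface>/address` is readable on the device.

```python
from pathlib import Path


def read_mac(iface, sysfs_root="/sys/class/net"):
    """Return the MAC address of iface in lowercase, or None if it cannot be read."""
    try:
        return Path(sysfs_root, iface, "address").read_text().strip().lower()
    except OSError:
        return None


def ifaces_sharing_mac(macs, reference="br-lan"):
    """Given {iface: mac}, list the interfaces whose MAC equals the one of `reference`."""
    ref = macs.get(reference)
    if ref is None:
        return []
    return sorted(
        iface
        for iface, mac in macs.items()
        if iface != reference and mac is not None and mac.lower() == ref.lower()
    )
```

On a router, the dictionary could be built with `{name: read_mac(name) for name in ("br-lan", "bat0", "lan1_29")}`, where the interface list is just an example taken from this thread. An empty result would suggest that the MAC collision is not what is happening.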
Amazing, you have a great memory :D
I believe that adding some documentation about how to configure the routers in this scenario can mitigate this issue, do you agree?
So, for the upcoming release we should either tackle this properly or (simpler) add comments on the website (on lime-docs' /docs/lime-example.txt @pony1k already added documentation in https://github.com/libremesh/lime-packages/pull/1085 ) indicating how to manage this.
Also, this interface-specific configuration should be exposed via lime-app, as it is veeery common for users to connect two LibreMesh devices via ethernet (I believe). Opinions? @selankon @javierbrk @G10h4ck
Issue #1008 is like a child of this one. I hope that when this one is fixed, #1008 will really be fixed too.
Continuing #1118 here because I'm not sure if trying to detect switches between routers is the right way to go forward.
Let's discuss the issue described by @ilario in this mail: https://lists.autistici.org/message/20240714.140352.58fe57b2.en.html
In the default configuration, ethernet interfaces are added to the br-lan bridge, while also being configured as batadv hard interfaces. In some setups, this leads to error messages appearing in the kernel log at a high rate and to network instability.
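To see which ports are in that double role on a given router, one could compare the bridge ports with the batadv hard interfaces. The sketch below is only an illustration: the function names are made up, it assumes `batctl if` prints lines like `lan1_29: active`, and it assumes the hard interface is named `<device>_<vlan id>` (as with `lan1_29` in this thread).

```python
import re


def parse_batctl_if(output):
    """Extract hard interface names from `batctl if` output, e.g. 'lan1_29: active'."""
    names = []
    for line in output.splitlines():
        match = re.match(r"\s*(\S+):\s+\S+", line)
        if match:
            names.append(match.group(1))
    return names


def base_device(hardif):
    """Strip the assumed '_<vlan id>' suffix: 'lan1_29' -> 'lan1'."""
    return re.sub(r"_\d+$", "", hardif)


def bridged_and_meshed(bridge_ports, hardifs):
    """Return the devices that are bridge ports and also carry a batadv hard interface."""
    return sorted({base_device(h) for h in hardifs} & set(bridge_ports))
```

The bridge ports can be listed with `os.listdir("/sys/class/net/br-lan/brif")`, and the `batctl if` output can be captured with `subprocess.run(["batctl", "if"], capture_output=True, text=True).stdout`.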
It was suspected that the error appears iff there is a switch between the two routers. I tried to reproduce the issue with a dumb switch, but without success (everything working fine, no errors in kernel log), as described in this mail: https://lists.autistici.org/message/20240726.150840.dcc0e028.en.html
I then tried to reproduce it by replacing the switch with an OpenWrt router (without DSA), basically acting as a manageable switch, with no success either.
Then, when I connected the two LibreMesh routers directly, surprisingly I could observe the issue. I could observe the error messages in the kernel logs and batadv didn't mesh over ethernet. On `mr70x-v1`, `batctl n` did not list the `fritz4040` as neighbour on the lan interfaces, also `batctl bbt` showed no routers in the backbone table on both routers. `batctl tcpdump lan1_29` could see batman OGMs appearing on the `lan1_29`, not sure why the interface was not showing up in the neighbour table.

After a while, the wifi connection between my laptop and the routers became quite unusable. When I ran tcpdump on the mesh interfaces I found that there was a lot of broadcast traffic and some frames were duplicated many times (I saw ICMPv6 messages with the same id and seq-no many times over long time periods). Plus, on my laptop, I saw the same echo request being received over and over at a high rate. So there was a loop and that clogged the wifi interface. It is not the kind of loop I described in #1032.
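For anyone who hits this again, the duplicated echo requests can be counted from a saved tcpdump text capture instead of by eye. The following is a minimal sketch, not a tested tool: the function names are made up, and the regular expression assumes tcpdump lines of the form `IP6 <src> > <dst>: ICMP6, echo request, id <n>, seq <n>, ...` (some tcpdump versions may print this differently).

```python
import re
from collections import Counter

# Assumed tcpdump text format for ICMPv6 echo requests.
ECHO_RE = re.compile(
    r"IP6 (?P<src>\S+) > (?P<dst>\S+): ICMP6, echo request, "
    r"id (?P<id>\d+), seq (?P<seq>\d+)"
)


def count_echo_requests(lines):
    """Count how often each (src, dst, id, seq) echo request appears in the capture."""
    counts = Counter()
    for line in lines:
        match = ECHO_RE.search(line)
        if match:
            key = (match["src"], match["dst"], int(match["id"]), int(match["seq"]))
            counts[key] += 1
    return counts


def duplicated(counts, threshold=2):
    """Keep only the echo requests seen at least `threshold` times."""
    return {key: n for key, n in counts.items() if n >= threshold}
```

With a capture saved as text, `duplicated(count_echo_requests(open("capture.txt")), threshold=10)` would list the requests that showed up ten times or more. The threshold is a guess and may need tuning, since some repetition of broadcast frames on the mesh interfaces might be normal.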
Later I booted the routers again, to further investigate the issue. Annoyingly, everything is working fine now. No kernel logs, meshing over ethernet works, no frames looping around. I'm not able to replicate the issue again. I also tried resetting the configuration to firstboot state, but to no avail. So, unfortunately, it is currently not possible for me to find out when exactly this happens and why.
I find it strange that batman is also configured on `eth0` on DSA-enabled devices. I don't think we are supposed to use that directly. Next time someone observes this issue, maybe they could add [...] to the `lime-node` file and see if it helps.