svlsResearch / ha-mikrotik

High availability code for Mikrotik routers

add support for bonding interfaces and patch HA_VRRP bug with new bridge mode #17

Closed fflo closed 1 year ago

fflo commented 4 years ago

Hey Nathan,

thanks for your AWESOME project! I wish this issue would get more official support from Mikrotik.

Attached my patches to support bonding interfaces and to fix some issues with HA_VRRP using the new bridge mode.

Tested on 2x CCR1072-1G-8S+ equipment using ether1 as HA-interface using firmware v6.45.8.

-FF

nathanfaber commented 4 years ago

Hi,

Thanks for the PR. Few comments...

  1. For the bonding support, what is your topology that requires that? My assumption is that everything starts with ethernet interfaces but I'm guessing you have some other type inside of those bonds that is causing a problem?
  2. Disabling VRRP early seems reasonable, thanks. Disabling it inside the loop effectively closes the window only a tiny bit, since it runs on every iteration; maybe we should just remove the delay and get there as soon as possible?
fflo commented 4 years ago

Hi Nathan,

  1. The CCRs are connected to an IRF HA switch stack, specifically via a dynamic LACP link aggregation over multiple 10GigE links. With this setup, the CCR can route traffic without interruption in the case of a GBIC, cabling, or single-switch outage. The setup also greatly reduces the complexity of cabling and device configuration, because all active routed interfaces on the HA CCRs are simply a VLAN on the configured bonding interface, i.e. bond0.[vlan#]
  2. For whatever reason, it did not work running the command "disable HA_VRRP bridge" before the loop on CCR1072-1G-8S+ equipment. Seems that the bridge interface is also not ready (loaded) before the loop starts. For debugging I have moved the command into the loop to get it executed once every second which did resolve the issue.
nathanfaber commented 4 years ago
  1. This design makes sense, I do something similar but what I am confused by is why we need to explicitly handle the bonds in ha-mikrotik. It disables all of the non-$haInterface ethernet interfaces already, which should include your 10G underlying components. Does this not sufficiently disable the bond? I'm guessing not since you needed to add this but I don't understand why disabling all of the ethernet interfaces isn't sufficient, is it clear to you why the bonds need to be handled explicitly?
  2. Yes, this makes sense, the startup code runs really early, which is why we need all of that code.
nathanfaber commented 4 years ago

An update on the bonding: I just did a brief test with a bond and two underlying ethernet interfaces. Using link-monitoring=mii, when all of the underlying interfaces are disabled, the bond is no longer marked R (running). This seems like it should work without explicitly disabling the bonds, so what am I missing? Are you using an alternate link-monitoring setting (none?) With the underlying interfaces disabled, LACP shouldn't function at all towards the switch, so I can't figure out the scenario where this goes bad.

I'm not against your patch, I'm just trying to keep ha-mikrotik as minimal as possible to reduce the bug/maintenance surface.

Thanks
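
For anyone wanting to reproduce the link-monitoring check described above, a minimal sketch follows. The bond name `bond-test`, the member ports `ether2` and `ether3`, and `mode=802.3ad` are assumptions for illustration; only `link-monitoring=mii` comes from the test described in this comment.

```
# Hypothetical test bond over two spare ports (all names are assumptions)
/interface bonding add name=bond-test slaves=ether2,ether3 mode=802.3ad link-monitoring=mii

# Disable both members, then ask whether the bond still reports as running
/interface ethernet disable ether2,ether3
:put [/interface get [find name="bond-test"] running]
```

In the test described above the bond lost its R flag at this point; other link-monitoring settings or hardware may behave differently.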

fflo commented 4 years ago

Hi Nathan,

if you do not disable the bond immediately on startup the bonding interface becomes active for a short period of time during reboot of the backup CCR.

I guess that's the case because the bonding interface is active and available until ALL underlying interface members are disabled by the script.

I have configured dynamic routing using OSPF on a Vlan and it did break the routing exchange before patching the bond and HA_VRRP bridge because the active CCR receives packets originating from itself and peer receives duplicate packets originating from both CCRs for about 1-2 seconds on each backup CCR reboot. Disabling the bond and HA_VRRP bridge on startup did resolve the issue for me.

I think it makes sense to enable the bond only when becoming the active master, after the underlying ethernet interfaces have already been enabled.

To further reduce the failover time, we should consider keeping the underlying interfaces enabled (because it takes multiple seconds to enable multiple 10 Gbps SFP+s) and disabling/enabling the virtual bonding interface only in case of a backup->master, master->... event.

Maybe you have an idea of how to achieve this goal in an elegant way.

-FF

nathanfaber commented 4 years ago

This does sound disruptive to your environment. Out of curiosity, did you try leaving the bonding disable/enable entirely within the startup? It seems like the /interface ethernet disable [find] is taking too long vs. how long it takes for your interfaces to come up, which creates your problem.

Can you test something like the following:

/interface bonding disable [find]
/interface ethernet disable [find]
/interface bonding enable [find]

And remove all of the other bonding enable/disable? It basically gives us a small window for messing with the bond during bootstrapping and by the time we re-enable them, the underlying interfaces SHOULD be disabled.

I'm really trying to avoid tampering with other components if we can. One problem with this is that we are disabling and enabling all bonds, even ones that the user may have intended to disable. I know we do it with the ethernet interfaces but I'd like to reduce the places that we do this, otherwise we run into "gotchas" that the user wasn't expecting to happen.

I don't really see a nice way to keep the interfaces enabled for a faster failover. We need to cleanly isolate the backup, and leaving them enabled allows for too many overlay configurations that we would then need to deal with (EoIP, etc).

Appreciate all of the feedback, hopefully we can narrow in on a nice solution that works for everyone.

nathanfaber commented 4 years ago

One other comment on "keeping interfaces up". I considered this years ago and played around with the idea but never came up with something that felt solid. I THINK one way we might be able to pull it off is with global firewall rules that filter out all ingress/egress on the non-$haInterface interfaces. In theory, this would allow layer 1 to come up and drop all other communication. It also would solve the enable/disable problem where we end up enabling interfaces that the user intended to disable. Any thoughts on this design?

Also, because of the cloned MACs, that may introduce some problems with this design depending on the upstream switches.

It would be nice if Mikrotik had some sort of soft disable, that keeps layer 1 up but prevents all communication on a per-interface basis.

nathanfaber commented 4 years ago

I briefly tested the firewall idea again just now and while it does work, there are MAC issues with PortSec: %ETH-4-HOST_FLAPPING on my Arista MLAG rigs. So not sure this is going to be a good approach.

fflo commented 4 years ago

I don't think it's a good idea trying to block outbound packets on layer-3 (firewall) instead of working with layer-2.

As a replacement for enabling/disabling the physical interfaces the cleanest or most performant solution would be to work with logical interfaces only which are bound to one or more physical interfaces.

For example bond or bridge with special name syntax: HA_...

nathanfaber commented 4 years ago

> I don't think it's a good idea trying to block outbound packets on layer-3 (firewall) instead of working with layer-2.
>
> As a replacement for enabling/disabling the physical interfaces the cleanest or most performant solution would be to work with logical interfaces only which are bound to one or more physical interfaces.
>
> For example bond or bridge with special name syntax: HA_...

This would force everyone to configure their devices in a special way. Many people use multiple ethernet interfaces with ha-mikrotik and the idea is to keep the configuration completely "normal" so nobody has to think about HA. I wouldn't want to require that everyone uses a logical interface to overlay the physical for it to work correctly.

fflo commented 4 years ago

I agree, how about moving this idea into a separate branch?

It's a shame that Mikrotik has no official support for this topic: RouterOS should at least natively support syncing the connection tracking table and IPsec states between two devices.

nathanfaber commented 4 years ago

We could branch it. How do we deal with the duplicate MAC problem though? We could reset to the original MAC, but then we need to deal with gratuitous ARP.

I know the convergence time has room for improvement but it generally works well for me and others, it takes a few seconds but it is a rare event here. I’m interested in figuring out a clean fix for your bonding problem without having bonding enable/disable everywhere. Any chance you can test the idea from the earlier comment? I can’t easily test it right now in my environment with bonding setup this way.

fflo commented 4 years ago

(1) Why do you expect issues with duplicate mac? There is no need to modify the original mac address of the hardware interfaces working with bond and/or bridge logical interfaces only. Only the logical interfaces need to share the same mac address per pair as these mac addresses are used for communication. Or did I miss something?

(2) If you like I can test a modified code not disabling the bond interface(s). The CCR cluster is not ready for production, yet. Right now I am working to further reduce the failover downtime of the dynamic routing convergence time using optimized BGP with BFD instead of using modified OSPF (and OSPFv3) settings.

nathanfaber commented 4 years ago

Are you not using auto-mac on your bridges/bonds? I was pretty sure RouterOS selects MAC based on one of the underlying interfaces. I know it can be explicitly overridden but that is extra admin work/gotcha.

Yes, if you could test the simplified bond enable/disable and let me know, that would be great.

What is your convergence time right now with the pair?

fflo commented 4 years ago

Retried the simplified bond enable/disable configuration, but it does not work on CCR1072-1G-8S+. Every standby reboot causes a traffic interruption of about 5 seconds.

The convergence time depends on the interface configuration because each enable/disable operation on an sfp+ interface requires ~2-3 seconds.

Using a bond of the first four sfp+ interfaces with min-links 1 and dynamic routing via BGP4 with BFD, I was able to reduce the convergence time from ~30s to about 10s with some more optimizations: https://github.com/svlsResearch/ha-mikrotik/compare/master...fflo:master

nathanfaber commented 4 years ago

Does every standby reboot currently cause a 5s interruption, or does one of your changes resolve this?

fflo commented 4 years ago

With my changes, standby reboots and HASwitchRole run smoothly.

But you asked me to re-test the simplified bond enable/disable in ha_startup only, and this simplified setup causes a ~5s traffic interruption on each standby reboot, that is, every time a configuration change has been loaded or at least once a day at 5 am.

If you look at my latest changes I have furthermore added the following optimizations and features:

  1. optional setup of an additional rescue interface with comment HA_RESCUE
  2. optimize enable/disable interface commands
  3. increase HASwitchRole $haWaitCount from 5s to 10s (should be probably even: 8x3s=24s for some CCR configurations using all sfp+ interfaces as dedicated interfaces)
  4. tune "delaying1 for hardware..." bugfix for CCR equipment

Furthermore, we should look for a more elegant solution to disable dynamic routing configurations in "on-backup" mode, to avoid the logs of the Standby being flooded by senseless error messages and to reduce convergence time in "on-master" event.

nathanfaber commented 4 years ago

Gotcha. Do you understand what went wrong with the disable/enable pattern? If bond is disabled as early as it is in your working patch and then the ethernet is disabled and then bond re-enabled, I don’t see how the interruption happens?

For the dynamic routing, do you think this can be sufficiently handled by the currently supported on_master and on_standby callbacks? We can also add additional callbacks for alternate places that may make your setup functional and easier to keep in sync with my master.

fflo commented 4 years ago

Setting up bonding forces a change of the original MAC address of the physical sfp+ interfaces; it seems to me that this change recurs internally on each reboot, causing the physical devices to flap (on/off) again for 1-2 seconds.

Re-enabling the bonding after "ha_startup step 0.3: /interface ethernet disable [find disabled=no]" still causes trouble on CCR equipment: The master is receiving packets originating from itself for some seconds and dynamic routing protocols like OSPF or BGP with BFD cause flapping events.

Of course, this should not happen to ethernet devices being soft deactivated; probably a design bug.

It seems cleaner to me to keep bonding interfaces disabled until they are needed in the "on-master" event.

fflo commented 4 years ago

Reading the latest changelog for v7, Mikrotik seems to have added connection tracking synchronization support to VRRP setups.

Maybe it's worth putting this project topic into the v7 beta forum to get more official support.

fflo commented 4 years ago

With regard to dynamic routing: Yes, on_master and on_standby callbacks are fine.

Do you have a suggestion on how to re-enable, in the "on-master" event, only those peers and interfaces which have not been soft-deactivated in the master configuration?

nathanfaber commented 4 years ago

It feels like a lot of the issues you are running into on that 8S+ are due to how slow it is to disable the SFP interfaces (as you described). I wonder if we can get them disabled quicker by using :execute to run in the background and try to get all interfaces disabled in parallel? I'm not exactly sure how the RouterOS scripting engine is implemented, and whether it is actually multi-threaded with respect to the state or whether there is a global lock for mutation of the state.

For tracking what was disabled/enabled before we mess with it, I have previously injected comments that give me some state information. We could append a comment to ones that we disable (but found enabled by the boot) and then remove that comment when we become master. It is slightly awkward to do substring replacement in RouterOS but we might be able to make a helper function that does this arbitrarily for any configuration that has a disabled and a comment property.
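
A rough sketch of the comment-tagging idea, under stated assumptions: the marker text ` HA_AUTO_DISABLED` is hypothetical and not something ha-mikrotik uses today, and the snippet is meant to run as a script (not line by line in the terminal) so that the `:local` variables stay in scope. It only covers ethernet interfaces; a generic helper would need the same logic for any menu with `disabled` and `comment` properties.

```
:global haInterface

# Hypothetical marker appended to the comment of every interface we disable
:local tag " HA_AUTO_DISABLED"

# Going standby: tag and disable only interfaces that were found enabled
:foreach i in=[/interface ethernet find where disabled=no and default-name!="$haInterface"] do={
  :local c [/interface ethernet get $i comment]
  /interface ethernet set $i comment=($c . $tag) disabled=yes
}

# Becoming master: re-enable only tagged interfaces and strip the marker
:foreach i in=[/interface ethernet find where comment~"HA_AUTO_DISABLED\$"] do={
  :local c [/interface ethernet get $i comment]
  /interface ethernet set $i comment=[:pick $c 0 ([:len $c] - [:len $tag])] disabled=no
}
```

Interfaces the user disabled on purpose never receive the marker, so they would stay disabled after a role switch.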

nathanfaber commented 4 years ago

> Reading the latest changelog for v7, Mikrotik seems to have added connection tracking synchronization support to VRRP setups.
>
> Maybe it's worth putting this project topic into the v7 beta forum to get more official support.

That does look interesting. To be honest, Mikrotik has offered minimal assistance when we have uncovered bugs that broke ha-mikrotik. At this point, I've just worked around whatever they deliver/break. I'm guessing they will eventually make this project entirely obsolete with some eventual v7 (v8?) enhancement, which is fine by me.

nathanfaber commented 4 years ago

I don't have an 8S+ to test with but I have a 1S+ and 2S+, I'm not seeing disabling taking a ton of time but I am seeing enabling taking something like what you describe. Is this also what you see or are you seeing disabling taking a few seconds as well?

With 8 interfaces, I can definitely see how you would see extra time during a role switch. Do you want to try to integrate the below parallel enable/disable? It will be far more obvious how it behaves on your 8S+ vs. mine.

[admin@X_Inet_HA_A_STANDBY] > [:put [/system clock get time]]; /interface set 0 disabled=yes; [:put [/system clock get time]];   
13:39:46
13:39:46
[admin@X_Inet_HA_A_STANDBY] > [:put [/system clock get time]]; /interface set 0 disabled=no;  [:put [/system clock get time]];
13:39:49
13:39:51
[admin@X_Inet_HA_A_STANDBY] > [:put [/system clock get time]]; /interface set 0 disabled=yes; [:put [/system clock get time]];
13:39:53
13:39:53
[admin@X_Inet_HA_A_STANDBY] > 

Basic test code for parallel disable (or enable):

:foreach k in=[/interface ethernet find] do={:local name [/interface ethernet get $k name]; :execute "/interface ethernet set [find name=\"$name\"] disabled=no"}

:foreach k in=[/interface ethernet find] do={:local name [/interface ethernet get $k name]; :execute "/interface ethernet set [find name=\"$name\"] disabled=yes"}
fflo commented 4 years ago

Thanks for your hint.

It does not seem to work running the interface commands in parallel, or at least there is no difference in execute timing:

[fflo@...CCR01_HA_A_STANDBY] > [:put [/system clock get time]]; :foreach k in=[/interface ethernet find where default-name!="$haInterface" and comment!="HA_RESCUE"] do={:local name [/interface ethernet get $k name]; :execute "/interface ethernet set [find name=\"$name\"] disabled=no"}; [:put [/system clock get time]];
11:30:26
11:30:34
[fflo@...CCR01_HA_A_STANDBY] > [:put [/system clock get time]]; :foreach k in=[/interface ethernet find where default-name!="$haInterface" and comment!="HA_RESCUE"] do={:local name [/interface ethernet get $k name]; :execute "/interface ethernet set [find name=\"$name\"] disabled=yes"}; [:put [/system clock get time]];
11:30:34
11:30:36
[fflo@...CCR01_HA_A_STANDBY] >
[fflo@...CCR01_HA_A_STANDBY] > [:put [/system clock get time]]; /interface ethernet enable [find where default-name!="$haInterface" and comment!="HA_RESCUE"]; [:put [/system clock get time]];
11:30:56
11:31:05
[fflo@...CCR01_HA_A_STANDBY] > [:put [/system clock get time]]; /interface ethernet disable [find where default-name!="$haInterface" and comment!="HA_RESCUE"]; [:put [/system clock get time]];
11:31:05
11:31:06
[fflo@...CCR01_HA_A_STANDBY] >
nathanfaber commented 4 years ago

I can confirm what you are seeing with this code. It seems like the first execute is going to the "background" and then the second one blocks.

Can you try this? I changed it to the brace syntax and added a delay, the delay seems to make a difference to have it going into the "background" (I don't get it). It seems to push these all to the background for me but I haven't confirmed if the interfaces are actually coming up faster. It also prints each name and then the background jobs at the end.

Stand by... don't run it. I don't think $name is propagating correctly in this code.
nathanfaber commented 4 years ago

Follow up to above comment with code that propagates $name correctly (local variables don't appear to propagate inside another :execute block): In my testing...this is backgrounding it and the foreground returns faster but based on what I am seeing in the logs, I don't think the interfaces come up any faster, I think there is a global lock. It will be more obvious with your 8x though.

[:put [/system clock get time]]; /interface ethernet disable [find where default-name!="$haInterface" and comment!="HA_RESCUE"]; [:put [/system clock get time]];
[:put [/system clock get time]]; :foreach k in=[/interface ethernet find where default-name!="$haInterface" and comment!="HA_RESCUE"] do={:local name [/interface ethernet get $k name]; :put $name; :execute "/delay 0.1; /log warning \"start: $name\"; /interface ethernet enable [find name=\"$name\"]; /log warning \"end: $name\""}; [:put [/system clock get time]]; /system script job print
fflo commented 4 years ago

Yes, there is a global lock

01:40:53 system,info,account user fflo logged in from 00:00:5E:00:01:01 via mac-telnet
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 system,info device changed by fflo
01:43:31 script,warning start: sfp-sfpplus1
01:43:31 script,warning start: sfp-sfpplus2
01:43:31 script,warning start: sfp-sfpplus3
01:43:31 script,warning start: sfp-sfpplus5
01:43:31 script,warning start: sfp-sfpplus4
01:43:31 script,warning start: sfp-sfpplus7
01:43:31 script,warning start: sfp-sfpplus6
01:43:32 system,info device changed by fflo
01:43:32 script,warning end: sfp-sfpplus1
01:43:34 interface,info sfp-sfpplus2 link down
01:43:34 system,info device changed by fflo
01:43:36 interface,info sfp-sfpplus3 link down
01:43:36 script,warning end: sfp-sfpplus2
01:43:36 system,info device changed by fflo
01:43:39 interface,info sfp-sfpplus4 link down
01:43:39 system,info device changed by fflo
01:43:39 script,warning end: sfp-sfpplus3
01:43:39 script,warning end: sfp-sfpplus5
01:43:39 system,info device changed by fflo
01:43:39 script,warning end: sfp-sfpplus4
01:43:39 system,info device changed by fflo
01:43:39 script,warning end: sfp-sfpplus7
01:43:40 system,info device changed by fflo
01:43:40 script,warning end: sfp-sfpplus6
01:43:40 interface,info sfp-sfpplus2 link up (speed 10G, full duplex)
01:43:40 interface,info sfp-sfpplus3 link up (speed 10G, full duplex)
01:43:40 interface,info sfp-sfpplus4 link up (speed 10G, full duplex)

[fflo@...CCR01_HA_B_STANDBY] /interface ethernet>
nathanfaber commented 4 years ago

Yea, this is a bummer. The convergence could be sped up by only bringing up interfaces that are needed. I guess this would be about a 2x speed up for you on the enabling part (only enabling the right 4). I can see why you are interested in finding a solution to keep the interface link up.

nathanfaber commented 4 years ago

Also, just speaking out loud. You have 4 links because it is 2 to an A switch and 2 to a B switch, is that right?

In theory, if we had a nice way, you could bring up 1 link on A and 1 link on B and then bring up the other 2 after everything else has been enabled. LACP should be happy to transparently bring up the other set of links a bit later. This would be 4x speed up.

Trying to keep the interfaces enabled is not lost on me though, just brainstorming.

fflo commented 4 years ago

The configured bonding setup should in theory already start working with one sfp+ interface connected, because only "1 of 4" is set up as a requirement for the bonding interface to start operation.

But due to the nature of IRF link aggregation, you are right that both switches should get a dedicated interface up and running as soon as possible, because inner IRF routing is suboptimal.

The overall convergence time takes some more seconds because dynamic routing and route announcements of the BGP4 sessions need some time to reset and re-establish on $HASwitchRole or device outage.

To optimize it I have adapted the BGP4 timers to 15s hold-time and 5s keep-alive + configured BGP-BFD.
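
As an illustration of the timer change described above, a minimal sketch assuming the RouterOS v6 `/routing bgp peer` menu and that the same values should apply to every configured peer (in practice the `[find]` would likely be narrowed to specific peers):

```
# 15s hold-time, 5s keepalive, and BFD on all BGP peers (v6 syntax, peer selection is an assumption)
/routing bgp peer set [find] hold-time=15s keepalive-time=5s use-bfd=yes
```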

Not having to wait for enabling the physical member interfaces of a bonding promises a huge performance gain in convergence time.

Do you have a hint on how to find the physical member interfaces of a bonding (only) and enable these interfaces in ha_onbackup state already?

nathanfaber commented 4 years ago

We may actually be on to something with using :execute to get the background enabling of interfaces. This will allow the foreground to continue to proceed with your current methods of enabling the bond and BGP much sooner than it would be if we block for all 8 interfaces.

If you were to interleave your links (sfp1 -> A1, sfp2 -> B1, sfp3 -> A2, sfp4 -> B2) then this would bring up the two minimum ideal interfaces to the switch pairs as fast as we currently can (while the rest of the on_master script is executing in parallel). It requires that you physically lay it out such that this becomes optimal but it doesn't seem too bad of a trade off.

If you do want to pull the slaves, try something like: [:foreach k in=[/interface bonding get [find name="bond1"] slaves] do={:put $k}]
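
Building on that one-liner, a sketch of enabling only the member ports of a single bond. It assumes the `slaves` property comes back as a list of interface names, as the loop above implies, and reuses the example bond name `bond1`:

```
# Enable only the physical members of bond1, leaving all other ports untouched
:foreach s in=[/interface bonding get [find name="bond1"] slaves] do={
  /interface ethernet enable [find name=$s]
}
```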

nathanfaber commented 4 years ago

Furthermore, since we know about the apparent global lock now, there is no point running it in parallel. We can simply do a single :execute to enable the interfaces with a [find] rather than the individual executes and allow the callbacks to run sooner.

ie: ha_onmaster :execute "/interface ethernet enable [find]" and then using on_master to enable your bonds and BGP.
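
Spelled out, the idea might look roughly like the sketch below. Where exactly these lines would live inside ha-mikrotik is an assumption, and `/routing bgp peer` is the RouterOS v6 menu path; the blanket `[find]` also re-enables bonds and peers that were disabled on purpose, which is the same caveat raised earlier in this thread.

```
# In ha_onmaster: start enabling the physical ports in the background
:execute "/interface ethernet enable [find]"

# In the on_master callback: carry on without waiting for the ports
/interface bonding enable [find]
/routing bgp peer enable [find]
```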

It is all a bit dirty right now but I feel we might not be far from figuring out a way to do it generically so it works for you and everyone else as it does now.