svlsResearch / ha-mikrotik

High availability code for Mikrotik routers
155 stars 44 forks source link

Both Router Assume Role of A_Active #1

Closed ibanezbass closed 7 years ago

ibanezbass commented 7 years ago

I'm running the latest 6.40.1 code. It appears, after upgrading, both of my routers assume the role of A_Active. I've tried wiping the secondary and doing a fresh $HAInstall, but it fails after a few reboots and thinks it is the A router again.

nathanfaber commented 7 years ago

What do they each think of the vrrp state? Do they see each other at all? I am only running/testing up to 6.39 right now as the 6.40 has some fundamental bridging changes that have had me avoid it for now. I can potentially give 6.40 a poke on a test setup though.

nathanfaber commented 7 years ago

Can you tell me a little bit more about your setup? Was it working correctly with a prior version?

I just upgraded a site to 6.40.1 and from a quick examination - it is behaving as I would expect. I was able to $HASyncStandby along with reboot and have them switch roles. No sign of active/active.

nathanfaber commented 7 years ago

I take it back - there is an issue that has popped up after cycling a few times. ha_startup seems to be failing. Can you look at /log print on both units and see if you see anything like this:

15:01:43 script,warning ha_startup: 2 15:01:43 script,error ha_startup: FAILED - DISABLED ALL INTERFACES

I will debug it some more.

ibanezbass commented 7 years ago

Sorry about the delay. Yes, I did see the ha_startup failure on mine. It appears it will behave normally sometimes, and then fail others. I'm glad you are also seeing the issue, though. Thanks for the quick responses!!

ibanezbass commented 7 years ago

On mine, I've got two 2 SFP+ interfaces bonded in 802.3ad mode on each router, and then port 8 as the VRRP interface between them. When the startup script fails, it allows both to become active and it leaves the routers in a semi-up/down state until I unplug one, or disable the interfaces on the upstream switch for one.

nathanfaber commented 7 years ago

I've only got it to happen once and it appears to be another variation of the else condition in ha_startup. From the logging messages, it seems like this line of code is failing: :local mac [[/interface ethernet get [find default-name="$haInterface"] orig-mac-address]] The other odd part is that the failsafe to disable all of the interfaces did not work, which is a similar command: /interface ethernet disable [find default-name!="$haInterface"];

Original else case when I noticed that the CCR would come up with odd MACs:

  #This is a very strange bug...maybe just in the CCR? Sometimes when the unit comes up, ethernet interfaces sometimes have swapped positions?
  #A reboot clears this error - it is very odd, I don't know if it is a race condition in hardware initialization or something.
  #I'm not sure this covers ALL cases, since it only checks the MAC of the one interface our HA runs over. It might not right now :(
  #Do we need to track all MACs to make sure they are in the right order? This seems like a general problem with the platform but I don't understand the extent of it.
  #Am I causing this?
  :global haIdentity "UKNOWN"
  /log warning "I AM UNKNOWN - WRONG MAC"
  /delay 15
  :execute "/system reboot"
  /delay 1000

Are you on a CCR? How many times have you been able to get it to happen and do you have a pattern that triggers it or is it random like I am seeing?

ibanezbass commented 7 years ago

I am on CCR. CCR1036-8G-2S+ to be exact. It appears to be random. If I reboot a couple of times, it will be hit or miss whether it works.

nathanfaber commented 7 years ago

Did you happen to check the logs in the cases that you saw it happen? Do you think it hit "ha_startup: FAILED - DISABLED ALL INTERFACES" each time?

ibanezbass commented 7 years ago

I know for certain that I've seen the ha_startup failed every time I've looked at the log today. I don't know if it was every time that it failed to work though, as I wasn't specifically looking every time it failed. On my current master, one of the first log lines is the ha_startup failed line.

nathanfaber commented 7 years ago

It sounds like your current master is not in a good state then. Can you see if /interface vrrp print shows any VRRP interfaces?

ibanezbass commented 7 years ago

[admin@MikroTik_HA_A_ACTIVE] > interface vrrp print Flags: X - disabled, I - invalid, R - running, M - master, B - backup

NAME INTER... MAC-ADDRESS VRI PRI INTERVAL V V3..

0 RM HA_VRRP ether8 00:00:5E:00:01:01 1 100 1s 3 ipv4

ibanezbass commented 7 years ago

That pasted weird.... But, it looks like the VRRP interface state is good on the Master.

nathanfaber commented 7 years ago

Are there any other ha_startup lines on that master that are still visible? It outputs a number to give a rough sign of where it is in the code before a failure.

ibanezbass commented 7 years ago

08:37:50 script,warning ha_startup: START 08:37:50 script,warning ha_startup: 1 08:37:50 script,warning ha_startup: 2 08:37:50 script,error ha_startup: FAILED - DISABLED ALL INTERFACES

ibanezbass commented 7 years ago

This is my standby, currently. It's not working, and I have no HA VRRP interface.

[admin@MikroTik_HA_B_STANDBY] > interface vrrp print Flags: X - disabled, I - invalid, R - running, M - master, B - backup

NAME INTER... MAC-ADDRESS VRI PRI INTERVAL V V3..

[admin@MikroTik_HA_B_STANDBY] >

ibanezbass commented 7 years ago

After rebooting...

Flags: X - disabled, I - invalid, R - running, M - master, B - backup

NAME INTERFACE MAC-ADDRESS VRID PRIORITY INTERVAL VERSION V3-PROTOCOL

0 B HA_VRRP ether8 00:00:5E:00:01:01 1 100 1s 3 ipv4

nathanfaber commented 7 years ago

I just captured it happening along with additional debugging. It seems there is a new race condition for startup scripts and the initialization of the Ethernet interfaces. It looks like if ha_startup runs too soon, there are no Ethernet interfaces, which is why all of these commands are failing. I am going to take a look at adding a check that spins until the interfaces are there.

This is what /interface print detail looks like just after ha_startup: 2 and before the failure, on this instance:

Flags: D - dynamic, X - disabled, R - running, S - slave 0 name="Inside" type="vlan" mtu=1500 mac-address=E4:8D:8C:XX:XX:XX link-downs=0

1 name="Internet" type="vlan" mtu=1500 mac-address=E4:8D:8C:XX:XX:XX link-downs=0

Note that the lack of physical interfaces. They show up shortly after but not before ha_startup bails out.

ibanezbass commented 7 years ago

Interesting. That makes sense now. Seems like a bug in MikroTik's new code to run scripts before the hardware is fully initialized.

nathanfaber commented 7 years ago

Marking it as a bug even though it is really a problem with RouterOS/Mikrotik. I have had limited success getting them to acknowledge problems like this so I will work around it and make the failsafe more robust.

ibanezbass commented 7 years ago

Thanks for all your help!

nathanfaber commented 7 years ago

If you want to see what it looks like on your unit: Get both units happy such that you can $HASyncStandby from the master. Edit ha_startup script on the master and add this line: :execute "/interface print detail" file="HA_boot_interface_print.txt" after this line: /log warning "ha_startup: 2"

Run $HASyncStandby so it propagates to the standby. Reboot the standby until you get the ha_startup failure and then run: [:put [/file get HA_boot_interface_print.txt contents ]]

I will be adding that additional interface print detail to the next version as well.

nathanfaber commented 7 years ago

Giving them a heads up: https://forum.mikrotik.com/viewtopic.php?f=1&t=124415

nathanfaber commented 7 years ago

I have put a fix in 0ae62da and this is what it looks like when this was failing (it takes ~1s for the hardware to initialize after ha_startup was first executed). You will see these "delaying for hardware" if it would have been a failed boot in the prior ha-mikrotik version. Some boots come up without any delay.

I am a tad concerned that the hardware initialization may not be entirely atomic (ie: ether5 is ready but ether8 is not). I'm not sure if a clean way to detect that all of the interfaces on the platform are ready, can we get the max ether interface programmatically? I will investigate this a bit more but I think this will likely just fix the issue.

Please let me know if this resolves it for you.

dec/31/1969 20:00:05 system,info router rebooted aug/09 22:51:08 script,warning ha_startup: START aug/09 22:51:08 script,warning ha_startup: 0.1 aug/09 22:51:08 script,warning ha_startup: 0.2 aug/09 22:51:08 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:08 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 script,error ha_startup: delaying for hardware...cant find ether5 aug/09 22:51:09 interface,info Inside link up aug/09 22:51:09 interface,info Internet link up aug/09 22:51:09 script,warning ha_startup: 0.3 aug/09 22:51:09 vrrp,info HA_VRRP now BACKUP

ibanezbass commented 7 years ago

I've uploaded the code to my stack and rebooted the secondary. I was able to see the delay, waiting for hardware. Tomorrow, when I'm back on site with the equipment, I will kick on the interfaces on the upstream switches and see what happens on a reboot. Thank you so much for the quick response to this issue! Your script has been a great help in my environment.

ibanezbass commented 7 years ago

Yes, it is working perfectly now. Now I just need to change my spanning-tree protocol to reduce the outage time. 👍 :)

nathanfaber commented 7 years ago

The fix looks good after 200+ cycles.

Future notes.... Programmatically rebooted the standby every 90 seconds for a few hours: 202 total reboots 175 came up without delay 27 had "delaying for hardware" ranging from 2x to 10x @ 100ms. 0 failed "FAILED - DISABLED ALL INTERFACES"

CCR1009-8G-1S-1S+ RouterOS 6.40.1 Firmware 3.39

ha_rebootspin scheduled @ startup:

:delay 90
:local badCount [:len [/log find where message~"ha_startup.*(FAILED)"]]
:local goodCount [:len [/log find where message~"ha_startup.*(DONE)"]]
:local delayCount [:len [/log find where message~"ha_startup.*(delaying)"]]
/log info "ha_startup: ha_rebootspin debug badCount=$badCount goodCount=$goodCount delayCount=$delayCount"
:global isMaster
:global haAddressOther
:execute "/log print" file="HA_boot_log.txt"
:if ($isMaster) do {
   #Just because we are master doesnt mean we really are, we could have a failed startup but it is too risky to do anything else.
   :put "I am master - do nothing"
} else {
   :if ($goodCount = 1) do {
      :put "REBOOT"
      /system reboot
   } else {
      :put "STAY"
      #Disable all interfaces if they havent already, so the primary doesnt sneak in and we lose the failed state.
      /interface ethernet disable [find]
   }
}