svlsResearch / ha-mikrotik

High availability code for Mikrotik routers
156 stars 45 forks source link

Update ha_startup.script #22

Closed reinistihovs closed 2 years ago

reinistihovs commented 3 years ago

Mikrotik scripting engine still executes this part, the scripting engine ignores the #tag because of the curly braces following it.

Output from HA_boot_log after a crash:

02:40:03 script,warning ha_startup: ERROR ATTEMPTED TO RUN AGAIN!!! 1d01:40:30 2:40:08 interface,info sfp-sfpplus1 link down 02:40:08 system,info device changed by admin 02:40:08 interface,info UPLINK link down 02:40:08 interface,info sfp-sfpplus2 link down 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 interface,info sfp-sfpplus3 link down 02:40:09 system,info device changed by admin 02:40:09 interface,info sfp-sfpplus4 link down 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 interface,info sfp-sfpplus10 link down 02:40:09 interface,info sfp-sfpplus11 link down 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 system,info device changed by admin 02:40:09 interface,info sfp28-1 link down 02:40:10 script,error ha_startup: FAILED - DISABLED ALL INTERFACES 02:40:10 system,info device changed by admin 02:40:10 interface,info sfp28-2 link down

nathanfaber commented 3 years ago

Can you elaborate on what happened here? What platform, what routeros, any details on what got it here?

I don't think that commented code is actually running. I believe this on-error was triggered, in the startup scheduler: https://github.com/svlsResearch/ha-mikrotik/blob/3f3f58bbfa6f7b7ea25019997cdfb4b49d80267d/scripts/ha_startup.script#L52

Looking over this code again, I'm not sure this logic is right for the scheduler being restarted. If the startup runs again because RouterOS crashed some userspace code and restart, it seems this will disable all the interfaces like you are seeing. I'm not actually sure how to successfully recover from this situation and I haven't seen this triggered in a long time.

reinistihovs commented 3 years ago

Hi, Im running on 2x CCR2004-1G-12S+2XS routerOS: 6.47.6

I just found out that you have the same code in the scheduler (on error) to disable all interfaces. So even if routerOS bugs out and launches the task repeatedly, still all ports get disabled.

To fix this, possibly an uptime check is needed here also. I also see that supout.rif is generated at that time, watchdog is enabled with default empty ip field.. but by viewing logs there was no reboot at that time, I only see in log, when i rebooted the router by unplugging power.

nathanfaber commented 3 years ago

Yes, the root of the problem is RouterOS bugging out. It isn't clear to me how to maintain the state of the world when this happens, so it is an awkward problem to deal with.

How long have you been running this platform? Has it been stable on any other versions of RouterOS? How easy has it been to reproduce?

I am going to update a testing pair of mine to 6.47.10 (but on CCR1009-8G-1S-1S+) and see what happens.

reinistihovs commented 3 years ago

This is the olny version that I tested at least for a month its was running stable, only this one bug appeared when configuration got updated with vlans, firewalls and load was introduced. I also implemented commits from: https://github.com/svlsResearch/ha-mikrotik/pull/17 because I also have interface bonding.

also mythicaly both time this happened at night after midnight, but the 2 are not actually the same.

nathanfaber commented 3 years ago

Are you saying there was a bug triggered when you did a large update of vlans/firewalls? Are you able to actively reproduce this problem?

reinistihovs commented 3 years ago

At the moment I removed the code part that disables Interfaces on ha_startup error, I will keep me eyes open.

nathanfaber commented 3 years ago

Hopefully you are using this in a testing setup. When that RouterOS userspace environment fails, you are going to lose all of the global environment that ha-mikrotik uses to keep things in sync. I'm not sure where you will end up in terms of operational behavior - the router will likely work as expected but sync may get a bit wacky. I had a an issue (#9) where I debated forcing a router reboot if this happens, to bring it back to a known state.

Please update as you know more.

reinistihovs commented 3 years ago

Hi, Yes im running in production, as I mentioned I tested this setup over a month in a test environment, had no issues. I made a commit, if the repeadeted bugged task launch is detected, then dont classify it as as :error, just ignore, this way scheduler wont trigger port disabling.

reinistihovs commented 3 years ago

Weirdly, but the issue still didn't repeat, what have changed? 1) I have disabled firewall rule: filter add chain=forward connection-state=established,related action=fasttrack-connection I think this is irrelevant, but maybe you might see some connection here.

2) Changed ICMP flood prevention rules:

add action=jump chain=input comment="Jump for icmp input flow" disabled=no jump-target=ICMP protocol=icmp
add action=jump chain=forward comment="Jump for icmp forward flow" disabled=no jump-target=ICMP protocol=icmp
add action=accept chain=ICMP comment="Echo request - Avoiding Ping Flood, adjust the limit as needed" disabled=no icmp-options=8:0 limit=2,5 protocol=icmp
add action=accept chain=ICMP comment="Echo reply" disabled=no icmp-options=0:0 protocol=icmp
add action=accept chain=ICMP comment="Time Exceeded" disabled=no icmp-options=11:0 protocol=icmp
add action=accept chain=ICMP comment="Destination unreachable" disabled=no icmp-options=3:0-1 protocol=icmp
add action=accept chain=ICMP comment=PMTUD disabled=no icmp-options=3:4 protocol=icmp
add action=drop chain=ICMP comment="Drop to the other ICMPs" disabled=no protocol=icmp
add action=jump chain=output comment="Jump for icmp output" disabled=no jump-target=ICMP protocol=icmp

I didn't have interface filters on those rules, so now I filtered out, that only ICMP traffic from Internet gets regulated. Possibly this has a connection with Mikrotiks built in watchdog feature, these rules limit ICMP, and maybe somehow watchdog gets triggered, what could possibly trigger scheduler to start startup scripts.

3) I updated the HA-startup script and Scheduler task action, Possibly by updating the task, the scheduler bug fixed it self, and will reappear in future :)

reinistihovs commented 3 years ago

I have done more digging and found this thread: https://forum.mikrotik.com/viewtopic.php?t=122872 both times when this occured there was an autosupout.rif file present, There are no clear indications from Mikrotik, when and why this happens, but as stated in that thread, it even could be a crashed winbox session, or some hickup in kernel. But its hard to know which one was first, autosupout.rif or buggy startup task triggered what actually caused autosupout.rif to be generated.

nathanfaber commented 3 years ago

I haven't accidentally gotten the RouterOS console to crash in years, I was able to get it to happen quite a few times during the primary development of ha-mikrotik.

That being said, I just hacked up this to crash the console. It may not crash in exactly the same way that you but it does cause the console to crash and be restarted, which casues the startup tasks to be run again.

Please don't run this in a production environment and make sure you have some out of band access or physical access to reboot it.

[admin@XXX_Inet_HA_A_STANDBY] > { :local x "FOOO"; :do { :set x "$x$x"; } while (true); }    

Console has crashed; please log in again.

After logging in again:

10:12:00 script,warning ha_startup: ERROR ATTEMPTED TO RUN AGAIN!!!  00:04:05 
10:12:00 system,info,account user admin logged out from 169.254.23.2 via ssh 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 system,info device changed by admin 
10:12:05 script,error ha_startup: FAILED - DISABLED ALL INTERFACES 

> /environment print
haAllowBootstrap=[:nothing]
haIdentity=[:nothing]
haInterface=[:nothing]
haNetmaskBits=[:nothing]
haStartupHasRun=[:nothing]
isMaster=true
maxDistance=50
minDistance=5
stepDistance=5
uptime=00:04:05

I will play around when I have some more time to see if there is any chance of recovery when this happens.