svlsResearch / ha-mikrotik

High availability code for Mikrotik routers

RouterOS 6.43.8 instability (testing to get 6.43.13 functioning) #7

Closed nathanfaber closed 5 years ago

nathanfaber commented 5 years ago

There seems to be an issue with 6.43.8. Pairs have been unstable and rebooting periodically. Occasionally, I have had issues accessing the RouterOS console (web, ssh, serial). It accepts the login and then stalls and never gives the prompt (last line "/command Use command at the base level").

Unclear what is happening. I do not recommend running 6.43.8 outside of testing.

nathanfaber commented 5 years ago

Also seeing:

| Topics | snmp, warning |
| -- | -- |
| Message | timeout while waiting for program 48 |

nathanfaber commented 5 years ago

6.43.8 seems to be completely unstable with ha-mikrotik. I am now testing with 6.42.11 (long-term), which appears to be stable so far.  

the-nicolas commented 5 years ago

What about newer versions like 6.44.1? Any experience so far?

nathanfaber commented 5 years ago

> What about newer versions like 6.44.1? Any experience so far?

I have only tested up to 6.42.11. I will try to test a newer version in the next week or two and report back.

nathanfaber commented 5 years ago

I have deployed 6.43.13 (long-term) to a pair and I will report back if it appears stable.

nathanfaber commented 5 years ago

6.43.13 worked well overnight and has now been deployed to another production pair.

I am also doing a reboot sync/check/loop and currently at 28 iterations. So far, so good.

Reboot loop code:

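# Stress loop: push to the standby, wait, sync, and report progress on each iteration.
:for pushCount from=1 to=10000 do={
   :put "$pushCount pushing"
   $HAPushStandby
   :put "$pushCount push done"
   # 120-second pause between the push and the sync.
   :delay 120
   $HASyncStandby
   :put "$pushCount sync ok"
   :delay 10
}

nathanfaber commented 5 years ago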

@the-nicolas Are you actually interested in 6.44.1 or will the 6.43.13 that I am testing suffice? I don't have any features that I am seeking in 6.44 right now so I'd stay on 6.43 for testing unless there is demand for the other branch.

nathanfaber commented 5 years ago

Using 6.43.13, I believe I have caught a scheduler event with start-time=startup firing right after it was added, post-boot, when it should not execute until the next reboot. This is bad, as I rely on this event firing only after a reboot; it causes a role switch when there shouldn't be one. I am going to post on the Mikrotik forums about this to see if they can look into it. I am continuing to try to reproduce it.

This is one of the problems I had with 6.43.8, hopefully someone at Mikrotik can confirm why this is happening.

nathanfaber commented 5 years ago

Added a patch, ecf52e82215a247eede1fae090b237ab8f43e573, to prevent ha_startup from running twice. I am continuing to test. It is not recommended to run 6.43.13 with the existing releases; you will need to wait for the next release or run the current master.
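
For readers unfamiliar with this kind of protection, here is a minimal sketch of a run-once guard in RouterOS script. It is not the code from the patch above: the global name haStartupHasRun is hypothetical, and the sketch assumes a global flag is an acceptable marker for "already ran since boot".

```
# Hypothetical flag; declaring it without a value reuses any existing global.
:global haStartupHasRun
:if ($haStartupHasRun = true) do={
   # A second invocation since boot: log it and stop instead of re-initializing.
   :log error "ERROR ATTEMPTED TO RUN AGAIN"
   :error "startup already ran since boot"
}
:set haStartupHasRun true
# ... startup work would continue here ...
```

Globals are cleared on reboot, so the first run after each boot passes the check and any later run is rejected.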

nathanfaber commented 5 years ago

It looks like the routers are stable once booted and properly initialized, but there is another rare race with interfaces not showing up during initialization. It looks like we occasionally fail to see the interface at: https://github.com/svlsResearch/ha-mikrotik/blob/ecf52e82215a247eede1fae090b237ab8f43e573/scripts/ha_startup.script#L61

I suspect it vanishes briefly during the enable a few lines earlier and then comes back. I have caught the standby in this state 2 times during 100 or so reboots, so it isn't that easy to run into.

I am doing a reboot loop to try to catch it with some additional logging but I suspect the fix will be similar to what I did earlier in the initialization (wait for it to show up again): https://github.com/svlsResearch/ha-mikrotik/blob/ecf52e82215a247eede1fae090b237ab8f43e573/scripts/ha_startup.script#L19-L22
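
A rough sketch of the "wait for it to show up again" idea, for illustration only. The interface name ether8 and the 30-attempt budget are assumptions, not values taken from ha_startup.script.

```
# Hypothetical interface name and retry budget.
:local ifName "ether8"
:local tries 0
# Poll once per second until the interface is listed or the budget runs out.
:while (([:len [/interface find where name=$ifName]] = 0) && ($tries < 30)) do={
   :delay 1
   :set tries ($tries + 1)
}
:if ([:len [/interface find where name=$ifName]] = 0) do={
   :log error "interface $ifName did not appear after $tries attempts"
}
```

Capping the number of attempts keeps the startup from hanging forever if the interface never returns.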

nathanfaber commented 5 years ago

I am spinning 3 pairs on $HALoopPushStandby using a8e378fd34949baff275169e86018caedefbbd58. One pair is on 6.42.11 and the other two are on 6.43.13.

If everything stays stable over the next 24 hours with these, I will stamp a release.

nathanfaber commented 5 years ago

Ended up reworking the initialization code into a retry loop. Tested it on 3 pairs for 12 hours. Now testing 64a7c8a on 5 pairs. Feeling pretty good about the current build for final release but continuing to test.

5 pairs:
2 x 6.42.11
3 x 6.43.13

nathanfaber commented 5 years ago

Extending pairs test to include 6.44.1:

5 pairs:
2 x 6.42.11
2 x 6.43.13
1 x 6.44.1

nathanfaber commented 5 years ago

Completed over 100 standby cycles on each of those 5 pairs without issue. Now testing general stability without any manual forced pushes. Assuming they all survive the next 24h, I will stamp the release.

nathanfaber commented 5 years ago

Release is stamped for rc1. Assuming no other issues, this will be the final release of 0.6. If you can test it, please do. https://github.com/svlsResearch/ha-mikrotik/releases/tag/v0.6rc1

nathanfaber commented 5 years ago

Do not proceed with the upgrade. There is an issue after ~24 hours of runtime with the new RouterOS that I am trying to debug.

The problem is with the newer RouterOS versions; the old versions still appear stable with the new ha-mikrotik.

nathanfaber commented 5 years ago

After around 16 hours of uptime (16:23-16:33) I am seeing "ERROR ATTEMPTED TO RUN AGAIN" protection logging on all pairs running 6.43.13 and 6.44.1 (6.42.11 continues to be fine and stable). Somehow, RouterOS is deciding to run the startup scheduler events that should be running on next boot.

Forgot to say...after this happens, we lose the entire environment (/environment print), which is very bad.

I suspect the cause is another component that I run on RouterOS that isn't part of ha-mikrotik, so it may not be an issue for others, but I want to confirm this before I let it go wild.

nathanfaber commented 5 years ago

99% convinced that the environment loss issue is due to another script that fires every 1m on my systems and checks the health of various netwatches. The script has not been completing, and there is no check to see if another instance is running, so they accumulate in running jobs. There were over 200 of them running on the routers with the lost environment that I was able to observe. I think the RouterOS scripting/interpreter process is crashing due to out of memory and restarting, causing the environment to be lost and the startup to happen again.
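
A sketch of the kind of overlap check that script was missing, placed at the top of the scheduled script. The script name netwatch-health is hypothetical, and the sketch assumes the running job is listed under /system script job with that script name.

```
# "netwatch-health" is a hypothetical script name.
# The current run counts as one job, so more than one means an earlier run is still active.
:if ([:len [/system script job find where script="netwatch-health"]] > 1) do={
   :log warning "netwatch-health: previous run still active, skipping"
   :error "already running"
}
# ... health checks would follow here ...
```

With a guard like this, a hung run blocks later runs instead of letting jobs pile up every minute.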

This component has nothing to do with ha-mikrotik and has some other issue with 6.43.13/6.44.1.

Based on this, I still believe the current ha-mikrotik is functional and stable with all versions but until I can do some more testing with this other component disabled, I am going to hold off.

nathanfaber commented 5 years ago

Further confirmation of the memory exhaustion theory: over the last 5 days since the RouterOS upgrade, memory has been slowly ramping up. I have disabled the script that I believe is responsible to see if memory stays stable.

image

nathanfaber commented 5 years ago

The issue with my other script has been tracked down to:

:put [/ip route check 1.2.3.4 without-paging as-value]

On 6.42.11, this returns immediately, as if "once" had been added. On 6.43.13 and 6.44.1, it never returns. The fix is simple: add "once".

:put [/ip route check 1.2.3.4 without-paging as-value once]

So this other script was definitely continuing to spawn until something went wrong with the RouterOS interpreter.

Again, this problem had nothing to do with ha-mikrotik, but the script runs on machines where I also run ha-mikrotik, which caused an overall problem. Testing continues; I still expect to stamp the current master for release without change.

nathanfaber commented 5 years ago

Crossed the 16h mark on all the pairs, they look much better. Release will be stamped on Monday.

image

nathanfaber commented 5 years ago

Everything remains stable. Stamping the release and closing this issue.