Open mikliapko opened 3 months ago
@fruch I suppose the ideal solution would be to split this one node, which hosts both manager-server and the monitor, into two separate nodes. But how much work would that require, and who could take care of it? Perhaps you have some temporary workaround to unblock the Manager upgrade tests?
cc: @rayakurl
Well, yes, we should consider separating them for a whole bunch of reasons; this is one small example.
But this one still points to an issue in how we operate the monitoring stack. We should move its start/stop into a systemd configuration that can survive a reboot,
or a docker setup that can survive a reboot.
Either of those can fix this issue.
Also, one can add code to the manager upgrade code to check whether monitoring is up after the upgrade, and spin it back up if not.
There are multiple ways to address this issue.
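A minimal sketch of the "check after upgrade and spin it back up" idea, assuming the test code can supply two callables: one that probes the monitor and one that starts it. The names `ensure_monitoring_up`, `is_up` and `start` are hypothetical and not taken from the test framework.

```python
import time
from typing import Callable


def ensure_monitoring_up(
    is_up: Callable[[], bool],
    start: Callable[[], None],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once is_up() succeeds, calling start() after each failed check."""
    for _ in range(attempts):
        if is_up():
            return True
        # Monitoring is down (for example, docker was restarted by the upgrade),
        # so try to bring it back and give it some time to come up.
        start()
        sleep(delay_s)
    # Final check after the last restart attempt.
    return is_up()
```

In a real run, `is_up` might probe a health endpoint of the monitoring stack and `start` might re-run the script that launches it; both are assumptions here. This only restores the monitor after the fact, so it does not remove the downtime window during the upgrade itself.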
> also one can add code to the manager upgrade code, to check if monitoring is up after upgrade, and spin it back up if not.
With that approach there is still a chance that something is sent to the monitor while it is down during the upgrade. We should avoid any downtime for it.
Alright, I see that the upgrade flow we use in the test is not the same as the one we recommend in the Scylla Manager upgrade documentation.
apt-get dist-upgrade scylla-manager-server ...
apt-get install scylla-manager-server ...
So, I suppose we need to update the upgrade flow in the test to match the documentation. That should fix the issue described above.
@amnonh
Do we have a way to run/configure the monitoring stack so that it survives a docker service upgrade, i.e. so that it spins back up when the upgrade is done?
Have you tried using the --auto-restart command line option to start-all.sh? It may be enough.
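One way to verify whether the suggestion took effect is to look at each container's Docker restart policy. The sketch below assumes that the option results in a restart policy being set on the monitoring containers, which is not confirmed in this thread. The function names are hypothetical, and the container names in the usage note are only examples.

```python
import subprocess
from typing import Callable, Iterable, List

# Policies under which Docker restarts a container when the daemon comes back.
# "on-failure" is treated conservatively as not guaranteed to do so.
SURVIVING_POLICIES = {"always", "unless-stopped"}


def docker_inspect_policy(container: str) -> str:
    """Ask the docker CLI for the restart policy name of one container."""
    result = subprocess.run(
        ["docker", "inspect", "--format",
         "{{.HostConfig.RestartPolicy.Name}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def containers_not_surviving_restart(
    containers: Iterable[str],
    get_policy: Callable[[str], str] = docker_inspect_policy,
) -> List[str]:
    """Return the containers whose restart policy will not bring them back."""
    return [name for name in containers
            if get_policy(name) not in SURVIVING_POLICIES]
```

For example, `containers_not_surviving_restart(["aprom", "agraf"])` would list whichever of those containers would stay down after the docker service restarts; an empty list suggests the stack should come back on its own.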
The problem:
scylla-manager-server and the monitoring stack run on the same node.
When we upgrade the running Manager, its dependencies are upgraded as well, and docker is among them:
Since docker restarts after the upgrade, the monitoring stack becomes unavailable:
Impact: nothing can be sent to the monitor once the upgrade has happened, which leads to test failures. Example of job.
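To catch this before it happens, a test could inspect a simulated package run and see whether docker is among the packages that would be upgraded. This is a rough sketch: it assumes the text comes from an apt-get simulation (the `-s` option), which prints one `Inst <package> ...` line per package to be installed or upgraded. The function name and the package-name prefixes are assumptions.

```python
import re
from typing import List

# A simulated apt-get run prints "Inst <package> ..." for each package
# that would be installed or upgraded.
_INST_LINE = re.compile(r"^Inst\s+(\S+)")

# Assumed package-name prefixes for the Docker engine; adjust per distribution.
DOCKER_PREFIXES = ("docker", "containerd")


def docker_packages_in_plan(simulated_output: str) -> List[str]:
    """Return docker-related packages that the simulated run would touch."""
    found = []
    for line in simulated_output.splitlines():
        match = _INST_LINE.match(line.strip())
        if match and match.group(1).startswith(DOCKER_PREFIXES):
            found.append(match.group(1))
    return found
```

A non-empty result means the upgrade is expected to restart docker, and with it the monitoring stack on the same node.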