scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

[SCT] Manager upgrade leads to monitoring stack unavailability #4112

Open mikliapko opened 3 months ago

mikliapko commented 3 months ago

The problem:

The scylla-manager-server and the monitoring stack run on the same node.

When we perform the upgrade of the running Manager, its dependencies get upgraded as well, and docker is among them:

~$ apt-get dist-upgrade --just-print scylla-manager-server

NOTE: This is only a simulation!
      apt-get needs root privileges for real execution.
      Keep also in mind that locking is deactivated,
      so don't depend on the relevance to the real current situation!
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
  linux-aws-6.5-headers-6.5.0-1023 linux-headers-6.5.0-1023-aws linux-image-6.5.0-1023-aws linux-modules-6.5.0-1023-aws ubuntu-pro-client ubuntu-pro-client-l10n
The following packages have been kept back:
  python3-update-manager update-manager-core
The following packages will be upgraded:
  cloud-init containerd.io docker-buildx-plugin docker-ce docker-ce-cli docker-ce-rootless-extras docker-compose-plugin ec2-hibinit-agent git git-man intel-microcode libc-bin libc-dev-bin libc-devtools libc6
  libc6-dev libldap-2.5-0 libldap-common libnetplan0 libpython3.10 libpython3.10-dev libpython3.10-minimal libpython3.10-stdlib libssl3 libtiff5 libtss2-esys-3.0.2-0 libtss2-mu0 libtss2-sys1 libtss2-tcti-cmd0
  libtss2-tcti-device0 libtss2-tcti-mssim0 libtss2-tcti-swtpm0 linux-aws linux-headers-aws linux-image-aws linux-libc-dev netplan.io openssh-client openssh-server openssh-sftp-server openssl python3-idna
  python3-jinja2 python3-zipp python3.10 python3.10-dev python3.10-minimal scylla-manager-client scylla-manager-server snapd tzdata ubuntu-advantage-tools wget xxd
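A pre-flight check (a sketch of an assumed helper, not an existing scylla-manager feature) can detect from the same simulation whether the docker engine would be touched. Here it runs against a saved fragment of the dry-run output above; in practice you would pipe `apt-get dist-upgrade --just-print` into it instead:

```shell
# Assumed helper: grep the apt dry-run output for the docker engine packages,
# since upgrading them restarts docker and takes the monitoring stack down.
sample='The following packages will be upgraded:
  cloud-init containerd.io docker-buildx-plugin docker-ce docker-ce-cli'
if printf '%s\n' "$sample" | grep -qE 'docker-ce|containerd\.io'; then
    echo "docker upgrade detected: the monitoring stack will restart"
fi
```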

Since docker restarts after upgrade, the monitoring stack becomes unavailable:

CONTAINER ID   IMAGE                                   COMMAND                  CREATED       STATUS                        PORTS     NAMES
40e80e6ceabd   grafana/grafana:10.4.1                  "/run.sh"                2 hours ago   Exited (0) 23 minutes ago               agraf
ef15aac87832   grafana/grafana-image-renderer:3.10.0   "dumb-init -- node b…"   2 hours ago   Exited (143) 23 minutes ago             agrafrender
eb26ca5f3fd1   prom/prometheus:v2.51.1                 "/bin/prometheus --w…"   2 hours ago   Exited (0) 23 minutes ago               aprom
2f55da091c07   grafana/promtail:2.9.5                  "/usr/bin/promtail -…"   2 hours ago   Exited (0) 23 minutes ago               promtail
63a518bba2ee   grafana/loki:2.9.5                      "/usr/bin/loki --ing…"   2 hours ago   Exited (0) 23 minutes ago               loki
8947a66210fd   prom/alertmanager:v0.26.0               "/bin/alertmanager -…"   2 hours ago   Exited (0) 23 minutes ago               aalert

Impact: nothing can be sent to the monitor after the upgrade happens, which leads to test failures. Example of a job.

mikliapko commented 3 months ago

@fruch I suppose the ideal solution would be to split this single node running both manager-server and monitor into two separate nodes. But how much work would that require, and who could take care of it? Perhaps you know of some temporary workarounds to unblock the Manager upgrade tests?

cc: @rayakurl

fruch commented 3 months ago

@fruch I suppose the ideal solution would be to split this single node running both manager-server and monitor into two separate nodes. But how much work would that require, and who could take care of it? Perhaps you know of some temporary workarounds to unblock the Manager upgrade tests?

cc: @rayakurl

Well, yes, we should consider separating them for a whole bunch of reasons; this is one small example.

But this one still points to an issue in how we operate the monitoring stack. We should move its start/stop into a systemd configuration that can survive a reboot,

or a docker setup that can survive a reboot.

Either of those would fix this issue.

Alternatively, one can add code to the manager upgrade logic to check whether monitoring is up after the upgrade, and spin it back up if not.

There are multiple ways to address this issue.
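The systemd direction could look roughly like this. A sketch only: the unit name, the install path, and the kill-all.sh stop script are assumptions about how the monitoring stack is deployed, not the actual configuration:

```ini
# /etc/systemd/system/scylla-monitoring.service  (hypothetical unit)
[Unit]
Description=Scylla Monitoring stack (docker containers)
Requires=docker.service
After=docker.service
# PartOf propagates docker.service restarts (e.g. during a package upgrade)
# to this unit, so the stack is brought back up once docker is back.
PartOf=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/scylla-monitoring
ExecStart=/opt/scylla-monitoring/start-all.sh
ExecStop=/opt/scylla-monitoring/kill-all.sh

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload` and `systemctl enable --now scylla-monitoring`, a docker restart would stop the unit and start it again along with docker, instead of leaving the containers in an Exited state.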

mikliapko commented 3 months ago

Alternatively, one can add code to the manager upgrade logic to check whether monitoring is up after the upgrade, and spin it back up if not.

With such an approach there would still be a chance that something is sent to a monitor that is down during the upgrade. We should avoid any downtime for it.
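For reference, the stopgap check could be sketched like this (a hypothetical helper, not part of scylla-manager; the container names are taken from the `docker ps` output above):

```python
import subprocess

# Monitoring container names as they appear in the docker ps listing above.
MONITORING_CONTAINERS = {"agraf", "agrafrender", "aprom", "promtail", "loki", "aalert"}


def exited_monitoring_containers(ps_output: str) -> set:
    """Parse `docker ps -a --format '{{.Names}} {{.Status}}'` output and
    return the monitoring containers that are not currently running."""
    down = set()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, status = parts
        if name in MONITORING_CONTAINERS and not status.startswith("Up"):
            down.add(name)
    return down


def restart_monitoring_if_down():
    """Hypothetical post-upgrade step: restart any monitoring container
    that the docker upgrade left in an Exited state."""
    ps = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}} {{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in exited_monitoring_containers(ps):
        subprocess.run(["docker", "start", name], check=True)
```

This only shrinks the downtime window rather than eliminating it, which is the concern raised above.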

mikliapko commented 1 day ago

Alright, I see that the upgrade flow we use in the test is not the same as the one recommended in the Scylla Manager upgrade documentation.

So, I suppose we need to update the upgrade flow in the test to match the documentation. That should fix the issue described above.
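Presumably the documented direction upgrades only the Manager packages rather than running a blanket dist-upgrade that drags docker along. A sketch of what that might look like (the exact documented steps and unit name may differ; check the linked documentation):

```shell
# Assumed targeted-upgrade flow: touch only the Manager packages so that
# docker and its dependents are left alone and the monitoring stack stays up.
sudo systemctl stop scylla-manager
sudo apt-get update
sudo apt-get install --only-upgrade scylla-manager-server scylla-manager-client
sudo systemctl start scylla-manager
```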

fruch commented 14 hours ago

@amnonh

Do we have a way to run/configure the monitoring stack so that it survives a docker service upgrade? I.e., so that it would spin back up when the upgrade is done?

amnonh commented 8 hours ago

Have you tried using the --auto-restart command line option to start-all.sh? It may be enough.
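Assuming that option sets a docker restart policy on the containers, it could be paired with a quick verification (the inspect command is standard docker; the expected policy value is an assumption):

```shell
# Start the stack with the suggested flag, then confirm a restart policy
# actually landed on one of the containers, e.g. aprom.
./start-all.sh --auto-restart
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' aprom
```

With a policy such as unless-stopped, docker brings the containers back up on its own after the daemon restarts during the upgrade.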