We have a spare drive available, and it should be possible to hot-add that to the affected host's drive bays, then to attach the resulting logical disk to the affected filesystem on the host, and wait for data replication to complete before finally turning off and removing the faulty drive.
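As a rough illustration only -- the thread doesn't say which storage stack backs the redundant array, and the array/device names below are placeholders -- a hot-swap on a Linux md software-RAID array would look something like this:

```shell
# Hypothetical sketch: /dev/md0 is the array, /dev/sdX the newly inserted
# spare, /dev/sdY the suspect drive.  Adjust for the real topology.
sudo mdadm /dev/md0 --add /dev/sdX                      # register the spare
sudo mdadm /dev/md0 --replace /dev/sdY --with /dev/sdX  # copy data onto it
watch cat /proc/mdstat                                  # wait for the rebuild to finish
sudo mdadm /dev/md0 --remove /dev/sdY                   # then detach the faulty drive
```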
We should also admit that this isn't a procedure that the team has practiced recently, so we'll try to be careful and methodical during the process.
I'll provide some more-detailed error message logs here soon.
`sd 6:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s`
Ok; it turns out that `sdd` is not a data drive in the redundant disk array -- it is in fact the system/boot drive on the server.
That complicates matters, because it has no redundancy and would require downtime to replace.
Therefore I am considering whether this alert is an acceptable forcing-function to attempt to serve traffic from two servers in parallel, and then to switch traffic to the spare while we reinstall and rebuild the current server.
In theory I think this would involve:
...and then much of the above again in reverse once the original primary is ready to rejoin the cluster.
I'm not sure I feel ready to configure PostgreSQL replication as a prerequisite for this process -- that's a lot to learn for the purpose of a one-time maintenance task. Perhaps it would be simpler to put the database into read-only mode -- with the exception of event logging -- and then to recombine the data from both hosts after maintenance is complete.
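As a loose sketch of what the read-only approach could look like in PostgreSQL -- the `eventlog` role name here is a placeholder, not something taken from the deployment:

```shell
sudo -u postgres psql <<'SQL'
-- Make new transactions read-only by default, cluster-wide...
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();
-- ...while leaving a hypothetical event-logging role able to write.
ALTER ROLE eventlog SET default_transaction_read_only = off;
SQL
```

Worth noting that `default_transaction_read_only` is only a default: a client can still issue `SET transaction_read_only = off`, so this is a guard rail rather than an enforcement mechanism.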
That seems reasonable. So, the plan becomes:
- Switching inbound HTTP traffic paths so that the secondary effectively becomes the primary for traffic serving.
Why not simply load-balance? Provided that we are logging to each node individually and recombining the logs later, there should be no requirement for a notion of primary/secondary.
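For what it's worth, a sketch of the no-load-balancer option -- plain DNS round-robin with a short TTL, using placeholder addresses since the real ones aren't relevant here:

```shell
# Illustrative zone entries only (192.0.2.x is the documentation address range):
#
#   reciperadar.com.  300  IN  A  192.0.2.10   ; existing server
#   reciperadar.com.  300  IN  A  192.0.2.20   ; additional server
#
# With a low TTL, either host can be withdrawn quickly once maintenance starts.
dig +short A reciperadar.com   # check what resolvers currently return
```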
Latest refinement of the plan:
Status updates:
- Add an additional server to the production Kubernetes cluster.
  - This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.
- Ensure that the additional server receives some of the load-balanced inbound HTTP traffic.
  - This will require the additional server to be able to handle TLS termination, because we don't currently have a separate HTTP load-balancer.
  - LetsEncrypt does apparently allow for up to 5 duplicate certificate issuances to exist within a 7-day period, so to avoid copying private key material across the network and to multiple systems, I think we could early-enable HTTP (plaintext, without TLS) on the additional host, and issue a custom cert for it that we could revoke after the maintenance is completed (a sketch of this is included below).
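A hedged sketch of how that could be done with certbot -- assuming certbot is the ACME client in use, which the thread doesn't actually state:

```shell
# On the additional host: obtain its own certificate rather than copying key
# material from the existing server.  The standalone authenticator needs port
# 80 reachable on this host; domain names are illustrative.
sudo certbot certonly --standalone -d reciperadar.com -d www.reciperadar.com

# After the maintenance window: revoke and remove that certificate so only the
# original host's copy remains in use.
sudo certbot revoke --cert-path /etc/letsencrypt/live/reciperadar.com/cert.pem --reason superseded
sudo certbot delete --cert-name reciperadar.com
```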
> This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.
Running Kubernetes with a CRI-O runtime within unprivileged LXC containers appears not to be a preconfigured default with the available components, and so some assembly/configuration is required. So far I've applied the following adjustments to make progress (a consolidated sketch of these follows the list below):
- On the host, configure these kernel `sysctl` settings, to match what `kubeadm` expects to be configured during cluster initialization (these can be persisted in a `sysctl.conf` file, or similar, under `/etc/`):
  - `kernel.panic=10`
  - `kernel.panic_on_oops=1`
  - `vm.overcommit_memory=1`
- In the container, ensure that the `/dev/kmsg` device exists, by running `sudo ln -s /dev/console /dev/kmsg` (as with the `sysctl.conf` suggestion above, we should persist this in configuration and/or create the device entry by using the LXC container config instead of a symlink).
- For the container's `libcontainers` infrastructure, apply the workaround described in https://github.com/canonical/lxd/issues/10389 by:
  - installing the `fuse-overlayfs` package
  - setting `storage.options.overlay.mount_program = "/usr/bin/fuse-overlayfs"` in `storage.conf`
  - making the `/dev/fuse` device available to the container.
- In the container's CRI-O configuration, set the following, to work around a `conmon`-to-`systemd` pathway that appears to have difficulty when no session bus is available (`Failed to connect to bus: no medium found`):
  - `crio.runtime.cgroup_manager = "cgroupfs"`
  - `crio.runtime.conmon_cgroup = "pod"`
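Pulling the adjustments above together, a consolidated sketch -- file locations and the container name are common defaults and placeholders rather than confirmed paths for this setup:

```shell
# --- On the LXC host ----------------------------------------------------
# Persist the kubeadm-expected sysctl settings rather than setting them ad hoc.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-kubeadm.conf
kernel.panic=10
kernel.panic_on_oops=1
vm.overcommit_memory=1
EOF
sudo sysctl --system

# Expose /dev/kmsg and /dev/fuse via the container's LXC config instead of a
# symlink ("k8s-node" is a placeholder container name; depending on the LXC
# version a devices-cgroup allowance may also be needed).
cat <<'EOF' | sudo tee -a /var/lib/lxc/k8s-node/config
lxc.mount.entry = /dev/kmsg dev/kmsg none bind,create=file 0 0
lxc.mount.entry = /dev/fuse dev/fuse none bind,create=file 0 0
EOF

# --- Inside the container -----------------------------------------------
# containers/storage: use fuse-overlayfs as the overlay mount helper (assumes
# Debian/Ubuntu packaging and the stock storage.conf, which ships a
# commented-out mount_program line).
sudo apt install fuse-overlayfs
sudo sed -i 's|^#\? *mount_program *=.*|mount_program = "/usr/bin/fuse-overlayfs"|' \
    /etc/containers/storage.conf

# CRI-O: manage cgroups directly and keep conmon out of the systemd/session-bus path.
cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/90-lxc.conf
[crio.runtime]
cgroup_manager = "cgroupfs"
conmon_cgroup = "pod"
EOF
sudo systemctl restart crio
```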
The next problem appears to be that some `libcontainers` code is fussy about process-level `oom` (out-of-memory) configuration. I think we're getting close to having this working, though.
It's not trivial, but is certainly documentable.
> The next problem appears to be that some `libcontainers` code is fussy about process-level `oom` (out-of-memory) configuration. I think we're getting close to having this working, though.

This is achieved/worked-around by configuring the not-entirely-supported `_CRIO_ROOTLESS` environment variable (added to the `/etc/default/crio` environment file in this case).
(This instructs `cri-o` that it should create rootless-compatible containers; `runc` and `crun` should correspondingly not attempt to configure `oom_score_adj` for the processes that run those containers.)
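For completeness, the environment-file change itself is tiny; the `1` value is an assumption -- as far as I can tell the variable just needs to be set:

```shell
# /etc/default/crio is the environment file mentioned above (Debian-style
# packaging); restart the service so the variable takes effect.
echo '_CRIO_ROOTLESS=1' | sudo tee -a /etc/default/crio
sudo systemctl restart crio
```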
> - Add an additional server to the production Kubernetes cluster.
> - Rebuild and redeploy microservices from source on the additional server.
Success: a first search has been performed on the LXC-containerised instance of RecipeRadar on the secondary server.
Strictly speaking, it is an entirely self-contained, separate Kubernetes cluster. This wasn't the original plan, but it is perfectly acceptable at the moment.
Two small additional details had to be handled during the bringup:
- … `api` cluster that had to be re-added.
- A `sysctl` value, `vm.max_map_count=262144`, was configured to allow OpenSearch to start successfully.

Maintenance status:
- … (`tosDEpyfEwTpG5lKRNEx+cmZLgLMseo40EDtx4IOUp53N/ruQwMZVo9H/LLHsHBdiggHxsq5XwwlfHj0PcJgbw==`) matches the SRI hash present in our TXT B record for `reciperadar.com`.
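A rough way to re-check that from the outside, assuming the value is a SHA-512, SRI-style digest of the served index page (the thread doesn't spell out exactly which resource the record covers, and the hostname may need adjusting):

```shell
# Base64-encoded SHA-512 digest of the index page as currently served...
curl -s https://reciperadar.com/ | openssl dgst -sha512 -binary | openssl base64 -A; echo

# ...compared against the published TXT record(s).
dig +short TXT reciperadar.com
```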
Next steps:
All production traffic is now being served by the standby server; assuming the service remains stable, that will soon be powered off and the operating system disk drive will be replaced ready for a fresh installation.
The `blog` service has not been deployed correctly, and outbound network connectivity problems have occurred during attempts to restore it. On balance, it seems worth proceeding with the maintenance anyway, although this is not ideal.
Status update:
The rebuild process included deployment of the `blog` service, so it is confirmed functional again, although the TXT B integrity hash appeared to be stale and has been updated to accommodate the updated index page.
Three tasks remain:
Maintenance is complete, and the faulty system disk drive has been replaced.
In the near future it may make sense to evaluate what would be required to conduct the same style of maintenance process but with PostgreSQL, OpenSearch and Kubernetes nodes truly clustered rather than providing service from each of two independent hosts.
Describe the bug
One of the disks in the production server appears to be emitting some fault messages -- details TBD, but they're the kind of write-delay / ATA timeout errors that often indicate that a drive is beginning to fail.
Fortunately the drive is part of a redundant array of disks, but even so, we should replace it soon to avoid the hassle of a larger system recovery/rebuild.

To Reproduce
N/A

Expected behavior
No drive I/O errors should appear in the system `dmesg` output or in …

Screenshots
I'll provide some more-detailed error message logs here soon.
Edit: update description: the disk in question is not part of a redundant storage array