openculinary / infrastructure

This repository documents the steps required to set up a fresh RecipeRadar environment
GNU Affero General Public License v3.0

Alert (resolved): production disk appears to be starting to fail #47

Closed jayaddison closed 3 weeks ago

jayaddison commented 4 weeks ago

Describe the bug: One of the disks in the production server appears to be emitting some fault messages -- details TBD, but they're the kind of write-delay / ATA timeout errors that often indicate that a drive is beginning to fail.

Fortunately the drive is part of a redundant array of disks, but even so, we should replace it soon to avoid the hassle of a larger system recovery/rebuild.

To Reproduce: N/A

Expected behavior: No drive I/O errors should appear in the system dmesg output or in

Screenshots: I'll provide some more-detailed error message logs here soon.

Edit: updated the description -- the disk in question is not part of a redundant storage array.
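
A hedged way to corroborate this kind of report before settling on a replacement plan, assuming smartmontools is available on the host (the device name matches the one identified later in this thread):

```sh
# Check the drive's SMART health summary and attributes (reallocated/pending
# sectors, command timeouts) alongside the kernel error messages.
sudo smartctl -H -A /dev/sdd

# Watch for further kernel-level ATA/I/O errors as they occur.
sudo dmesg --follow | grep -iE 'ata|sdd|i/o error'
```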

jayaddison commented 4 weeks ago

We have a spare drive available, and it should be possible to hot-add that to the affected host's drive bays, then to attach the resulting logical disk to the affected filesystem on the host, and wait for data replication to complete before finally turning off and removing the faulty drive.

That said, this isn't a procedure the team has practiced recently, so we'll aim to be careful and methodical throughout the process.
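
The storage stack isn't spelled out here, so purely as a sketch: if the redundant array were Linux software RAID (md), the hot-add and replication steps described above might look like this (device and array names are illustrative, and this plan was later superseded -- see below):

```sh
# Hot-add the spare drive to the array; resync/replication starts automatically.
sudo mdadm --manage /dev/md0 --add /dev/sde

# Wait for the resync to complete before touching the faulty member.
watch cat /proc/mdstat

# Then mark the faulty member failed and remove it before physically pulling it.
sudo mdadm --manage /dev/md0 --fail /dev/sdX --remove /dev/sdX
```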

jayaddison commented 4 weeks ago

> I'll provide some more-detailed error message logs here soon.

sd 6:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
jayaddison commented 4 weeks ago

Ok; it turns out that sdd is not a data drive in the redundant disk array -- it is in fact the system/boot drive on the server.

That complicates matters, because it has no redundancy and would require downtime to replace.

Therefore I am considering whether this alert is an acceptable forcing-function to attempt to serve traffic from two servers in parallel, and then to switch traffic to the spare while we reinstall and rebuild the current server.

jayaddison commented 4 weeks ago

> Therefore I am considering whether this alert is an acceptable forcing-function to attempt to serve traffic from two servers in parallel, and then to switch traffic to the spare while we reinstall and rebuild the current server.

In theory I think this would involve:

...and then much of the above again in reverse once the original primary is ready to rejoin the cluster.

I'm not sure I feel ready to configure PostgreSQL replication as a prerequisite for this process -- that's a lot to learn for the purpose of a one-time maintenance task. Perhaps it would be simpler to put the database into read-only mode -- with the exception of event logging -- and then to recombine the data from both hosts after maintenance is complete.
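
A minimal sketch of the read-only approach, assuming superuser access to the primary database via psql (how the event-logging exception is handled -- for example, by letting that one role opt back out per-session -- is an assumption here):

```sh
# Make new transactions read-only by default; this is a default rather than a hard
# guarantee, which conveniently leaves room for the event-logging exception.
sudo -u postgres psql -c "ALTER SYSTEM SET default_transaction_read_only = on;"
sudo -u postgres psql -c "SELECT pg_reload_conf();"

# The event-logging connection could then opt out explicitly:
#   SET default_transaction_read_only = off;
```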

jayaddison commented 4 weeks ago

> I'm not sure I feel ready to configure PostgreSQL replication as a prerequisite for this process -- that's a lot to learn for the purpose of a one-time maintenance task. Perhaps it would be simpler to put the database into read-only mode -- with the exception of event logging -- and then to recombine the data from both hosts after maintenance is complete.

That seems reasonable. So, the plan becomes:

jayaddison commented 4 weeks ago

> Switching inbound HTTP traffic paths so that the secondary effectively becomes the primary for traffic serving.

Why not simply load-balance? Provided that we are logging to each node individually and recombining the logs later, there should be no requirement for a notion of primary/secondary.

jayaddison commented 4 weeks ago

Latest refinement of the plan:

jayaddison commented 4 weeks ago

Status updates:

  • Add an additional server to the production Kubernetes cluster.

This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.

  • Ensure that the additional server receives some of the load-balanced inbound HTTP traffic.

This will require the additional server to be able to handle TLS termination, because we don't currently have a separate HTTP load-balancer.

LetsEncrypt does apparently allow up to 5 duplicate certificate issuances to exist within a 7-day period, so to avoid copying private key material across the network and onto multiple systems, I think we could enable HTTP (plaintext, without TLS) early on the additional host, and issue a cert for it that we could revoke after the maintenance is completed.
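
A sketch of what issuing that cert could look like on the additional host itself, assuming certbot with an HTTP-01 challenge (the domain shown is illustrative; --standalone binds port 80 directly, so --webroot would be the alternative if the plaintext HTTP service is already listening there):

```sh
# Issue a separate certificate on the additional host, so that the existing host's
# private key never needs to be copied across the network.
sudo certbot certonly --standalone -d www.reciperadar.com

# After maintenance is complete, the extra certificate can be revoked and removed.
sudo certbot revoke --cert-name www.reciperadar.com
sudo certbot delete --cert-name www.reciperadar.com
```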

jayaddison commented 3 weeks ago

> This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.

Running Kubernetes with a CRI-O runtime within unprivileged LXC containers appears not to be a preconfigured default with the available components, and so some assembly/configuration is required. So far I've applied the following adjustments to make progress:

The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

It's not trivial, but is certainly documentable.
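
The specific adjustments aren't reproduced here, but purely as an illustration of the kind of host-side LXC configuration usually involved in running a container runtime inside an unprivileged container (the container name and config path are hypothetical):

```sh
# Relax AppArmor confinement and allow nested container mounts for the guest.
cat <<'EOF' | sudo tee -a /var/lib/lxc/k8s-node/config
lxc.apparmor.profile = unconfined
lxc.apparmor.allow_nesting = 1
lxc.mount.auto = proc:rw sys:rw cgroup:rw
EOF

# Restart the container for the new settings to take effect.
sudo lxc-stop -n k8s-node && sudo lxc-start -n k8s-node
```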

jayaddison commented 3 weeks ago

> The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

This is achieved/worked-around by configuring the not-entirely-supported _CRIO_ROOTLESS environment variable (added to the /etc/default/crio environment file in this case).

jayaddison commented 3 weeks ago

> The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

> This is achieved/worked-around by configuring the not-entirely-supported _CRIO_ROOTLESS environment variable (added to the /etc/default/crio environment file in this case).

(this instructs cri-o that it should create rootless-compatible containers; runc and crun should correspondingly not attempt to configure oom_score_adj for the processes that run those containers)
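
Concretely, and as a sketch only (the assigned value is an assumption -- the variable mainly needs to be present in CRI-O's environment -- and the service unit is assumed to be named crio):

```sh
# Mark the CRI-O service as rootless so that the libcontainers code path skips the
# oom_score_adj adjustments it cannot perform inside an unprivileged container.
echo '_CRIO_ROOTLESS=1' | sudo tee -a /etc/default/crio
sudo systemctl restart crio
```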

jayaddison commented 3 weeks ago
  • Add an additional server to the production Kubernetes cluster.

  • Rebuild and redeploy microservices from source on the additional server.

Success: a first search has been performed on the LXC-containerised instance of RecipeRadar on the secondary server.

Strictly speaking, it is an entirely self-contained, separate Kubernetes cluster. This wasn't the original plan, but it is perfectly acceptable at the moment.
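
As a hedged sketch, bootstrapping that self-contained single-node cluster with kubeadm against the CRI-O runtime might look roughly like this (the pod network CIDR and socket path are common defaults, assumed rather than confirmed here):

```sh
# Initialise a standalone control-plane node, pointing kubeadm at the CRI-O socket.
sudo kubeadm init --cri-socket=unix:///var/run/crio/crio.sock --pod-network-cidr=10.244.0.0/16

# Allow regular workloads to schedule on the sole (control-plane) node.
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```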

Two small additional details had to be handled during the bringup:

jayaddison commented 3 weeks ago

Maintenance status:

Next steps:

jayaddison commented 3 weeks ago

All production traffic is now being served by the standby server; assuming the service remains stable, the original server will soon be powered off and its operating-system disk drive replaced, ready for a fresh installation.

The blog service has not been deployed correctly, and outbound network connectivity problems have occurred during attempts to restore it. On balance, it seems worth proceeding with the maintenance anyway, although this is not ideal.

jayaddison commented 3 weeks ago

Status update:

The rebuild process included deployment of the blog service, so it is confirmed functional again; the TXT B integrity hash appeared to be stale, though, and has been updated to accommodate the updated index page.

Three tasks remain:

jayaddison commented 3 weeks ago

Maintenance is complete, and the faulty system disk drive has been replaced.

In the near future it may make sense to evaluate what would be required to conduct the same style of maintenance process but with PostgreSQL, OpenSearch and Kubernetes nodes truly clustered rather than providing service from each of two independent hosts.