openculinary / infrastructure

This repository documents the steps required to set up a fresh RecipeRadar environment
GNU Affero General Public License v3.0

Alert (resolved): production disk appears to be starting to fail #47

Closed jayaddison closed 3 weeks ago

jayaddison commented 4 weeks ago

Describe the bug: One of the disks in the production server appears to be emitting some fault messages -- details TBD, but they're the kind of write-delay / ATA timeout errors that often indicate that a drive is beginning to fail.

Fortunately the drive is part of a redundant array of disks, but even so, we should replace it soon to avoid the hassle of a larger system recovery/rebuild.

To Reproduce: N/A

Expected behavior: No drive I/O errors should appear in the system dmesg output or in

Screenshots: I'll provide some more-detailed error message logs here soon.

Edit: updated the description -- the disk in question is not part of a redundant storage array.
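
A hedged way to corroborate this kind of report before settling on a replacement plan, assuming smartmontools is available on the host (the device name matches the one identified later in this thread):

```sh
# Check the drive's SMART health summary and attributes (reallocated/pending
# sectors, command timeouts) alongside the kernel error messages.
sudo smartctl -H -A /dev/sdd

# Watch for further kernel-level ATA/I/O errors as they occur.
sudo dmesg --follow | grep -iE 'ata|sdd|i/o error'
```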

jayaddison commented 4 weeks ago

We have a spare drive available, and it should be possible to hot-add that to the affected host's drive bays, then to attach the resulting logical disk to the affected filesystem on the host, and wait for data replication to complete before finally turning off and removing the faulty drive.

That said, this isn't a procedure the team has practiced recently, so we'll aim to be careful and methodical throughout the process.
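
The storage stack isn't spelled out here, so purely as a sketch: if the redundant array were Linux software RAID (md), the hot-add and replication steps described above might look like this (device and array names are illustrative, and this plan was later superseded -- see below):

```sh
# Hot-add the spare drive to the array; resync/replication starts automatically.
sudo mdadm --manage /dev/md0 --add /dev/sde

# Wait for the resync to complete before touching the faulty member.
watch cat /proc/mdstat

# Then mark the faulty member failed and remove it before physically pulling it.
sudo mdadm --manage /dev/md0 --fail /dev/sdX --remove /dev/sdX
```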

jayaddison commented 4 weeks ago

> I'll provide some more-detailed error message logs here soon.

sd 6:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
jayaddison commented 4 weeks ago

Ok; it turns out that sdd is not a data drive in the redundant disk array -- it is in fact the system/boot drive on the server.

That complicates matters, because it has no redundancy and would require downtime to replace.

Therefore I am considering whether this alert is an acceptable forcing-function to attempt to serve traffic from two servers in parallel, and then to switch traffic to the spare while we reinstall and rebuild the current server.

jayaddison commented 4 weeks ago

> Therefore I am considering whether this alert is an acceptable forcing-function to attempt to serve traffic from two servers in parallel, and then to switch traffic to the spare while we reinstall and rebuild the current server.

In theory I think this would involve:

...and then much of the above again in reverse once the original primary is ready to rejoin the cluster.

I'm not sure I feel ready to configure PostgreSQL replication as a prerequisite for this process -- that's a lot to learn for the purpose of a one-time maintenance task. Perhaps it would be simpler to put the database into read-only mode -- with the exception of event logging -- and then to recombine the data from both hosts after maintenance is complete.
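
A minimal sketch of the read-only approach, assuming superuser access to the primary database via psql (how the event-logging exception is handled -- for example, by letting that one role opt back out per-session -- is an assumption here):

```sh
# Make new transactions read-only by default; this is a default rather than a hard
# guarantee, which conveniently leaves room for the event-logging exception.
sudo -u postgres psql -c "ALTER SYSTEM SET default_transaction_read_only = on;"
sudo -u postgres psql -c "SELECT pg_reload_conf();"

# The event-logging connection could then opt out explicitly:
#   SET default_transaction_read_only = off;
```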

jayaddison commented 4 weeks ago

> I'm not sure I feel ready to configure PostgreSQL replication as a prerequisite for this process -- that's a lot to learn for the purpose of a one-time maintenance task. Perhaps it would be simpler to put the database into read-only mode -- with the exception of event logging -- and then to recombine the data from both hosts after maintenance is complete.

That seems reasonable. So, the plan becomes:

jayaddison commented 4 weeks ago

> Switching inbound HTTP traffic paths so that the secondary effectively becomes the primary for traffic serving.

Why not simply load-balance? Provided that we are logging to each node individually and recombining the logs later, there should be no requirement for a notion of primary/secondary.

jayaddison commented 4 weeks ago

Latest refinement of the plan:

jayaddison commented 4 weeks ago

Status updates:

  • Add an additional server to the production Kubernetes cluster.

This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.

  • Ensure that the additional server receives some of the load-balanced inbound HTTP traffic.

This will require the additional server to be able to handle TLS termination, because we don't currently have a separate HTTP load-balancer.

LetsEncrypt does apparently allow up to 5 duplicate certificate issuances to exist within a 7-day period, so to avoid copying private key material across the network and onto multiple systems, I think we could enable HTTP (plaintext, without TLS) early on the additional host, and issue a cert for it that we could revoke after the maintenance is completed.
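
A sketch of what issuing that cert could look like on the additional host itself, assuming certbot with an HTTP-01 challenge (the domain shown is illustrative; --standalone binds port 80 directly, so --webroot would be the alternative if the plaintext HTTP service is already listening there):

```sh
# Issue a separate certificate on the additional host, so that the existing host's
# private key never needs to be copied across the network.
sudo certbot certonly --standalone -d www.reciperadar.com

# After maintenance is complete, the extra certificate can be revoked and removed.
sudo certbot revoke --cert-name www.reciperadar.com
sudo certbot delete --cert-name www.reciperadar.com
```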

jayaddison commented 3 weeks ago

> This is in-progress and will use an LXC-based unprivileged container; building that should follow more-or-less the standard installation process documented in this repository.

Running Kubernetes with a CRI-O runtime within unprivileged LXC containers appears not to be a preconfigured default with the available components, and so some assembly/configuration is required. So far I've applied the following adjustments to make progress:

The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

It's not trivial, but is certainly documentable.
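
The specific adjustments aren't reproduced here, but purely as an illustration of the kind of host-side LXC configuration usually involved in running a container runtime inside an unprivileged container (the container name and config path are hypothetical):

```sh
# Relax AppArmor confinement and allow nested container mounts for the guest.
cat <<'EOF' | sudo tee -a /var/lib/lxc/k8s-node/config
lxc.apparmor.profile = unconfined
lxc.apparmor.allow_nesting = 1
lxc.mount.auto = proc:rw sys:rw cgroup:rw
EOF

# Restart the container for the new settings to take effect.
sudo lxc-stop -n k8s-node && sudo lxc-start -n k8s-node
```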

jayaddison commented 3 weeks ago

> The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

This is achieved/worked-around by configuring the not-entirely-supported _CRIO_ROOTLESS environment variable (added to the /etc/default/crio environment file in this case).

jayaddison commented 3 weeks ago

> The next problem appears to be that some libcontainers code is fussy about process-level oom (out-of-memory) configuration. I think we're getting close to having this working, though.

> This is achieved/worked-around by configuring the not-entirely-supported _CRIO_ROOTLESS environment variable (added to the /etc/default/crio environment file in this case).

(this instructs cri-o that it should create rootless-compatible containers; runc and crun should correspondingly not attempt to configure oom_score_adj for the processes that run those containers)
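
Concretely, and as a sketch only (the assigned value is an assumption -- the variable mainly needs to be present in CRI-O's environment -- and the service unit is assumed to be named crio):

```sh
# Mark the CRI-O service as rootless so that the libcontainers code path skips the
# oom_score_adj adjustments it cannot perform inside an unprivileged container.
echo '_CRIO_ROOTLESS=1' | sudo tee -a /etc/default/crio
sudo systemctl restart crio
```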

jayaddison commented 3 weeks ago
  • Add an additional server to the production Kubernetes cluster.

  • Rebuild and redeploy microservices from source on the additional server.

Success: a first search has been performed on the LXC-containerised instance of RecipeRadar on the secondary server.

Strictly speaking, it is an entirely self-contained, separate Kubernetes cluster. This wasn't the original plan, but it is perfectly acceptable at the moment.
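
As a hedged sketch, bootstrapping that self-contained single-node cluster with kubeadm against the CRI-O runtime might look roughly like this (the pod network CIDR and socket path are common defaults, assumed rather than confirmed here):

```sh
# Initialise a standalone control-plane node, pointing kubeadm at the CRI-O socket.
sudo kubeadm init --cri-socket=unix:///var/run/crio/crio.sock --pod-network-cidr=10.244.0.0/16

# Allow regular workloads to schedule on the sole (control-plane) node.
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```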

Two small additional details had to be handled during the bringup:

jayaddison commented 3 weeks ago

Maintenance status:

Next steps:

jayaddison commented 3 weeks ago

All production traffic is now being served by the standby server; assuming the service remains stable, the original server will soon be powered off and its operating-system disk drive replaced, ready for a fresh installation.

The blog service has not been deployed correctly, and outbound network connectivity problems have occurred during attempts to restore it. On balance, it seems worth proceeding with the maintenance anyway, although this is not ideal.

jayaddison commented 3 weeks ago

Status update:

The rebuild process included deployment of the blog service, so it is confirmed functional again; the TXT B integrity hash appeared to be stale, though, and has been updated to accommodate the updated index page.

Three tasks remain:

jayaddison commented 3 weeks ago

Maintenance is complete, and the faulty system disk drive has been replaced.

In the near future it may make sense to evaluate what would be required to conduct the same style of maintenance process but with PostgreSQL, OpenSearch and Kubernetes nodes truly clustered rather than providing service from each of two independent hosts.