Migration playbooks (post-STFC shutdown) and the "grand playbook"

alanbchristie commented 9 months ago

Related to #1148, this issue describes the components that were not part of the initial migration, which was done to allow the stack to continue to operate while its original cluster was out of action. This issue covers the migration of the Production stack, meaning all the resources required by a self-contained production stack, not the migration of the entire stack development arena.

The components that are missing in #1148 and required to have a fully operational stack are: -

Discourse
Squonk2, which includes:
- The Data Manager
- The Account Server
- The Job and Jupyter Operators
Neo4j for the graph database

This issue does not cover the deployment of AWX, an Ansible playbook server used by the CI/CD process to automate the deployment of new application containers.

If updates need to be supported, we will need: -

Enhancements to playbooks to ensure that a cluster without AWX can be used to deploy new versions of all of the applications.

A "grand playbook" is feasible and could be developed. The prerequisites are: -

A target cluster exists with all the "3rd party" services pre-installed. These would include a compatible: -
1. Storage class
2. NGINX ingress controller
3. Certificate Manager
The "grand playbook" would run in two distinct phases: an "installation phase" and then a "recovery phase".
The "grand playbook" would need simultaneous admin access (from one control machine) to the source and destination clusters.
The "grand playbook" would "extract" all of the original playbook variables from objects present in the source cluster. This would require the inspection of numerous "well known" Kubernetes objects, including a Secret (to obtain usernames and passwords), ConfigMap (for additional configuration information), Pod (for environment variables) and Ingress (for hostnames, paths etc.).
With the data extracted it could then deploy a fresh (empty) "installation" of the source and then move to the "recovery" phase and do what was necessary to copy the relevant databases and file-system content to the destination (scaling down Pods and restarting them etc).
After the "installation" and "recovery" all that would be required would be a redirection of the domains to the new cluster.
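
The "extract" step described above can be sketched roughly as follows. This is only an illustration, not the playbook's actual implementation: it assumes the objects have already been exported from the source cluster as JSON-style dictionaries (for example with `kubectl get ... -o json`), and the function names are hypothetical. Only the manifest fields used (`data`, `spec.containers[].env`, `spec.rules[].host`, `spec.rules[].http.paths[].path`) are standard Kubernetes fields.

```python
import base64
from typing import Any, Dict, List, Tuple

Manifest = Dict[str, Any]


def secret_values(secret: Manifest) -> Dict[str, str]:
    """Decode the base64-encoded 'data' map of a Secret (usernames, passwords)."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in (secret.get("data") or {}).items()
    }


def configmap_values(configmap: Manifest) -> Dict[str, str]:
    """Return the plain-text 'data' map of a ConfigMap."""
    return dict(configmap.get("data") or {})


def pod_env(pod: Manifest) -> Dict[str, str]:
    """Collect literal environment variables from every container in a Pod.

    Entries that use 'valueFrom' (references to Secrets or ConfigMaps) are
    skipped here; they would be resolved from the referenced object instead.
    """
    env: Dict[str, str] = {}
    for container in pod.get("spec", {}).get("containers", []):
        for item in container.get("env") or []:
            if "value" in item:
                env[item["name"]] = item["value"]
    return env


def ingress_routes(ingress: Manifest) -> List[Tuple[str, str]]:
    """List (hostname, path) pairs defined by an Ingress."""
    routes: List[Tuple[str, str]] = []
    for rule in ingress.get("spec", {}).get("rules", []):
        host = rule.get("host", "")
        for path in (rule.get("http") or {}).get("paths", []):
            routes.append((host, path.get("path", "/")))
    return routes
```

The values returned by these helpers would then be mapped onto the original playbook variables before the "installation" phase is run against the destination cluster. That mapping is specific to each application and is not shown here.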

Before the "grand playbook" could safely operate we would probably need to wait until the following conditions were met: -

Ensure no Squonk Jobs were running (or could run)
Ensure no Fragalysis Celery tasks were running (or could run)
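
A minimal sketch of such a pre-flight gate is shown below. It is an assumption-laden illustration rather than part of any existing playbook: the label key and value used to identify Job Pods are hypothetical, the Pod list is assumed to have been exported as JSON-style dictionaries, and the count of active Celery tasks is assumed to be supplied by the caller. The Pod phases are the standard Kubernetes ones.

```python
from typing import Any, Dict, List

# Pod phases that mean work may still be in progress. "Unknown" is treated
# as active so that the gate errs on the side of caution.
ACTIVE_PHASES = {"Pending", "Running", "Unknown"}


def active_pods(pods: List[Dict[str, Any]], label_key: str, label_value: str) -> List[str]:
    """Return the names of Pods carrying the given label that have not finished."""
    names: List[str] = []
    for pod in pods:
        metadata = pod.get("metadata", {})
        labels = metadata.get("labels") or {}
        phase = pod.get("status", {}).get("phase", "Unknown")
        if labels.get(label_key) == label_value and phase in ACTIVE_PHASES:
            names.append(metadata.get("name", "<unnamed>"))
    return names


def migration_blockers(job_pods: List[str], active_celery_tasks: int) -> List[str]:
    """Describe everything that should stop the playbook from starting."""
    blockers: List[str] = []
    if job_pods:
        blockers.append("Squonk Jobs still active: " + ", ".join(job_pods))
    if active_celery_tasks > 0:
        blockers.append(f"{active_celery_tasks} Celery task(s) still active")
    return blockers
```

An empty list from `migration_blockers` would indicate that nothing is running. Preventing new work from starting (the "or could run" part) is a separate step, for example scaling down the relevant workloads, and is not covered by this sketch.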

The "grand playbook" will not be able to: -

Replicate Squonk Jupyter notebook instances. Squonk would need to be able to "understand" that Jupyter notebooks may be lost.
Replicate Dataset volumes - instead the user would simply re-run any Fragalysis/Squonk Jobs, which would recreate the dataset volume data.

Replication of a live cluster will take considerable time. The graph database will take 8 to 12 hours to become live and, depending on the content of the Fragalysis media volume, copying that data alone will take at least 30 minutes.

phraenquex commented 8 months ago

Also: a demonstrator database+media subset. 5 open targets would do the job. Use future ASAP targets, uploaded into v2.

alanbchristie commented 8 months ago

The relocation was generally successful and comprehensive documentation on the relocation of the production stack can be found on ReadTheDocs at: -

https://im-dls-fragalysis-stack-kubernetes.readthedocs.io/en/stable/relocating/index.html

The relocation currently suffers from an inability to generate certificates for wildcard domains (see #1191).

m2ms / fragalysis-frontend, issue #1182