openfoodfoundation / ofn-install

Ansible scripts for provisioning and deploying Open Food Network
54 stars 112 forks source link

Improve deployment setup #891

Closed Matt-Yorkley closed 1 year ago

Matt-Yorkley commented 1 year ago

We've had some issues with deployment this week (errors during deployment) and there are some things we can look at and improve with regards to the technical aspects of the deployment process.

The problem

During deployments there's a window of time between the point that the database is migrated and the point that the new code is applied. This means the database structure and the code are temporarily out of sync, with new changes in the structure of the database present while the old code is still live. This can result in nasty errors while users are interacting with the app in this period, and it comes to an end after the new code is applied and the server is restarted to pick up the changes. Lets call this The Danger Zone.

The solution

Ideally what we want is to minimise this period of time where the database and the code are out of sync. Taking a look at the deployment playbook (https://github.com/openfoodfoundation/ofn-install/blob/master/roles/deploy/tasks/deploy.yml#L134-L173 :eyes:), I can see we currently run the precompile assets task between the database migration and the point where the code is switched over. The compiling can take up to a couple of minutes, so obviously this is a really bad idea. If we can change the order of the tasks here so that the assets are precompiled first, then the database is migrated, then the code is switched immediately after, we will spend a lot less time in The Danger Zone and hence we should have a lot less errors/problems.

Background

From memory, I think in the past we had an issue that the playbook has to run in two different scenarios, namely 1) deploying on a fresh server and 2) deploying on an existing server. In the first case, the database hadn't necessarily been created or updated properly yet and I have a feeling this had caused errors to be thrown during the precompile rake task(?). I'm not sure if this still applies. One of the tricky things here is that the test suite used to check these scenarios so we would know if we broke something by messing with the deployment role, but its currently not working (needs fixing/updating) and manually testing this specific scenario (the first deployment on a completely fresh server) is quite time consuming (and a pain in the ass)...

Next steps

Anyway, the improvement we need is quite clear (precompile assets before migrating the database), and if there's an issue then we can either solve it by making a separate and slightly different playbook/role for the "first" deployment, or adding some checks and conditionals to maybe change the logic a bit on the "first" deploy.

jibees commented 1 year ago

Very nice and understandable explanation, thanks for this. 👏

mkllnk commented 1 year ago

Just adding some context here:

I would also note that, depending on how close you are to the server, each Ansible command can take several seconds and it would be better to execute the danger zone within on Ansible command, for example with a bash script, to finish it as quickly as possible.

dacook commented 1 year ago

Makes sense, great work.

Regarding the "first deploy", after thinking about this I think you're right. It was really nice and simple, that deploy "just works" first time. But perhaps if we move the first-time setup to another playbook, we'll have the freedom to better optimise the weekly deploy.

So I agree 👍

Regarding Ansistrano, I had a quick look and it looks like a good framework to use if we were starting from scratch, but I don't see any practical benefits it can bring now. In fact, like Maikel suggested I'd love to see more (if not all) of the process moved to a shell script..