ministryofjustice / technical-risk-measures

Discussion around how we measure technical risk in the Ministry of Justice.

How do we measure "code can be changed" and "team owns deployment" for a managed service? #1

Closed by minglis 4 years ago

minglis commented 5 years ago

In some services, the application is managed by a third party. Our questions presuppose that an MOJ team is running the service. If this isn't the case, what are the criteria for answering those questions?

jennyd commented 5 years ago

"Code base can be easily changed?" is there to cover the total time to make a change, rather than just the deployment phase.

We added it to complement "Team can deploy in working hours?", "Team who owns the app own the complete deployment?" and "Can deploy multiple times a day?", but maybe these four can be condensed into two along these lines:

Can we make a change to the service with a reasonable and proportionate amount of effort, time and money?

This one could be affected by factors including:

People's understanding of "reasonable and proportionate" could vary a lot across the org, though. I'm not sure how we could be more specific to achieve consistency while still being appropriate for a wide range of scales.

Can we deploy changes to the live service quickly and easily, whenever we want to, during normal working hours, without causing inconvenient downtime for users?

This one would capture slow/flaky/manual/downtime-causing deployment processes, as well as those which have dependencies on multiple teams or time-consuming approval processes. "Quickly and easily" is also open to interpretation, but I suspect less so than "reasonable and proportionate" in the first one.

"Inconvenient" is also open to interpretation, but I'm intending it to suggest that it can be fine for less frequent changes, such as database upgrades or hosting migrations, to involve some element of planned downtime for users, depending on the scale and criticality of the system. We should be wary of implying that all deploys should be zero-downtime regardless of the cost of achieving that, while still adopting practices which minimise downtime by default :)


Suggestions welcome! I'm aware that these two higher-level questions both cover a lot of ground, which could hide complexity or distinctions which are better exposed. Maybe we should wait till we have more data across a wider range of contexts before deciding on this.