docs(GLOSSARY.md): reintroduce Break Glass

lloydchang commented 2 years ago

• reintroduce Break Glass from RC1 at https://github.com/open-gitops/documents/pull/21

• prepend with · · · — — — · · · 🆘 which:
  1. make this idiom/analogy more accessible for a global audience
  2. alphabetize to the bottom of glossary

lloydchang commented 2 years ago

This addresses:

make this idiom/analogy more accessible for a global audience
July 28, 2021 - Principles weekly
- Topics:
  - @heubeck [Florian] Glossary explains “Break Glass” but not why it’s named that way
  - @moshloop [Moshe]: explained
  - May want to make this idiom/analogy more accessible for a global audience
  - Issue https://github.com/open-gitops/documents/issues/73
https://docs.google.com/document/d/1hxifmCdOV5_FbKloDJRWZQHq0ge-trXJKF-BgV4wHVk/edit#heading=h.78a62j9lu1di

and

alphabetize to the bottom of glossary

@scottrigby wrote in https://github.com/open-gitops/documents/pull/40#issuecomment-948898125

Less importantly (but still a factor), because "break glass" starts with "B" it was the first thing people read about when opening the glossary – which tells them when NOT to do GitOps (or when to pause, etc). Even though it should be mentioned somewhere (perhaps best practices?) leading with this seemed like not putting our best foot forward

grmhay commented 2 years ago

Lloyd I'm happy to prepend something so it doesn't become a negative to GitOps adoption. The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.

lloydchang commented 2 years ago

TL;DR:

• It's fine to append Break Glass to the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:

❔ Exactly where are the root causes of the problem @grmhay had summarized, and why?

❔ Are we defining source of truth very differently? (please see below)

@grmhay wrote at https://github.com/open-gitops/documents/pull/37#issue-1024509858:

We (Morgan Stanley) believe that the situation where the source of truth for desired state (e.g. github.com or a git-equivalent that an enterprise may run) is less available than your users' expected SLA for making configuration changes is being left by the community as an issue for the implementer to overcome.

Put succinctly, if Github is unavailable and you want to make changes to your System State, there should be one approach and a set of tooling to allow reconciliation after the fact.

This will both harm adoption of gitops and is inefficient as I believe we shared a common challenge that we can solve once within the project.

The first step, as this project has so well established, is a glossary of terms to allow us to describe the problem and a draft principle to add. I have included these in this PR.

Hi @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome

I don't know which GitOps tool @grmhay is using. If @grmhay is using Flux CD v2, then the commands flux suspend source git and flux resume source git are available.

One plausible approach and set of tooling could look like:

1. Optionally(?), suspend reconciliation of a GitRepository resource, i.e. flux suspend source git
2. Run git commit locally
3. Temporarily, run kubectl apply
4. Eventually, run git push locally once the remote git is available again
5. Optionally(?), resume the suspended GitRepository, i.e. flux resume source git
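The steps above can be sketched as an ordered command list. This is only an illustration under stated assumptions: the GitRepository name `flux-system`, the manifest path `./emergency-fix.yaml`, the remote `origin`, and the branch `main` are hypothetical placeholders, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def break_glass_commands(source, manifest, remote="origin", branch="main",
                         message="break glass: emergency change"):
    """Return the five steps as argument lists, in the order they would run.

    All default values are hypothetical placeholders, not project conventions.
    """
    return [
        # 1. Optional: pause reconciliation of the GitRepository source.
        ["flux", "suspend", "source", "git", source],
        # 2. Record the change in the local repository (works offline).
        ["git", "commit", "--all", "--message", message],
        # 3. Temporarily apply the change directly to the cluster.
        ["kubectl", "apply", "--filename", manifest],
        # 4. Publish the commit; only possible once the remote is reachable.
        ["git", "push", remote, branch],
        # 5. Optional: resume reconciliation of the suspended source.
        ["flux", "resume", "source", "git", source],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run: prints the commands without touching git, the cluster, or Flux.
run(break_glass_commands("flux-system", "./emergency-fix.yaml"))
```

In a real incident, step 4 would wait until the remote git is back, so the sequence would not run in one pass; the list only captures the ordering.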

Rather than GitOps itself, the remote git host (e.g. github.com or GitHub Enterprise, Server and/or Cloud) seems to be one of many(?) root causes of the problem that @grmhay described.

❔ Can the problem be better solved at the remote git implementation level?

❔ Curiously, would @grmhay have a conversation about Service Level Agreements (SLA), High Availability (HA), Active-active, Geo-replication, etc. with the administrator(s) of @grmhay's GitHub Enterprise on-premise, GitHub Enterprise Cloud Support, and/or github.com Support?

Git (originating from git.kernel.org) is designed to be distributed. Out of the box, git works fine locally, e.g. git commit in offline mode. Online mode is needed for git push from one local git to many remote gits.

Remote gits can be managed as a cloud service, or hosted and replicated on-premise, e.g.

• GH: github.com: "We expect that most of these monthly updates will recap periods of time where GitHub was >99% available"

• GHEC: GitHub Enterprise Cloud: "99.95% uptime SLA"

• GHE: Geo-replication on GitHub Enterprise Server

• BDC: Bitbucket Data Center

• WGM: WANDisco Git Multisite

• GL: GitLab active-active git replication
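To put the quoted availability figures in perspective, here is a small arithmetic sketch that converts an availability percentage into the downtime it permits. The 30-day period is an assumption for illustration; the actual measurement window and terms of any vendor's SLA may differ.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for label, percent in [("99% (github.com lower bound quoted above)", 99.0),
                       ("99.95% (GitHub Enterprise Cloud SLA quoted above)", 99.95)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label}: {minutes:.1f} minutes per 30 days")
```

Under this assumption, 99% allows 432.0 minutes (7.2 hours) of downtime per 30 days, while 99.95% allows 21.6 minutes. Whether either figure is acceptable depends on the SLA that users expect for making configuration changes.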

Source of truth:

@grmhay wrote at https://github.com/open-gitops/documents/pull/42#issuecomment-953140715

The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.

❔ Perhaps @grmhay's "source of truth" refers to a managed service specifically because the word "unavailability" is used?

• From my perspective, the source of truth does not depend on a remote git managed service or its uptime availability at all.

• While there can be a github.com with uptime availability of >99%, or GitHub Enterprise Cloud with uptime availability of 99.95%, their uptime availability is unrelated to the source of truth in the format of a unique SHA hash.

• There is a source of truth because each Git commit has a unique SHA hash across all Git repositories in the Universe.
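As a minimal sketch of why the hash does not depend on any host: Git derives an object ID from the object's type, size, and content alone. The example assumes a repository using SHA-1 (Git's long-standing default; SHA-256 repositories also exist), and the author, timestamp, and message values are hypothetical. Uniqueness here rests on the hash function's collision resistance, so it holds in practice rather than as an absolute guarantee.

```python
import hashlib


def git_object_id(kind, body):
    """Compute a Git object ID: SHA-1 over '<kind> <size>\\0' plus the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# The empty blob has the same well-known ID in every SHA-1 Git repository.
assert git_object_id("blob", b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"


def commit_body(tree, author, timestamp, message):
    """Build the text of a parentless commit object (hypothetical values)."""
    who = f"{author} {timestamp} +0000"
    return f"tree {tree}\nauthor {who}\ncommitter {who}\n\n{message}\n".encode()


empty_tree = git_object_id("tree", b"")
body = commit_body(empty_tree, "Example Dev <dev@example.com>", 1635000000,
                   "break glass: emergency change")

# Computing the ID "locally" and "on the remote" gives the same result,
# because nothing about the host enters the calculation.
local_id = git_object_id("commit", body)
remote_id = git_object_id("commit", body)
print(local_id == remote_id)
```

Changing any byte of the commit (for example, the message) produces a different ID, which is what lets a commit made offline be recognized as the same commit after a later git push.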

I empathize with folks and don't want them to get stuck. Still, for any single point of failure (SPOF), folks will need to think it through on their own, depending on exactly where the root causes of the problem are, and why.

Points of failure can happen in many places — from one central system lacking active-active high availability, to federated identity, to distributed computer networking (BGP hijacking or DNS hijacking).

I don't know if @grmhay's specific setup is air-gapped or not. For what it's worth, there is an air-gapped use case described in "How the U.S. Army Software Factory and Enterprise Cloud Management Agency are using Carvel and Cluster API to declaratively manage Kubernetes workloads and clusters in secure air-gapped environments".

GitHub, Git, GitOps are different things:

If many(?) of the root causes are with GitHub Enterprise Cloud, GitHub Enterprise on-premise, or github.com, then they sit at least two levels away from GitOps:

• GitHub with pull requests
• Git with commits
• GitOps with principles

Concretely, if a pull request cannot happen because github.com is unavailable, then the root causes are directly at GitHub. At this point, the root causes aren't directly at local gits on end users' computers, nor directly at GitOps controllers running in Kubernetes and their own set of gits.

That being said, it appears that one of the GitOps tool implementations, Flux CD v2, provides flux suspend source git and flux resume source git. Optionally(?), they may be applicable to the situation where GitHub is unavailable.

To recap:

• It's fine to append Break Glass to the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:

❔ Exactly where are the root causes of the problem @grmhay had summarized, and why?

❔ Are we defining source of truth very differently? (please see above)

Thank you @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome for your time 🙂

lloydchang commented 2 years ago

Above https://github.com/open-gitops/documents/pull/42#issuecomment-955603211

relates to

https://cloud-native.slack.com/archives/C01G9DEE88M/p1639546167386800?thread_ts=1639072194.295800

Sociotechnical Considerations for GitOps: While Availability is a basic principle of Information Security…

Why do companies centralize Git? ... for a lot of companies, it doesn’t make sense to spend resources on re-engineering Git hosting for higher availability to mitigate issues that will most likely not affect their business.

The trade-offs between High Availability and CVCS are unique to each organization’s nuances.

Is it frugality?

DVCS with auditable code reviews, multi-master replication & multi-site exist for organizations that are willing to spend the resources.

Solutions exist in free and open-source software, and from vendors, for example:

Gerrit Multi-Master Configuration: With multiple Gerrit masters it is possible to mitigate server load by allowing users to access a server which has more free resources, and it is also possible to provide higher availability by allowing service to be transferred to any remaining masters when a master fails.

Gerrit Multi-master and Multi-site: an OpenSource solution:

• In 2018, Qualcomm went live with a Gerrit multi-master setup

• In 2019, GerritHub.io went multi-master and multi-site

and

https://cloud-native.slack.com/archives/C01G9DEE88M/p1639553689387200?thread_ts=1639072194.295800

• Git DVCS communities already have solutions for DVCS replication

• GitOps community doesn’t need to reinvent the wheel at all

The fundamental issues are sociotechnical: when many organizations want both Frugality and to Think Big, the two goals conflict, creating (un-)healthy tension.

These fundamental & sociotechnical issues are far beyond the scope of the GitOps Working Group Charter.

Thank you all 🙂
