Closed lloydchang closed 2 years ago
This addresses:
July 28, 2021 - Principles weekly
- Topics:
- @heubeck [Florian] Glossary explains “Break Glass” but not why it’s named that way
- @moshloop [Moshe]: explained
- May want to make this idiom/analogy more accessible for a global audience
- Issue https://github.com/open-gitops/documents/issues/73
and
@scottrigby wrote in https://github.com/open-gitops/documents/pull/40#issuecomment-948898125
Less importantly (but still a factor), because "break glass" starts with "B" it was the first thing people read about when opening the glossary – which tells them when NOT to do GitOps (or when to pause, etc). Even though it should be mentioned somewhere (perhaps best practices?) leading with this seemed like not putting our best foot forward
Lloyd I'm happy to prepend something so it doesn't become a negative to GitOps adoption. The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.
TL;DR:
• It's fine to append to Break Glass in the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:
❔ Exactly where are the root causes of the problem @grmhay had summarized, and why?
❔ Are we defining source of truth very differently? (please see below)
@grmhay wrote at https://github.com/open-gitops/documents/pull/37#issue-1024509858:
We (Morgan Stanley) believe that the situation where the source of truth for desired state (e.g. github.com or a git-equivalent that an enterprise may run) is less available than your users' expected SLA for making configuration changes is being left by the community as an issue for the implementer to overcome.
Put succinctly, if Github is unavailable and you want to make changes to your System State, there should be one approach and a set of tooling to allow reconciliation after the fact.
This will both harm adoption of gitops and is inefficient as I believe we shared a common challenge that we can solve once within the project.
The first step, as this project has so well established, is a glossary of terms to allow us to describe the problem and a draft principle to add. I have included these in this PR.
Hi @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome
I don't know which GitOps tool @grmhay is using. If @grmhay is using Flux CD v2, then there are flux suspend source git and flux resume source git.
One plausible approach and set of tooling could look like:
1. flux suspend source git
2. git commit locally
3. kubectl apply
4. git push locally after remote git is available
5. flux resume source git
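The steps above can be sketched as a small shell script. This is only a sketch under stated assumptions, not a vetted runbook: the source name flux-system, the manifest path ./manifests, the commit message, and the DRY_RUN switch are all hypothetical, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Break-glass sketch for Flux CD v2 while the remote git is unreachable.
# Assumptions (not from the discussion): the GitRepository source is named
# "flux-system" and the manifests to apply live in ./manifests.
set -eu

SOURCE_NAME="${SOURCE_NAME:-flux-system}"    # hypothetical source name
MANIFEST_DIR="${MANIFEST_DIR:-./manifests}"  # hypothetical manifest path
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-3: stop fetching from git, record the change locally, apply by hand.
break_glass() {
  run flux suspend source git "$SOURCE_NAME"
  run git commit -am "break glass: emergency change"
  run kubectl apply -f "$MANIFEST_DIR"
}

# Steps 4-5: once the remote git is back, publish the commit and resume.
restore() {
  run git push
  run flux resume source git "$SOURCE_NAME"
}

# Default action "plan" prints all five steps without executing anything.
case "${1:-plan}" in
  break-glass) break_glass ;;
  restore)     restore ;;
  plan)        DRY_RUN=1; break_glass; restore ;;
esac
```

Running the script with no arguments prints the plan. Running it with DRY_RUN=0 and the argument break-glass, then later restore, would execute the two halves. Depending on the setup, the Kustomizations that consume the source may also need to be suspended (flux suspend kustomization), since they can keep reconciling the last fetched revision.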
Rather than GitOps itself, the remote git (i.e. github.com and GitHub Enterprise Server and/or Cloud) seems to be one of many(?) root causes of the problem that @grmhay described.
❔ Can the problem be better solved at the remote git implementation level?
❔ Out of curiosity, would @grmhay have a conversation about Service Level Agreements (SLA), High Availability (HA), Active-active, Geo-replication, etc. with the administrator(s) of @grmhay's GitHub Enterprise on-premise, GitHub Enterprise Cloud Support, and/or github.com Support?
Git (originating from git.kernel.org) is designed to be distributed. Out of the box, git works fine locally, e.g. git commit in offline mode. Online mode is needed for git push from one local git to many remote gits.
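As a minimal sketch of that point, the script below commits without any network access and then pushes the same commit to two remotes. The local bare repositories stand in for real remote hosts, and the names remote-a, remote-b, and desired-state.yaml are made up for the demo.

```shell
#!/bin/sh
# Sketch: git commit works offline; git push fans out to several remotes.
# Assumption: local bare repositories stand in for real remote git hosts.
set -eu

WORK="$(mktemp -d)"
git init -q --bare "$WORK/remote-a.git"
git init -q --bare "$WORK/remote-b.git"

git init -q "$WORK/local"
cd "$WORK/local"
git config user.name "Example User"
git config user.email "user@example.com"

# Offline: committing only touches the local repository.
echo "replicas: 3" > desired-state.yaml
git add desired-state.yaml
git commit -q -m "record desired state locally"

# Online: push the same commit to each remote once it is reachable.
git remote add remote-a "$WORK/remote-a.git"
git remote add remote-b "$WORK/remote-b.git"
for r in remote-a remote-b; do
  git push -q "$r" HEAD:refs/heads/main
done

# Each remote now holds the commit that was created offline.
LOCAL_SHA="$(git rev-parse HEAD)"
SHA_A="$(git --git-dir="$WORK/remote-a.git" rev-parse refs/heads/main)"
SHA_B="$(git --git-dir="$WORK/remote-b.git" rev-parse refs/heads/main)"
echo "local=$LOCAL_SHA remote-a=$SHA_A remote-b=$SHA_B"
```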
Remote gits can be managed as a cloud service, or hosted and replicated on-premise, e.g.
• GHEC: GitHub Enterprise Cloud: "99.95% uptime SLA"
• GHE: Geo-replication on GitHub Enterprise Server
• BDC: Bitbucket Data Center
• WGM: WANDisco Git Multisite
• GL: GitLab active-active git replication
Source of truth:
@grmhay wrote at https://github.com/open-gitops/documents/pull/42#issuecomment-953140715
The whole point of proposing this was to make sure folks don't get stuck in their adoption because "source of truth" unavailability would be something they'd have to think through on their own.
❔ Perhaps @grmhay's "source of truth" refers to a managed service specifically because the word "unavailability" is used?
• From my perspective, the source of truth does not depend on a remote git managed service or its uptime at all.
• While github.com may have uptime availability of >99%, or GitHub Enterprise Cloud of 99.95%, that uptime is unrelated to the source of truth, which takes the form of a unique SHA hash.
• There is a source of truth because each Git commit is identified by a SHA hash that is, for practical purposes, unique across all Git repositories in the Universe.
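To illustrate the bullets above, here is a hedged sketch showing that a commit ID is derived from the commit's own content and is the same in every copy of the repository, regardless of which host serves it. The repository paths and file name are invented for the demo.

```shell
#!/bin/sh
# Sketch: a commit ID is a hash of the commit object, so it is identical
# in every copy of the repository. Paths and file names are assumptions.
set -eu

REPO="$(mktemp -d)"
git init -q "$REPO/origin"
cd "$REPO/origin"
git config user.name "Example User"
git config user.email "user@example.com"

echo "replicas: 3" > desired-state.yaml
git add desired-state.yaml
git commit -q -m "desired state"

# The ID git reports for the commit.
COMMIT_ID="$(git rev-parse HEAD)"

# Recompute the ID by hashing the raw commit object again.
RECOMPUTED_ID="$(git cat-file commit HEAD | git hash-object -t commit --stdin)"

# A clone (standing in for any other host) sees the same ID.
git clone -q "$REPO/origin" "$REPO/copy"
CLONE_ID="$(git -C "$REPO/copy" rev-parse HEAD)"

echo "commit=$COMMIT_ID recomputed=$RECOMPUTED_ID clone=$CLONE_ID"
```

All three printed values should match, which is why the availability of any one host does not change what the commit is.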
I empathize with folks and don't want them to get stuck. Still, for any single point of failure (SPOF), they will need to think it through on their own, because the answer depends on exactly where the root causes of the problem are, and why.
Points of failure can happen in many places — from one central system lacking active-active high availability, to federated identity, to distributed computer networking (BGP hijacking or DNS hijacking).
I don't know if @grmhay's specific setup is air-gapped or not. For what it's worth, there is an air-gapped use case described in "How the U.S. Army Software Factory and Enterprise Cloud Management Agency are using Carvel and Cluster API to declaratively manage Kubernetes workloads and clusters in secure air-gapped environments".
GitHub, Git, GitOps are different things:
If the root causes of many(?) are with GitHub Enterprise Cloud, or GitHub Enterprise on-premise, or github.com, then that is at least two orders of magnitude between:
Concretely, if a pull request cannot happen because github.com is unavailable, then the root causes are directly at GitHub. At this point, the root causes aren't directly at local gits on end users' computers, nor directly at GitOps controllers running in Kubernetes and their own set of gits.
That being said, it appears that one of the GitOps tool implementations, Flux CD v2, provides flux suspend source git and flux resume source git. Optionally(?), they may be applicable to the situation where GitHub is unavailable.
To recap:
• It's fine to append to Break Glass in the GitOps Glossary (hence this pull request reintroduces it), but I wouldn't explicitly list Break Glass in the GitOps Principles until we have a better understanding of:
❔ Exactly where are the root causes of the problem @grmhay had summarized, and why?
❔ Are we defining source of truth very differently? (please see above)
Thank you @grmhay, @ebourgeois, @scottrigby, @christianh814, @todaywasawesome for your time 🙂
Above https://github.com/open-gitops/documents/pull/42#issuecomment-955603211
relates to
https://cloud-native.slack.com/archives/C01G9DEE88M/p1639546167386800?thread_ts=1639072194.295800
Sociotechnical Considerations for GitOps: While Availability is a basic principle of Information Security…
The trade-offs between High Availability versus CVCS are unique to each organization’s nuances.
Is it frugality?
DVCS with auditable code reviews, multi-master replication & multi-site exist for organizations that spend resources.
Solutions exist in free and open-source software, and from vendors, for example:
Gerrit Multi-Master Configuration: With multiple Gerrit masters it is possible to mitigate server load by allowing users to access a server which has more free resources, and it is also possible to provide higher availability by allowing service to be transferred to any remaining masters when a master fails.
Gerrit Multi-master and Multi-site: an OpenSource solution:
• In 2018, Qualcomm went live with a Gerrit multi-master setup
• In 2019, GerritHub.io went multi-master and multi-site
and
https://cloud-native.slack.com/archives/C01G9DEE88M/p1639553689387200?thread_ts=1639072194.295800
• Git DVCS communities already have solutions for DVCS replication
• GitOps community doesn’t need to reinvent the wheel at all
The fundamental issues are sociotechnical — When many organizations want both Frugality and to Think Big, there is conflict with (un-)healthy tension.
These fundamental & sociotechnical issues are far beyond the scope of the GitOps Working Group Charter.
Thank you all 🙂
• reintroduce Break Glass from RC1 at https://github.com/open-gitops/documents/pull/21
• prepend with · · · — — — · · · 🆘 which:
1. make this idiom/analogy more accessible for a global audience