kubernetes / sig-release


Release dependency management improvements umbrella #601

Open tpepper opened 5 years ago

tpepper commented 5 years ago

There is an ongoing need for better managing external dependencies.

The release team regularly scrambles to collect the current preferred dependency versions. These are inconsistently articulated in multiple files across multiple repos in non-machine-readable ways. And some are even untracked outside of anecdotal lore.

Various prior issues have been opened, for example https://github.com/kubernetes/sig-release/issues/400 and this regularly comes up in release retrospectives.

SIG Release needs to draft a KEP, for implementation by the release team, outlining the problem space and possible solutions. We need a machine-readable, structured, single source of truth. It should have a broad OWNERS set so that changes get wide review and aren't blocked on a small set of reviewers. Code in the project that needs to get "etcd" should get the version specified in this file. Release notes should draw from this file and its changelog. A PR changing a dependency in this file might get a special label, ensured release notes inclusion, and special review. Special review can be needed to ensure one group doesn't upgrade for a fix and introduce a regression in some other code, only to have those owners revert the upgrade and re-introduce the prior bug (this has actually happened multiple times).

One potential problem with this approach, which has been a past blocker, is that this could mean work in a sub-project repo requires checking out some other repo in order to get this hypothetical yaml saying what are the preferred versions.
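
To make this concrete, here is a minimal sketch of what such a machine-readable manifest might look like (the file name, field names, and versions below are illustrative assumptions, not a settled format):

```yaml
# dependencies.yaml — hypothetical single source of truth (names and versions illustrative)
dependencies:
  - name: etcd
    version: 3.3.10
    # Files that must stay in sync with the declared version; "match" is a regex hint.
    refPaths:
      - path: cluster/gce/manifests/etcd.manifest
        match: etcd_version
  - name: golang
    version: 1.12.1
    refPaths:
      - path: build/build-image/cross/VERSION
        match: 'v?[0-9]+\.[0-9]+\.[0-9]+'
```

A small verification script could then fail CI whenever one of the referenced files drifts from the declared version, and release notes tooling could diff this file between release branches.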

jeefy commented 5 years ago

+1000

I'm now more sad I couldn't make the meeting today


justaugustus commented 5 years ago

/area release-eng
/priority important-soon

claurence commented 5 years ago

Circling back to this item: I expressed interest in helping out here. For the purposes of 1.15, I mostly care about defining the list of dependencies that we need to care about.

I can start a draft of a KEP covering what those dependencies are.

yastij commented 5 years ago

/assign

figo commented 5 years ago

/cc

justaugustus commented 5 years ago

Initial PR for discussion here: https://github.com/kubernetes/kubernetes/pull/79366

justaugustus commented 5 years ago

Notice sent to k-dev, @kubernetes/sig-release, @kubernetes/release-team, and @kubernetes/release-engineering regarding the merged changes in https://github.com/kubernetes/kubernetes/pull/79366: https://groups.google.com/d/topic/kubernetes-dev/cTaYyb1a18I/discussion

yastij commented 5 years ago

also error message improvements here: kubernetes/kubernetes#80060

justaugustus commented 5 years ago

build/external: Move dependencies.yaml and update OWNERS - https://github.com/kubernetes/kubernetes/pull/80799
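
For context, Kubernetes OWNERS files are plain YAML, so giving the dependency manifest a broad reviewer set is just a matter of listing more aliases. A rough sketch follows (the alias and label names below are hypothetical placeholders, not the actual merged OWNERS content):

```yaml
# OWNERS alongside dependencies.yaml (hypothetical sketch)
reviewers:
  - release-engineering-reviewers   # hypothetical alias
  - dependency-reviewers            # hypothetical alias
approvers:
  - release-engineering-approvers   # hypothetical alias
labels:
  - area/dependency
```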

tpepper commented 5 years ago

I propose we remove this umbrella issue from the 1.16 milestone and drop the release-team area, assuming that the release notes team for 1.16 (@saschagrunert @onyiny-ang @cartyc @kcmartin @paulbouwer) has the dependencies.yaml file documented and codified as the source of information for the dependencies section: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#dependencies

It may also be about time to close this as complete for the first pass and move from the giant, long-lived umbrella issue to smaller point issues (as is already happening above) for incremental improvement.

saschagrunert commented 5 years ago

Thanks for the hint, I assume that we still update the release notes dependency section manually for 1.16. :)

That may be a bit out of scope, but I wrote a tool some time ago to diff go modules between git releases automatically: https://github.com/saschagrunert/go-modiff

lachie83 commented 5 years ago

/remove-area area/release-team

k8s-ci-robot commented 5 years ago

@lachie83: Those labels are not set on the issue: area/area/release-team

In response to [this](https://github.com/kubernetes/sig-release/issues/601#issuecomment-523541135):

> /remove-area area/release-team

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

lachie83 commented 5 years ago

/remove-area release-team

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

justaugustus commented 4 years ago

/lifecycle frozen
/unassign @claurence

justaugustus commented 4 years ago

@Pluies reached out to me before the holidays with this:

Hi Stephen!

Florent here, I wanted to get in touch to thank you for sharing this – I've been thinking about the "pinned infra dependencies" problem for a long time, and really enjoyed reading about the way Kubernetes deals with this!

I've used it as an inspiration to write Zeitgeist, a language-agnostic dependency checker: https://github.com/Pluies/zeitgeist

It includes the "dependencies declaration in YAML" and "checking all occurrences of dependencies are in sync" features of verifydependencies.go, and extends them with a way to check whether the current version is up to date with its upstream (which could be releases in a GitHub repo, a Helm chart...). Upstreams are based on a plugin system, so more types of upstreams can be added as desired.

Let me know if this is something that could be of interest for the k8s project, I'd be happy to help with the integration (which should be pretty much drop-in). :)

Cheers, Florent Delannoy

What do we think about using zeitgeist?

cc: @dims @cblecker @BenTheElder @liggitt
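
For reference, a zeitgeist-style dependency declaration looks roughly like this (a sketch based on the description above; the exact field names such as `flavour` and the sample values are assumptions, so treat the project's README as authoritative):

```yaml
# Hypothetical zeitgeist dependencies file (field names and values assumed)
dependencies:
  - name: helm
    version: 2.14.3
    upstream:
      flavour: github          # query GitHub releases to see if a newer version exists
      url: helm/helm
    refPaths:
      - path: scripts/install-helm.sh
        match: HELM_VERSION
```

The refPaths consistency check mirrors what verifydependencies.go already does; the upstream check is the new piece.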

BenTheElder commented 4 years ago

Other than vendor/, I don't think the original post makes much sense with code like kubeadm going out of tree.

etcd is not specified in tree by anything other than cluster provisioning tooling, which we have issues open about removing from the tree.

vendor/ already has an established dependency review system, and I don't think it needs any new tooling.

what other dependencies are we talking about?

BenTheElder commented 4 years ago

I would also note that in order to maintain a tool that brings up clusters you pretty much need the freedom to update dependencies at will. We do not force all cluster tools to synchronize on some specific version of e.g. containerd today, and I would not be in favor of doing so in the future.

tpepper commented 4 years ago

I agree with @BenTheElder that users (and vendors) need the ability to override project preferred defaults, if that's what was stated ;) My primary point is we need a stronger definition of "project preferred defaults". We do have these sprinkled around the code. We do bring up clusters, intentionally with certain components and component versions, and run tests with intention of proving specific combinations. We observe and fix real bugs relative to specific external non-golang dependency name/version/release tuples.

At some point we, the collective us as a community, need to understand what we're engineering, coding and testing against, and giving "support". IMO we should do that more strongly. (Also am open to conversation around if we could also not do that and assume vendors will manage that in a sufficiently coherent way, or expect that the dependencies don't have incompatible skews.)

From the older example linked above, there was a time where we tried to track more: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#dependencies

Since then the list of non-go-modules dependencies which are tracked is down to golang and etcd (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#dependencies). If I were to read the difference between that 1.15 and 1.17 list, might I infer that Kubernetes 1.17 and higher now run fine with any cri-tools, cluster autoscaler, cadvisor, CNI, CSI, klog, etc.? I'd love for the ecosystem of projects to be stable enough that we don't need to actively track them in detail. Yet patches to some of those dependencies' version-in-use are frequently proposed for cherry-pick on release branches, which I take as evidence that we do seem to track them.

Another point that changelog shows is that we don't have a canonical source of truth. The user-focused message there is coming from the long series of commits and gives the sum of those (in arbitrary order?):

- Update etcd client side to v3.4.3 (#83987, @wenjiaswe)
- Kubernetes now requires go1.13.4+ to build (#82809, @liggitt)
- Update to use go1.12.12 (#84064, @cblecker)
- Update to go 1.12.10 (#83139, @cblecker)
- Update default etcd server version to 3.4.3 (#84329, @jingyih)

Also interesting to me: K3s proves one may not even need etcd at all (or via a small patch anyway, and with a set of caveats on runtime robustness).

The kubeadm departure from k/k is an interesting case. Since they intend to branch and version with k/k, are they implicitly following any of k/k's implicit dependencies? They (and other installers) do actively track some dependencies.

I'm open to arguments that it's not on us to manage. As the monolith splits and toward a more loosely coupled future, can we argue there is no longer a need for common base expectations? To me the split feels like it makes worse the potential for unmanaged risk of implicit dependencies and end-user confusion.

neolit123 commented 4 years ago

The kubeadm departure from k/k is an interesting case. Since they intend to branch and version with k/k, are they implicitly following any of k/k's implicit dependencies?

yes, unless something is broken.

They (and other installers) do actively track some dependencies.

most installers usually trail behind.

BenTheElder commented 4 years ago

+1 @neolit123

@tpepper :

My primary point is we need a stronger definition of "project preferred defaults". We do have these sprinkled around the code. We do bring up clusters, intentionally with certain components and component versions, and run tests with intention of proving specific combinations. We observe and fix real bugs relative to specific external non-golang dependency name/version/release tuples.

IMO project preferred defaults is a problematic topic for political rather than technical reasons.

Are we going to start advertising preferred CRI and CNI ...?

At some point we, the collective us as a community, need to understand what we're engineering, coding and testing against, and giving "support". IMO we should do that more strongly.

We don't provide support for external tools. Doing so is perhaps not the best idea.

Complete solutions like kops, minikube, kind etc. do package some external tools necessarily and provide their own support there, but for kubernetes to do so seems like a mis-step unless we're prepared to pick a favorite for each option...

If I were to read the difference between that 1.15 and 1.17 list, might I infer that Kubernetes 1.17 and higher now run fine with any cri-tools, cluster autoscaler, cadvisor, CNI, CSI, klog, etc. I'd love for the ecosystem of projects to be stable enough that we don't need to actively track in detail. Yet patches to some of those dependencies' version-in-use are frequently proposed for cherry-pick on release branches, which I take as evidence we do seem to track.

Cluster autoscaler should advertise its own compatibility with Kubernetes and not vice versa, as should CNI implementations and CSI implementations etc. klog is ??? not an issue??

Also interesting to me: K3s proves one may not even need etcd at all (or via a small patch anyway, and with a set of caveats on runtime robustness).

Zero patches away, you can "simply" implement the etcd wire protocol but there are some problems there that I'd rather discuss in another forum :+)

tpepper commented 4 years ago

IMO project preferred defaults is a problematic topic for political rather than technical reasons.

Are we going to start advertising preferred CRI and CNI ...?

In as much as there are classes of interfaces or providers, as an open source project with limited resources I feel like we have a few paths:

  1. Treat one as a reference implementation with NVR declared programmatically and used consistently in test variations. If politics is the worry, this could be worst case. Drop the admittedly awkward word "preferred" and with an open mind this is the easiest for us to rationalize that our project is functional when integrated with something. Simplest test matrix. As long as some of us can keep the one thing running, we're good. In the face of an issue, an (alternate?, non-preferred?, non-default?, something politically correct?) provider has the onus to demonstrate whether the issue is theirs or upstream, fix their issues and get involved in upstream ones where applicable on behalf of their customers.
  2. Have multiple in test. Better than option 1 as it gives A/B comparison across reference implementations. Slippery slope of test matrix. Still has politics: "You let T, U, V in, therefore now you must take a patch for my X too". Odds become good that some of these aren't actually kept in a working state. We spend a lot of cycles trying to specialize beyond our specialization to understand vendor specifics. Can our community rationalize much about a bug that only shows in some deployment variations? Or meh the vendors will fix it and let us know if not? This gets expensive in CI. This gets expensive in human dev/test/debug time. Overall CI health is fuzzy on average.
  3. Have all variations in test. No politics here. Are they all actually in a working state though? Can our community rationalize much about a bug that only shows on some of them? If somebody wants to pay for the CI and staff the engineers...

Seriously though, the latter is obviously highly unlikely to happen. The middle is where we are now. The first is simpler, but for the choice of which one.

At some point we, the collective us as a community, need to understand what we're engineering, coding and testing against, and giving "support". IMO we should do that more strongly.

We don't provide support for external tools. Doing so is perhaps not the best idea.

We don't support external tools, but we do debug problem reports. We support our code running in conjunction with external components in CI, and we welcome end-users' problem reports. That requires our finite resources to have an understanding of, and the ability to debug, a not-very-finite set of runtime combinations. Can we actively manage that complexity or must it be a free-for-all?

I feel like if we declare the things we run in test, do that in common (across the org?), reduce the size of the test matrix, then we can have more realistic conversations about what is containable beyond a simple, common short list of variations and at what cost. We feel out of balance and unsustainable where we are today.

tpepper commented 4 years ago

Relates IMO partly to the conversation in https://github.com/kubernetes/test-infra/issues/18551 and https://github.com/kubernetes/sig-release/issues/966 around establishing a cleaner test plan.

justaugustus commented 2 years ago

/unassign
/help

k8s-ci-robot commented 2 years ago

@justaugustus: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/sig-release/issues/601):

> /unassign
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.