kubernetes-sigs / cli-utils

This repo contains binaries that are built from libraries in cli-runtime.
Apache License 2.0

feat: replace StatusPoller w/ StatusWatcher #572

Closed karlkfi closed 2 years ago

karlkfi commented 2 years ago

feat: replace StatusPoller w/ StatusWatcher

BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher
BREAKING CHANGE: Remove PollInterval (obsolete with watcher)

karlkfi commented 2 years ago

I still have some tests to write and refactoring to do, but I got all the tests passing locally.

karlkfi commented 2 years ago

/test cli-utils-presubmit-master-stress

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/574

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/575

karlkfi commented 2 years ago

/test cli-utils-presubmit-master-stress

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/578

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/576

karlkfi commented 2 years ago

Extracted: https://github.com/kubernetes-sigs/cli-utils/pull/579

karlkfi commented 2 years ago

This is finally ready for review.

I've added unit tests for the new watcher, and all the existing e2e/stress tests pass. Notably, the stress tests are ~30% faster (17m -> 12m)!

karlkfi commented 2 years ago

Heh, that's the flaky test I just fixed with https://github.com/kubernetes-sigs/cli-utils/pull/581

karlkfi commented 2 years ago

There were some concerns about memory usage with informers, so I ran some stress tests.

The times are from CI (they include some apt-gets and kind cluster creation). The memory profiling uses the new Ginkgo v2, which allows exporting pprof profiles (like `go test` does):

$ ginkgo -v -memprofile pprof.mem -output-dir ./.out/ ./test/stress/... -- -v 5
...
$ go tool pprof --alloc_space ./.out/test_stress.test ./.out/test_stress_pprof.mem
(pprof) top

Time Spent (20 QPS)

Polling:  16m50s
Watching: 12m1s
Diff:     -4m49s (-28.6%)

Time Spent (-1 QPS)

Polling:  7m0s
Watching: 6m28s
Diff:     -32s (-7.7%)

These shorter times seem to fluctuate up and down due to other factors (like kind cluster deployment), so the speed diff with throttling disabled on an otherwise empty local cluster might not be significant.

Memory Allocations (-1 QPS)

Polling:  4745.35MB
Watching: 1761.93MB
Diff:     -2983.42MB (-62.8%)

Liujingfang1 commented 2 years ago

@karlkfi thank you for testing the time and memory allocation difference.

The change looks good to me overall. I would like @mortent to also take a look.

/lgtm

karlkfi commented 2 years ago

Summarizing line-comments for posterity:

karlkfi commented 2 years ago

/test cli-utils-presubmit-master

karlkfi commented 2 years ago

/hold

Looks like the watcher tests are flaky. I was finally able to reproduce a similar error locally after a few dozen attempts. It sometimes exits without any events and no error saying why.

karlkfi commented 2 years ago

Ah, an important similarity: when there are no events, it also times out after 10s+.

ash2k commented 2 years ago

I still couldn't find time to review this, but based on the description I have these concerns:

p.s. thank you for taking on such massive tasks like this one!

karlkfi commented 2 years ago

@ash2k: Good catch. Permissions are gonna be a problem.

I had specifically avoided making an Informer for each namespace because it's much more expensive and slow, but I might have to in order to support clients with fewer permissions. It might still be more efficient than the Poller, but not by as much.

Unfortunately, all the tests we have in this repo use cluster-admin, so doing cluster-wide ListAndWatch worked fine. I'm gonna need to add some more granular setup to test having limited permissions.

As-is, DynamicStatusWatcher takes a specific ObjMetadataSet to watch. And while the DynamicStatusWatcher can be re-used, it doesn't share the informers between Watch calls.

The best option I can think of is to have the caller construct either a cluster-scoped watcher (more efficient) or a namespace-scoped watcher (fewer required permissions). This would allow cluster-scoped appliers and namespace-scoped appliers to be equally efficient, but it wouldn't allow for one applier to deploy to multiple namespaces without cluster-wide LIST/WATCH permission.
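To make the trade-off concrete, here's a minimal sketch (my illustration only, not code from this PR) of how the two modes map onto client-go's dynamic informer factory. The factory takes a namespace argument: `metav1.NamespaceAll` for a cluster-wide LIST/WATCH, or a concrete namespace for the lower-permission mode.

```go
package watcherscope

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
)

// newScopedFactory returns a dynamic informer factory that lists/watches
// either the whole cluster (requires cluster-wide LIST/WATCH permission)
// or a single namespace (namespace-level permissions are enough).
func newScopedFactory(client dynamic.Interface, namespace string, resync time.Duration) dynamicinformer.DynamicSharedInformerFactory {
	if namespace == "" {
		namespace = metav1.NamespaceAll // cluster-scoped mode
	}
	return dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, resync, namespace, nil)
}

// Example usage: one shared informer per resource type, scoped by the factory.
func deploymentInformer(factory dynamicinformer.DynamicSharedInformerFactory) {
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	_ = factory.ForResource(gvr).Informer()
}
```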

Would that be good enough for you? Or do you need a single applier to be able to deploy to multiple namespaces without cluster-wide LIST/WATCH permission?

FWIW, in Config Sync, the standard installation spins up a new reconciler (wrapping the cli-utils applier) for each RootSync/RepoSync. When it's for a RootSync, it needs cluster-admin, but when it's for a RepoSync it only needs namespace-admin. So we definitely need to support at least those use cases.

As for sharing Watchers between Appliers, I hadn't really considered that, but it's similar to what @Liujingfang1 wanted to do in Config Sync to share the Watcher with another component. I think we would need two separate layers for that: one that starts/stops informers for all the required resource types in the required scopes, and another that calculates the status. That might be a cleaner design than doing both together, but it's gonna be a fair amount of work to rebuild.

ash2k commented 2 years ago

@karlkfi

> Would that be good enough for you? Or do you need a single applier to be able to deploy to multiple namespaces without cluster-wide LIST/WATCH permission?

Yes, I think allowing the cli-utils consumer (me) to select between cluster-wide and namespaced modes is a good enough solution. Similarly, an app can let the user decide which of the approaches they need. It's a simple model and I like that.

> I think we would need two separate layers for that: one that starts/stops informers for all the required resource types in the required scopes, and another that calculates the status.

Yes, I think that would make sense. It could be a wrapper around the informer factory that does "reference counting" for each informer, or something like that. Maybe a new custom interface would be better. Also, this feels like what the informer factory should already be, but it's dumber than that. Such an "informer factory v2" would be great to have upstream; personally, I've needed it a few times over the years.
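A rough sketch of that reference-counting idea (all names hypothetical; nothing like this exists in client-go today): hand out shared informers per GVR and stop the underlying informer only when the last consumer releases it.

```go
package refcount

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/tools/cache"
)

type entry struct {
	informer cache.SharedIndexInformer
	refs     int
	stop     chan struct{}
}

// Informers hands out shared informers per GVR and tracks how many
// callers hold each one, stopping an informer when its count hits zero.
type Informers struct {
	mu          sync.Mutex
	entries     map[schema.GroupVersionResource]*entry
	newInformer func(schema.GroupVersionResource) cache.SharedIndexInformer
}

// NewInformers wires in a constructor for the underlying informers.
func NewInformers(newInformer func(schema.GroupVersionResource) cache.SharedIndexInformer) *Informers {
	return &Informers{
		entries:     map[schema.GroupVersionResource]*entry{},
		newInformer: newInformer,
	}
}

// Acquire returns the informer for gvr, starting it on first use.
func (r *Informers) Acquire(gvr schema.GroupVersionResource) cache.SharedIndexInformer {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[gvr]
	if !ok {
		e = &entry{informer: r.newInformer(gvr), stop: make(chan struct{})}
		r.entries[gvr] = e
		go e.informer.Run(e.stop) // start once, share among all consumers
	}
	e.refs++
	return e.informer
}

// Release decrements the count and stops the informer at zero.
func (r *Informers) Release(gvr schema.GroupVersionResource) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.entries[gvr]; ok {
		if e.refs--; e.refs == 0 {
			close(e.stop) // informer.Run returns when stop is closed
			delete(r.entries, gvr)
		}
	}
}
```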

karlkfi commented 2 years ago

Another dilemma: the current design doesn't work if you're applying a CR AND its CRD is created async by another process AND there are no other CRDs in the apply set (like a single-namespace applier). Since the watcher wouldn't be watching CRDs, the CR informer would be stopped when it errors with NotFound, but never restarted.

I think I'll have to put some sort of retry-backoff loop in there to re-create the informer.

On the plus side, the Applier already rejects this use case when validating the object set. But we probably actually want to allow it, because it would fix a common problem where the applier doesn't work when the CRD is created indirectly through another object being created (e.g. Gatekeeper ConstraintTemplates dynamically create CRDs for Constraints).

That said, I'm not sure if there's some other reason we need the mapping up front, or whether it can be punted until apply time.
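Roughly what I have in mind (a sketch with a hypothetical startInformer stand-in; not the final implementation): retry informer creation with exponential backoff while the resource type is still missing.

```go
package watchretry

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// startWithRetry retries startInformer with exponential backoff while the
// resource type is still missing (e.g. the CRD hasn't been created yet).
func startWithRetry(startInformer func() error) error {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // first retry after 1s
		Factor:   2.0,             // then 2s, 4s, 8s, ...
		Steps:    6,
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		switch err := startInformer(); {
		case err == nil:
			return true, nil // informer started; stop retrying
		case apierrors.IsNotFound(err):
			return false, nil // CRD not established yet; retry after backoff
		default:
			return false, err // any other error is fatal
		}
	})
}
```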

karlkfi commented 2 years ago

Alright, I wrote a whole new stack of abstractions bottom-up... but I couldn't get it working. So after a few days of that, I gave up, took what I learned, and went back and modified the original design I had here. The updated design is similar but more refined, and (I hope) solves the issues we brought up in the comments.

The biggest change is that there is now an Options struct with a RESTScopeStrategy field, which takes 3 values: root, namespace, and automatic (default). In automatic mode, root-scope is used if the set of objects contains both root-scoped & namespace-scoped resources or multiple namespaces. This should work fine, as discussed, but if we need to allow the caller to force namespace-mode (because they want to apply to multiple namespaces without cluster-admin), we should be able to easily route the strategy option down to the Applier/Destroyer Options in the future.
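In sketch form (names paraphrased from this description, not copied from the final code), the option surface and the automatic rule look roughly like this:

```go
package watchopts

// RESTScopeStrategy selects how the watcher lists/watches resources.
type RESTScopeStrategy int

const (
	RESTScopeAutomatic RESTScopeStrategy = iota // default: choose based on the object set
	RESTScopeRoot                               // one cluster-wide LIST/WATCH per resource type
	RESTScopeNamespace                          // per-namespace LIST/WATCH; fewer permissions needed
)

// Options configures the status watcher.
type Options struct {
	RESTScopeStrategy RESTScopeStrategy
}

// resolveScope implements the "automatic" rule described above: use root
// scope when the set includes root-scoped resources or spans namespaces.
func resolveScope(opts Options, hasRootScoped bool, namespaces int) RESTScopeStrategy {
	if opts.RESTScopeStrategy != RESTScopeAutomatic {
		return opts.RESTScopeStrategy
	}
	if hasRootScoped || namespaces > 1 {
		return RESTScopeRoot
	}
	return RESTScopeNamespace
}
```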

Also, to make sure the DynamicStatusReader wasn't gonna be a huge scale problem without caching or listing, I wrote another stress test that deploys 1,000 Deployments with 1 pod each. This means each Deployment status lookup needs to make another couple of GETs to fetch the ReplicaSet and Pod status as well. We can optimize this in the future, but it seems to work OK for now. Because of this change, I had to add a kind config that creates 10 worker nodes instead of just 1. That was the easiest way to support 1,000 pods without a bunch of fiddling with IP ranges and resource capacity. I'm hoping this works in CI...

karlkfi commented 2 years ago

Oh, one other big change: while chasing some flaky tests, I figured out that the Informer was losing events, because it wasn't fully synchronized before objects were deleted. So the delete event never came and the prune wait task hung. The workaround was to route through the Informer's HasSynced function and wrap it to send a SyncEvent on the status event channel, then make the TaskRunner block tasks until the StatusWatcher was synchronized (meaning all the started informers are synced). Hopefully this doesn't become a hazard at scale. But with the new optimization to use root-scoped LISTs, many fewer LIST calls are made in large-cluster scenarios. Plus, the Informer automatically paginates them, which helps avoid locking up etcd and ruining performance. So even with the task start delay, the end-to-end time should still be much faster.
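A simplified sketch of that synchronization wiring (hypothetical SyncEvent-style type; the real change threads this through the status event channel):

```go
package watchsync

import (
	"context"

	"k8s.io/client-go/tools/cache"
)

// Event is a hypothetical stand-in for the status event type; Synced marks
// the one-time "all informers synced" signal.
type Event struct {
	Synced bool
}

// emitWhenSynced blocks until every informer reports HasSynced, then sends
// a sync event; the task runner can block its first task on this event.
func emitWhenSynced(ctx context.Context, events chan<- Event, informers ...cache.SharedIndexInformer) {
	synced := make([]cache.InformerSynced, 0, len(informers))
	for _, inf := range informers {
		synced = append(synced, inf.HasSynced)
	}
	if cache.WaitForCacheSync(ctx.Done(), synced...) {
		events <- Event{Synced: true}
	}
}
```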

karlkfi commented 2 years ago

Updated profiling numbers...

Time Spent (-1 QPS)

(Unlike the previous perf numbers, these times cover just the Ginkgo test run; they don't include spinning up the kind cluster.)

Polling:  3m25s
Watching: 3m12s
Diff:     -13s (-6.34%)

Memory Allocations (-1 QPS)

Polling:  7.34 GB
Watching: 1.60 GB
Diff:     -5.74 GB (-78.2%)

karlkfi commented 2 years ago

/unhold

This is ready for review again!

karlkfi commented 2 years ago

That's another flaky test, caused by the StatusWatcher being so fast that the Pending WaitEvent was skipped. The easy fix is to pull out the Pending events as optional, as I've done elsewhere. Will fix tomorrow.

karlkfi commented 2 years ago

Fixed the flaky test with a new FilterOptionalEvents helper applied throughout the e2e tests.
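Conceptually, the helper just drops expected events that are marked optional before comparing, something like this sketch (hypothetical types, not the exact test code):

```go
package e2etestutil

// ExpEvent is a hypothetical expected-event matcher; Optional marks
// transient events (like Pending) that a fast watcher may legitimately skip.
type ExpEvent struct {
	Optional bool
	// ... matcher fields elided
}

// filterOptionalEvents returns only the required expected events, so a run
// that skipped transient states still matches the expectation list.
func filterOptionalEvents(expected []ExpEvent) []ExpEvent {
	required := make([]ExpEvent, 0, len(expected))
	for _, e := range expected {
		if !e.Optional {
			required = append(required, e)
		}
	}
	return required
}
```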

karlkfi commented 2 years ago

Comments addressed. Tests passed. Please take another look.

Liujingfang1 commented 2 years ago

/lgtm

karlkfi commented 2 years ago

/hold

Block until we get a release with k8s v1.24 out.

karlkfi commented 2 years ago

> This PR adds quite a few TODOs. Are you planning to address those later?

Honestly, if the watcher works well, I don't think I'll get back around to the TODOs for a while. The main goal here is performance/responsiveness and reduced memory use, and I don't want perfect to be the enemy of good. This codebase already has a pile of tech debt, and the TODO issues I've made keep getting auto-closed, so I figured TODOs in the code were at least a little more persistent and might be discovered again the next time someone is in this code.

I will address some of your comments tho. Thanks for reviewing!

k8s-ci-robot commented 2 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, karlkfi, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/cli-utils/blob/master/OWNERS)~~ [mortent]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
karlkfi commented 2 years ago

/unhold

v0.30.0 passed all tests in Config Sync, so this is unblocked for merge.