karlkfi commented 2 years ago

feat: replace StatusPoller w/ StatusWatcher

Add DefaultStatusWatcher that wraps DynamicClient and manages informers for a set of resource objects.
- Supports two modes: root-scoped & namespace-scoped.
- Root-scoped mode uses root-scoped informers to efficiency and performance.
- Namespace-scoped mode uses namespace-scoped informers to minimize the permissions needed to run and the size of the in-memory object cache.
- Automatic mode selects which mode to use based on whether the objects being watched are in one or multiple namespaces. This is the default mode, optimizing for performance.
- If CRDs are being watched, the creation/deletion of CRDs can cause informers for those custom resources to be created/deleted.
- In namespace-scope mode, if namespaces are being watched, the creation/deletion of namespaces can also trigger informers to be created/deleted.
- All creates/updates/deletes to CRDs also cause RESTMapper reset.
- Allow pods to be unschedulable for 15s before reporting the status as Failed. Any update resets the timer.
Add BlindStatusWatcher for testing and disabling for dry-run.
Add DynamicClusterReader that wraps DynamicClient. This is now used to look up generated resources (ex: Deployment > ReplicaSets > Pods).
Add DefaultStatusReader which uses a DelegatingStatusReader to wrap a list of conventional and specific StatusReaders. This should make it easier to extend the list of StatusReaders.
Move some pending WaitEvents to be optional in tests, now that StatusWatcher can resolve their status before the WaitTask starts.
Add a new Thousand Deployments stress test (10x kind nodes)
Add some new logs for easier debugging
Add internal SyncEvent so that apply/delete tasks don't start until the StatusWatcher has finished initial synchronization. This helps avoid missing events from actions that happen while synchronization is incomplete.
Filter optional pending WaitEvents when testing.

BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher BREAKING CHANGE: Remove PollInterval (obsolete with watcher)

karlkfi commented 2 years ago

I still have some tests to write and refactoring to do, but I got all the tests passing locally.

karlkfi commented 2 years ago

/test cli-utils-presubmit-master-stress

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/574

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/575

karlkfi commented 2 years ago

/test cli-utils-presubmit-master-stress

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/578

karlkfi commented 2 years ago

Extracted https://github.com/kubernetes-sigs/cli-utils/pull/576

karlkfi commented 2 years ago

Extracted: https://github.com/kubernetes-sigs/cli-utils/pull/579

karlkfi commented 2 years ago

This is finally ready for review.

I've added unit tests for the new watcher, and all the existing e2e/stress tests pass. Notably, the stress tests are ~30% faster (17m -> 12m)!

karlkfi commented 2 years ago

heh. that's the flaky test i just fixed with https://github.com/kubernetes-sigs/cli-utils/pull/581

karlkfi commented 2 years ago

There were some concerns about memory usage with informers, so I ran some stress tests.

The times are from CI (includes some apt-gets and kind cluster creation). The memory profiling is with the new ginkgo v2 which allows exporting pprof (like go test does):

$ ginkgo -v -memprofile pprof.mem -output-dir ./.out/ ./test/stress/... -- -v 5
...
$ go tool pprof --alloc_space ./.out/test_stress.test ./.out/test_stress_pprof.mem
(pprof) top

Time Spent (20 QPS)

Polling: 16m50s Watching: 12m1s Diff: - 4m49s (- 28.6%)

Time Spent (-1 QPS)

Polling: 7m0s Watching: 6m28s Diff: - 32s (- 7.7%)

These shorter times seem to fluctuate up and down due to other factors (like kind cluster deployment), so the speed diff with throttling disabled on an otherwise empty local cluster might not be significant.

Memory Allocations (-1 QPS)

Polling: 4745.35MB Watching: 1761.93MB Diff: - 2983.42 (- 62.8%)

Liujingfang1 commented 2 years ago

@karlkfi thank you for testing the time and memory allocation difference.

The change looks good to me overall. Would like @mortent also take a look.

/lgtm

karlkfi commented 2 years ago

Summarizing line-comments for posterity:

I added code to handle delete & re-add of CRDs, but that's not a scenario we're likely to see during an apply/destroy, unless it's caused by a 3rd party, because the current task queue always deletes after applying. The existing e2e tests do prove that adding and deleting CRDs with CRs in the same sync works tho.
CRD version upgrades are probably going to be ignored by the watcher. It should continue returning the GVK it was originally asked for, which is the default version at the beginning of the watch, or whenever the CRD is first created. We're not doing much with the response object other than checking status and standard metadata fields, so this shouldn't be a big deal, but it could mean that new fields used for checking status won't be available until the next sync after a CRD is updated. We can improve this behavior later.
The DynamicClusterReader may not be generally useful outside of the DynamicStatusWatcher, because it doesn't support specifying the version to GET/LIST. The fact that it looks up mapping every time is also more expensive than might be necessary. The ClusterReader API itself is not really ideal, but already used by the Poller, so it was used for the Watcher for consistency. There are lots of potential enhancements we could do here, but they may require changes to (or removal of) the Poller as well.

karlkfi commented 2 years ago

/test cli-utils-presubmit-master

karlkfi commented 2 years ago

/hold

looks like the watcher tests are flaky. I was able to finally reproduce a similar error locally after a few dozen attempts. It sometimes exists without any events and no error saying why.

karlkfi commented 2 years ago

Ah, important similarity: when there are no events it also timed out after 10s+

ash2k commented 2 years ago

I still couldn't find time to review this, but based on description I have these concerns:

Currently cli-utils only accesses objects that the user works with. With the informer model it will require reading/watching same GKs but across all namespaces. Users of a consumer program may not have permissions for that. It's typical to have permissions for a single namespace so cli-utils will not work in this situation, right? i.e. the new version requires more permissions. What can we do here? Maybe detect if all objects are within a namespace? Or maybe cli-utils can expose a flag to let the app/user configure it to use per-namespace informers? This also may be desired for huge clusters where it's better to watch several individual namespaces rather than the whole lot.
My app uses multiple (0 or more, depending on user-supplied config) appliers. Can I share the informers among them? Sorry, still haven't looked at the code. Presumably DynamiciStatusWatcher is exactly for that purpose (one instance can be supplied to multiple appliers/etc), but I wanted to confirm.

p.s. thank you for taking on such massive tasks like this one!

karlkfi commented 2 years ago

@ash2k: Good catch. Permissions are gonna be a problem.

I had specifically avoided making an Informer for each namespace because it's much more expensive & slow, but I might have to to support clients with less permissions. It might still be more efficient than the Poller, but not by as much.

Unfortunately, all the tests we have in this repo use cluster-admin, so doing cluster-wise ListAndWatch worked fine. I'm gonna need to add some more granular setup to test having limited permissions.

As-is, DynamicStatusWatcher takes a specific ObjMetadataSet to watch. And while the DynamicStatusWatcher can be re-used, it doesn't share the informers between Watch calls.

The best option I can think of is to have the caller construct either a cluster-scoped watcher (more efficient) or a namespace-scoped watcher (fewer required permissions). This would allow cluster-scoped appliers and namespace-scoped appliers to be equally efficient, but it wouldn't allow for one applier to deploy to multiple namespaces without cluster-wide LIST/WATCH permission.

Would that be good enough for you? Or do you need a single applier to be able to deploy to multiple namespaces without cluster-wide LIST/WATCH permission?

FWIW, in Config Sync, the standard installation spins up a new reconciler (wrapping the cli-utils applier) for each RootSync/RepoSync. When it's for a RootSync, it needs cluster-admin, but when it's for a RepoSync it only needs namespace-admin. So we definitely need to support at least those use cases.

As for sharing Watchers between Appliers, I hadn't really considered that, but it's similar to what @Liujingfang1 wanted to do in Config Sync to share the Watcher with another component. I think we would need two separate layers for that: one that starts/stops informers for all the required resource types in the required scopes, and another that calculates the status. That might be a cleaner design than doing both together, but it's gonna be a fair amount of work to rebuild.

ash2k commented 2 years ago

@karlkfi

Would that be good enough for you? Or do you need a single applier to be able to deploy to multiple namespaces without cluster-wide LIST/WATCH permission?

Yes, I think allowing cli-utils consumer (me) to select between cluster-wide or namespaced mode is a good enough solution. Similarly, an app can let the user decide which of the approaches they need. It's a simple model and I like that.

I think we would need two separate layers for that: one that starts/stops informers for all the required resource types in the required scopes, and another that calculates the status.

Yes, I think that would make sense. It could be a wrapper around informer factory that is doing "reference counting" for each informer or something like that. Maybe a new custom interface would be better. Also, this feels like what informer factory should already be, but it's dumber than that. Such "informer factory v2" would be great to have upstream - personally I needed it a few times over the years.

karlkfi commented 2 years ago

Another dilemma: The current design doesn't work if you're applying a CR AND it's CRD is created async by another process AND there are no other CRDs in the apply set (like a single-namespace applier). Since the watcher wouldn't be watching CRDs, the CR informer would be stopped when it errors with NotFound, but never restarted.

I think I'll have to put some sort of retry-backoff loop in there to re-create the informer.

On the plus side, the Applier already rejects this use case when validating the object set. But we probably actually want to allow it, because it would fix a common problem where the applier doesn't work when the CRD is created indirectly through another object being created (ex: gatekeeper ConstraintTemplates dynamically create CRDs for Constraints).

That said, I'm not sure if there's some other reason we need the mapping up front, or whether it can be punted until apply time.

karlkfi commented 2 years ago

Alright, I wrote a while new stack of abstractions bottom up... but I couldn't get it working. So after a few days of that, I gave up, took what I learned, and went back and modified the original design that I had here. This updated design is similar but more refined and solves the issues we brought up in comments (I hope).

The biggest change is that there is now an Options struct with a RESTScopeStrategy field, which takes 3 values: root, namespace, and automatic (default). In automatic mode, root-scope is used if the set of objects contains both root-scoped & namespace-scoped resources or multiple namespaces. This should work fine, as discussed, but if we need to allow the caller to force namespace-mode (because they want to apply to multiple namespaces without cluster-admin), we should be able to easily route the strategy option down to the Applier/Destroyer Options in the future.

Also, to make sure that the DynamicStatusReader wasn't gonna be a huge scale problem without caching or listing, I wrote another stress test that deploys 1,000 Deployments with 1 pod each. This means that each Deployment status lookup needs to make another couple GETs to get the ReplicaSet and Pod status as well. We can optimize this in the future, but it seems to work ok for now. Because of this change, I had to add a kind config that makes 10 worker nodes, instead of just 1. That was the easiest way to support 1,000 pods without a bunch of fiddling with IP ranges and resource capacity. I'm hoping this works in CI...

karlkfi commented 2 years ago

Oh, one other big change: Because of some flaky tests, I figured out that the Informer was losing events, because it wasn't being fully synchronized before objects were being deleted. So the delete update never came and the prune wait task hangs. The workaround was to route through the Informer's HasSynced function and wrap it to send a SyncEvent on the status event channel, then make the TaskRunner block tasks until the StatusWatcher was synchronized (meaning all the started informers are synced). Hopefully this doesn't become a hazard at scale. But I think with the new optimization to use root-scoped LISTs there are MANY FEWER lists calls made in large cluster scenarios. Plus, the Informer automatically paginates them, which can help avoid locking up etcd and ruining performance. So even with the task start delay, the end-to-end time should still be much faster.

karlkfi commented 2 years ago

Updated profiling numbers...

Time Spent (-1 QPS)

(Unlike the previous perf numbers, these times are just the Ginkgo test run. It doesn't include spinning up the kind cluster.)

Polling: 3m25s Watching: 3m12s Diff: - 13s (- 6.34%)

Memory Allocations (-1 QPS)

Polling: 7.34 GB Watching: 1.60 GB Diff: - 5.74 (- 78.2%)

karlkfi commented 2 years ago

/unhold

This is ready for review again!

karlkfi commented 2 years ago

That’s another flaky test from the StatusWatcher being so fast the Pending WaitEvent was skipped. Easy fix is to pull out the Pending events as optional, as I’ve done elsewhere. Will fix tomorrow.

karlkfi commented 2 years ago

Fixed the flaky test with a new FilterOptionalEvents added throughout the e2e tests.

karlkfi commented 2 years ago

Comments addressed. Tests passed. Please take another look.

Liujingfang1 commented 2 years ago

/lgtm

karlkfi commented 2 years ago

/hold

Block until we get a release with k8s v1.24 out.

karlkfi commented 2 years ago

This PR adds quite a few TODOs. Are you planning to address those later?

Honestly, if the watcher works well, I don't think I'll get back around to the TODOs for a while. The main goal here is performance/responsiveness and reduced memory use, and I don't want perfect to be the enemy of good. This codebase already has a pile of tech debt and the TODO issues I've made keep getting auto-closed, so I figured TODOs in the code were at least a little more persistent and might be discovered again, the next time someone is in this code.

I will address some of your comments tho. Thanks for reviewing!

k8s-ci-robot commented 2 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, karlkfi, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes-sigs/cli-utils/blob/master/OWNERS)~~ [mortent] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

karlkfi commented 2 years ago

/unhold

v0.30.0 passed all tests in Config Sync, so this this unblocked for merge.

kubernetes-sigs / cli-utils

feat: replace StatusPoller w/ StatusWatcher #572

Time Spent (20 QPS)

Time Spent (-1 QPS)

Memory Allocations (-1 QPS)

Time Spent (-1 QPS)

Memory Allocations (-1 QPS)