kubernetes-sigs / cli-utils

This repo contains binaries built from libraries in cli-runtime.
Apache License 2.0

Bug: dependency waiting blocks too aggressively #455

Open karlkfi opened 2 years ago

karlkfi commented 2 years ago

Problems

The applier/destroyer pipelines interleave Apply/Prune tasks with Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). Combined with complex resource sets that have multiple dependency branches, this behavior has the following impact (a rough sketch of the task sequence follows the list):

  1. apply/prune of some objects can be blocked even when their own dependencies are ready
  2. each blocking wait lasts as long as the slowest reconciliation in the previous apply phase, even when nothing times out
  3. because a wait timeout currently terminates the pipeline, any object that causes a wait timeout blocks all objects in subsequent phases from being applied/pruned
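
To make the blocking concrete, here is a minimal, hypothetical sketch of the phase-then-wait task queue. The types are stand-ins for illustration only, not the actual cli-utils task implementations:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in types for the applier's task pipeline; the real implementation
// lives in cli-utils' apply/prune task internals.
type object struct{ name string }

type task interface{ run() error }

// applyTask applies every object in one phase.
type applyTask struct{ objs []object }

func (t applyTask) run() error {
	for _, o := range t.objs {
		fmt.Println("applying", o.name)
	}
	return nil
}

// waitTask blocks until every object in its phase reconciles (becomes Current)
// or the timeout expires. While it blocks, nothing in later phases is applied,
// even objects whose own dependencies are already ready.
type waitTask struct {
	objs    []object
	timeout time.Duration
}

func (t waitTask) run() error {
	deadline := time.Now().Add(t.timeout)
	for _, o := range t.objs {
		for !isCurrent(o) { // placeholder for a real status check (e.g. kstatus)
			if time.Now().After(deadline) {
				return fmt.Errorf("timed out waiting for %s", o.name)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}

func isCurrent(o object) bool { return true } // stub: pretend everything reconciles

func main() {
	// Phase 1 (e.g. clusters), a blocking wait, then phase 2 (node pools).
	queue := []task{
		applyTask{objs: []object{{"cluster-a"}, {"cluster-b"}}},
		waitTask{objs: []object{{"cluster-a"}, {"cluster-b"}}, timeout: time.Minute},
		applyTask{objs: []object{{"pool-a"}, {"pool-b"}}},
	}
	for _, t := range queue {
		if err := t.run(); err != nil {
			fmt.Println("pipeline stopped:", err) // historically a wait timeout ended the run here
			return
		}
	}
}
```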

Example 1

For example, here's an infra dependency chain with branches (just deploying two GKE clusters with a CC cluster):

If any cluster fails to apply (e.g., an error in its config), then both node pools are blocked.
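
For reference, an explicit edge between a node pool and its cluster can be declared with the `config.kubernetes.io/depends-on` annotation, which cli-utils resolves into ordering. The Config Connector group/kinds and names below are illustrative, not taken from this issue:

```go
package manifests

// Illustrative manifest only: the node pool declares an explicit edge to its
// cluster, so the applier orders it after the cluster and waits for the
// cluster to reconcile before applying the pool.
const nodePoolManifest = `
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: pool-a
  namespace: infra
  annotations:
    config.kubernetes.io/depends-on: container.cnrm.cloud.google.com/namespaces/infra/ContainerCluster/cluster-a
`
```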

Example 2

Another example is a single apply that includes multiple CRDs and namespaces along with the resources that depend on them:

If any CRD or namespace fails to apply (e.g., blocked by a policy webhook or a config error), then all the deployments and everything else in those namespaces are blocked.
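
Roughly, the implicit ordering groups CRDs and Namespaces into an earlier phase than everything that relies on them. A hypothetical sketch of that grouping (not the cli-utils sorter):

```go
package main

import "fmt"

// obj is a stand-in for an object identity.
type obj struct{ kind, name string }

// splitPhases mimics the implicit ordering: CRDs and Namespaces go first, and
// a single blocking wait between the phases means one failed CRD or Namespace
// holds up everything in the later phase, related or not.
func splitPhases(objs []obj) (prereqs, rest []obj) {
	for _, o := range objs {
		switch o.kind {
		case "CustomResourceDefinition", "Namespace":
			prereqs = append(prereqs, o)
		default:
			rest = append(rest, o)
		}
	}
	return prereqs, rest
}

func main() {
	objs := []obj{
		{"Namespace", "team-a"},
		{"CustomResourceDefinition", "widgets.example.com"},
		{"Deployment", "team-a/app"},
		{"Widget", "team-a/widget-1"},
	}
	prereqs, rest := splitPhases(objs)
	fmt.Println("phase 1 (apply, then wait for all):", prereqs)
	fmt.Println("phase 2 (blocked until phase 1 is Current):", rest)
}
```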

Possible solutions

  1. Continue on wait timeout
  2. Dependency filter (see the sketch below)
  3. Parallel apply
  4. Async (graph) apply
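
To illustrate the dependency filter option (2 above), here is a minimal sketch with hypothetical types, not the cli-utils filter interface: skip only the objects whose declared dependencies have not reconciled, and let unrelated objects proceed.

```go
package main

import "fmt"

// depFilter decides, per object, whether to skip applying it because one of
// its dependencies has not reconciled yet. Names and types are illustrative.
type depFilter struct {
	deps      map[string][]string  // object -> its dependencies
	isCurrent func(id string) bool // has this dependency reconciled?
}

func (f depFilter) skip(id string) (bool, string) {
	for _, dep := range f.deps[id] {
		if !f.isCurrent(dep) {
			return true, fmt.Sprintf("dependency %q not reconciled", dep)
		}
	}
	return false, ""
}

func main() {
	f := depFilter{
		deps:      map[string][]string{"pool-a": {"cluster-a"}},
		isCurrent: func(id string) bool { return id != "cluster-a" }, // pretend cluster-a failed
	}
	if skip, reason := f.skip("pool-a"); skip {
		// Only pool-a is held back; objects with no edge to cluster-a still apply.
		fmt.Println("skipping pool-a:", reason)
	}
}
```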

karlkfi commented 2 years ago

@mortent @haiyanmeng @seans3 I've tried to document the motivation behind these proposals we discussed, but haven't gone into a lot of detail about how they might work.

Should we tackle all of them in top-down order? Or should we skip straight to the hardest one, which makes the others obsolete? Got other ideas that would resolve the same issues?

mortent commented 2 years ago

So I don't think we should do "Continue on wait timeout" (where we essentially continue applying all resources after a timeout) as the default at all. It effectively breaks depends-on and apply-time mutations.

The dependency filter seems like a good temporary fix if we feel async apply is complex. It should have correct behavior, although there are some drawbacks.

The parallel apply is ok if we think it significantly improves performance. Otherwise I'm not convinced it is worth the effort.

Long term I think the async apply is the best solution. It does address several issues with both behavior and performance. It probably needs a design doc to get into some more detail about how it should work.

seans3 commented 2 years ago

Morten is wrong, and we should "Continue on wait timeout". While it does not work for depends-on and apply-time mutations, these are ALREADY BROKEN. We already continue on error--just not for wait timeouts. This change would NOT make the current situation worse. But it will be strictly better than the current task execution stoppage for a wait timeout. And it will fix the long-standing bug of not running the final inventory task on wait timeout.

The correct error handling for depends-on and apply-time mutation is something we should work toward. But in the short term, "continuing on wait timeout" is clearly better than the current state. I will be merging it shortly.

As for the other priorities, I believe the async apply, since it will be the longest and hardest, should not be our first effort.

mortent commented 2 years ago

The PR in question doesn't make it the default, so I'm ok with that change. It has another issue around how we use the error event, which I have pointed out on the PR.

I don't disagree that some of the behavior is broken today; I just think we should do the more permanent fix with the dependency filter (since we all agree the async apply is a bigger change). The current behavior has been that way for a while, so I think we can take the time to fix it properly.

karlkfi commented 2 years ago

Per discussion elsewhere, we decided to change the behavior with https://github.com/kubernetes-sigs/cli-utils/pull/451 to always continue on wait timeouts (and other apply/destroy errors, as was the previous behavior).

This is a change in behavior, but the previous behavior was considered broken (or at least internally inconsistent).

Work on Dependency Filter and Parallel Apply should now be unblocked; it's mostly a question of priority. It sounds like there's disagreement on whether to do those incremental changes first or skip straight to the Async Apply.
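
As a rough illustration of what continuing on wait timeout means for the task runner (stand-in types, not the cli-utils task API): timeout errors are collected and reported, but later tasks, including the final inventory update, still run.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a stand-in for one step in the apply pipeline.
type task struct {
	name string
	run  func() error
}

// runAll keeps going after a failed task instead of terminating the pipeline,
// collecting the errors so they can still be surfaced to the caller.
func runAll(tasks []task) []error {
	var errs []error
	for _, t := range tasks {
		if err := t.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", t.name, err))
			continue // previously the pipeline stopped here on a wait timeout
		}
	}
	return errs
}

func main() {
	timeout := errors.New("reconcile timeout")
	tasks := []task{
		{"apply phase 1", func() error { return nil }},
		{"wait phase 1", func() error { return timeout }},
		{"apply phase 2", func() error { return nil }},
		{"update inventory", func() error { return nil }}, // still runs
	}
	for _, err := range runAll(tasks) {
		fmt.Println("error:", err)
	}
}
```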

karlkfi commented 2 years ago

Status update...

Continue on wait timeout has already been implemented:

We also want to add an optional Continue on Validation Error feature as part of the WIP validation redesign: https://github.com/kubernetes-sigs/cli-utils/pull/488

Dependency Filter is planned and will become more urgent once we have Continue on Validation Error.

Parallel Apply is also planned (Q1 2022), but is primarily a speed concern, not a correctness concern.
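
A minimal sketch of what parallel apply within a single phase could look like, assuming a hypothetical applyObject helper and a bounded concurrency limit (this is not the cli-utils API). Objects within a phase have no edges between each other, so ordering guarantees across phases are unchanged:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// applyObject is a stand-in for a real server-side apply call.
func applyObject(name string) error {
	fmt.Println("applying", name)
	return nil
}

// applyPhase applies the objects of one phase concurrently instead of one at
// a time, bounding the number of in-flight requests to limit server load.
func applyPhase(names []string) error {
	var g errgroup.Group
	g.SetLimit(4)
	for _, name := range names {
		name := name // capture loop variable (pre-Go 1.22)
		g.Go(func() error { return applyObject(name) })
	}
	return g.Wait()
}

func main() {
	if err := applyPhase([]string{"cluster-a", "cluster-b"}); err != nil {
		fmt.Println("phase failed:", err)
	}
}
```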

Async (graph) apply will hopefully be implemented eventually as well, because it solves multiple problems, but it hasn't been scheduled yet and may take a long time to get right.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

karlkfi commented 2 years ago

Remaining fixes are still TODO

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

karlkfi commented 2 years ago

/remove-lifecycle rotten
/lifecycle frozen

FWIW, the dependency filter was added. Also, now that client-side QPS throttling has been disabled, there’s not much value in adding parallel apply unless you want to add even more server load. The ideal solution will be to eventually rewrite the apply scheduler to use an asynchronous graph.

Unfortunately, it’s not high on the priority list yet, because of other issues. But hopefully we’ll get around to it soon.
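
For reference, a minimal sketch of what an async, graph-driven scheduler could look like, with hypothetical names and a hard-coded graph (not cli-utils code): each object waits only on its own dependencies, so a failed branch no longer blocks unrelated branches.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Toy dependency graph: each node pool depends on its own cluster only.
	deps := map[string][]string{
		"cluster-a": nil,
		"cluster-b": nil,
		"pool-a":    {"cluster-a"},
		"pool-b":    {"cluster-b"},
	}

	done := make(map[string]chan struct{}, len(deps))
	for id := range deps {
		done[id] = make(chan struct{})
	}

	var wg sync.WaitGroup
	for id, parents := range deps {
		wg.Add(1)
		go func(id string, parents []string) {
			defer wg.Done()
			for _, p := range parents {
				<-done[p] // wait only for this object's own dependencies
			}
			fmt.Println("applying", id)
			// A real scheduler would apply and then watch status here; a
			// failure would block only this object's descendants.
			close(done[id])
		}(id, parents)
	}
	wg.Wait()
}
```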