kubernetes-sigs / cli-utils

This repo contains binaries built from libraries in cli-runtime.
Apache License 2.0

Bug: dependency waiting blocks too aggressively #455

Open karlkfi opened 2 years ago

karlkfi commented 2 years ago

Problems

The applier/destroyer pipelines interleave Apply/Prune tasks with Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). Combined with complex resource sets that have multiple dependency branches, this behavior has the following impact (a rough sketch of the task sequence follows the list):

  1. apply/prune of some objects can be blocked even when their own dependencies are ready
  2. each blocking wait lasts as long as the slowest reconciliation in the previous apply phase, even when nothing times out
  3. because a wait timeout currently terminates the pipeline, any object that causes a wait timeout blocks all objects in subsequent phases from being applied/pruned
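
To make the blocking concrete, here is a minimal, hypothetical sketch of the phase-then-wait task queue. The types are stand-ins for illustration only, not the actual cli-utils task implementations:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in types for the applier's task pipeline; the real implementation
// lives in cli-utils' apply/prune task internals.
type object struct{ name string }

type task interface{ run() error }

// applyTask applies every object in one phase.
type applyTask struct{ objs []object }

func (t applyTask) run() error {
	for _, o := range t.objs {
		fmt.Println("applying", o.name)
	}
	return nil
}

// waitTask blocks until every object in its phase reconciles (becomes Current)
// or the timeout expires. While it blocks, nothing in later phases is applied,
// even objects whose own dependencies are already ready.
type waitTask struct {
	objs    []object
	timeout time.Duration
}

func (t waitTask) run() error {
	deadline := time.Now().Add(t.timeout)
	for _, o := range t.objs {
		for !isCurrent(o) { // placeholder for a real status check (e.g. kstatus)
			if time.Now().After(deadline) {
				return fmt.Errorf("timed out waiting for %s", o.name)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}

func isCurrent(o object) bool { return true } // stub: pretend everything reconciles

func main() {
	// Phase 1 (e.g. clusters), a blocking wait, then phase 2 (node pools).
	queue := []task{
		applyTask{objs: []object{{"cluster-a"}, {"cluster-b"}}},
		waitTask{objs: []object{{"cluster-a"}, {"cluster-b"}}, timeout: time.Minute},
		applyTask{objs: []object{{"pool-a"}, {"pool-b"}}},
	}
	for _, t := range queue {
		if err := t.run(); err != nil {
			fmt.Println("pipeline stopped:", err) // historically a wait timeout ended the run here
			return
		}
	}
}
```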

Example 1

For example, here's an infra dependency chain with branches (just deploying two GKE clusters with a CC cluster):

If any cluster fails to apply (e.g., an error in its config), then both node pools are blocked.
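
For reference, an explicit edge between a node pool and its cluster can be declared with the `config.kubernetes.io/depends-on` annotation, which cli-utils resolves into ordering. The Config Connector group/kinds and names below are illustrative, not taken from this issue:

```go
package manifests

// Illustrative manifest only: the node pool declares an explicit edge to its
// cluster, so the applier orders it after the cluster and waits for the
// cluster to reconcile before applying the pool.
const nodePoolManifest = `
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: pool-a
  namespace: infra
  annotations:
    config.kubernetes.io/depends-on: container.cnrm.cloud.google.com/namespaces/infra/ContainerCluster/cluster-a
`
```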

Example 2

Another example is a single apply that includes multiple CRDs and namespaces along with the resources that depend on them:

If any CRD or namespace fails to apply (e.g., blocked by a policy webhook or a config error), then all the deployments and everything else in those namespaces are blocked.
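
Roughly, the implicit ordering groups CRDs and Namespaces into an earlier phase than everything that relies on them. A hypothetical sketch of that grouping (not the cli-utils sorter):

```go
package main

import "fmt"

// obj is a stand-in for an object identity.
type obj struct{ kind, name string }

// splitPhases mimics the implicit ordering: CRDs and Namespaces go first, and
// a single blocking wait between the phases means one failed CRD or Namespace
// holds up everything in the later phase, related or not.
func splitPhases(objs []obj) (prereqs, rest []obj) {
	for _, o := range objs {
		switch o.kind {
		case "CustomResourceDefinition", "Namespace":
			prereqs = append(prereqs, o)
		default:
			rest = append(rest, o)
		}
	}
	return prereqs, rest
}

func main() {
	objs := []obj{
		{"Namespace", "team-a"},
		{"CustomResourceDefinition", "widgets.example.com"},
		{"Deployment", "team-a/app"},
		{"Widget", "team-a/widget-1"},
	}
	prereqs, rest := splitPhases(objs)
	fmt.Println("phase 1 (apply, then wait for all):", prereqs)
	fmt.Println("phase 2 (blocked until phase 1 is Current):", rest)
}
```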

Possible solutions

  1. Continue on wait timeout
  2. Dependency filter (see the sketch below)
  3. Parallel apply
  4. Async (graph) apply
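
To illustrate the dependency filter option (2 above), here is a minimal sketch with hypothetical types, not the cli-utils filter interface: skip only the objects whose declared dependencies have not reconciled, and let unrelated objects proceed.

```go
package main

import "fmt"

// depFilter decides, per object, whether to skip applying it because one of
// its dependencies has not reconciled yet. Names and types are illustrative.
type depFilter struct {
	deps      map[string][]string  // object -> its dependencies
	isCurrent func(id string) bool // has this dependency reconciled?
}

func (f depFilter) skip(id string) (bool, string) {
	for _, dep := range f.deps[id] {
		if !f.isCurrent(dep) {
			return true, fmt.Sprintf("dependency %q not reconciled", dep)
		}
	}
	return false, ""
}

func main() {
	f := depFilter{
		deps:      map[string][]string{"pool-a": {"cluster-a"}},
		isCurrent: func(id string) bool { return id != "cluster-a" }, // pretend cluster-a failed
	}
	if skip, reason := f.skip("pool-a"); skip {
		// Only pool-a is held back; objects with no edge to cluster-a still apply.
		fmt.Println("skipping pool-a:", reason)
	}
}
```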

karlkfi commented 2 years ago

@mortent @haiyanmeng @seans3 I've tried to document the motivation behind these proposals we discussed, but haven't gone into a lot of detail about how they might work.

Should we tackle all of them in top-down order? Or should we skip straight to the hardest one, which makes the others obsolete? Got other ideas that would resolve the same issues?

mortent commented 2 years ago

So I don't think we should do "Continue on wait timeout" (where we essentially continue applying all resources after a timeout) as the default at all. It effectively breaks depends-on and apply-time mutations.

The dependency filter seems like a good temporary fix if we feel async apply is complex. It should have correct behavior, although there are some drawbacks.

The parallel apply is ok if we think it significantly improves performance. Otherwise I'm not convinced it is worth the effort.

Long term I think the async apply is the best solution. It does address several issues with both behavior and performance. It probably needs a design doc to get into some more detail about how it should work.

seans3 commented 2 years ago

Morten is wrong, and we should "Continue on wait timeout". While it does not work for depends-on and apply-time mutations, these are ALREADY BROKEN. We already continue on error--just not for wait timeouts. This change would NOT make the current situation worse. But it will be strictly better than the current task execution stoppage for a wait timeout. And it will fix the long-standing bug of not running the final inventory task on wait timeout.

The correct error handling for depends-on and apply-time mutation is something we should work toward. But in the short term, "continuing on wait timeout" is clearly better than the current state. I will be merging it shortly.

As for the other priorities, I believe the async apply, since it will be the longest and hardest, should not be our first effort.

mortent commented 2 years ago

The PR in question doesn't make it the default, so I'm ok with that change. It has another issue around how we use the error event, which I have pointed out on the PR.

I don't disagree that some of the behavior is broken today; I just think we should do the more permanent fix with the dependency filter (since we all agree the async apply is a bigger change). The current behavior has been that way for a while, so I think we can take the time to fix it properly.

karlkfi commented 2 years ago

Per discussion elsewhere, we decided to change the behavior with https://github.com/kubernetes-sigs/cli-utils/pull/451 to always continue on wait timeouts (and other apply/destroy errors, as was the previous behavior).

This is a change in behavior, but the previous behavior was considered broken (or at least internally inconsistent).

Work on Dependency Filter and Parallel Apply should now be unblocked; it's mostly a question of priority. It sounds like there's disagreement on whether to do those incremental changes first or skip straight to the Async Apply.
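
As a rough illustration of what continuing on wait timeout means for the task runner (stand-in types, not the cli-utils task API): timeout errors are collected and reported, but later tasks, including the final inventory update, still run.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a stand-in for one step in the apply pipeline.
type task struct {
	name string
	run  func() error
}

// runAll keeps going after a failed task instead of terminating the pipeline,
// collecting the errors so they can still be surfaced to the caller.
func runAll(tasks []task) []error {
	var errs []error
	for _, t := range tasks {
		if err := t.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", t.name, err))
			continue // previously the pipeline stopped here on a wait timeout
		}
	}
	return errs
}

func main() {
	timeout := errors.New("reconcile timeout")
	tasks := []task{
		{"apply phase 1", func() error { return nil }},
		{"wait phase 1", func() error { return timeout }},
		{"apply phase 2", func() error { return nil }},
		{"update inventory", func() error { return nil }}, // still runs
	}
	for _, err := range runAll(tasks) {
		fmt.Println("error:", err)
	}
}
```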

karlkfi commented 2 years ago

Status update...

Continue on wait timeout has already been implemented:

We also want to add an optional Continue on Validation Error feature as part of the WIP validation redesign: https://github.com/kubernetes-sigs/cli-utils/pull/488

Dependency Filter is planned and will become more urgent once we have Continue on Validation Error.

Parallel Apply is also planned (Q1 2022), but is primarily a speed concern, not a correctness concern.
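
A minimal sketch of what parallel apply within a single phase could look like, assuming a hypothetical applyObject helper and a bounded concurrency limit (this is not the cli-utils API). Objects within a phase have no edges between each other, so ordering guarantees across phases are unchanged:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// applyObject is a stand-in for a real server-side apply call.
func applyObject(name string) error {
	fmt.Println("applying", name)
	return nil
}

// applyPhase applies the objects of one phase concurrently instead of one at
// a time, bounding the number of in-flight requests to limit server load.
func applyPhase(names []string) error {
	var g errgroup.Group
	g.SetLimit(4)
	for _, name := range names {
		name := name // capture loop variable (pre-Go 1.22)
		g.Go(func() error { return applyObject(name) })
	}
	return g.Wait()
}

func main() {
	if err := applyPhase([]string{"cluster-a", "cluster-b"}); err != nil {
		fmt.Println("phase failed:", err)
	}
}
```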

Async (graph) apply will hopefully be implemented eventually as well, because it solves multiple problems, but it hasn't been scheduled yet and may take a long time to get right.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

karlkfi commented 2 years ago

Remaining fixes are still TODO

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

karlkfi commented 2 years ago

/remove-lifecycle rotten
/lifecycle frozen

FWIW, the dependency filter was added. Also, now that client-side QPS throttling has been disabled, there’s not much value in adding parallel apply unless you want to add even more server load. The ideal solution will be to eventually rewrite the apply scheduler to use an asynchronous graph.

Unfortunately, it’s not high on the priority list yet, because of other issues. But hopefully we’ll get around to it soon.
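
For reference, a minimal sketch of what an async, graph-driven scheduler could look like, with hypothetical names and a hard-coded graph (not cli-utils code): each object waits only on its own dependencies, so a failed branch no longer blocks unrelated branches.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Toy dependency graph: each node pool depends on its own cluster only.
	deps := map[string][]string{
		"cluster-a": nil,
		"cluster-b": nil,
		"pool-a":    {"cluster-a"},
		"pool-b":    {"cluster-b"},
	}

	done := make(map[string]chan struct{}, len(deps))
	for id := range deps {
		done[id] = make(chan struct{})
	}

	var wg sync.WaitGroup
	for id, parents := range deps {
		wg.Add(1)
		go func(id string, parents []string) {
			defer wg.Done()
			for _, p := range parents {
				<-done[p] // wait only for this object's own dependencies
			}
			fmt.Println("applying", id)
			// A real scheduler would apply and then watch status here; a
			// failure would block only this object's descendants.
			close(done[id])
		}(id, parents)
	}
	wg.Wait()
}
```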