kubernetes-sigs / cli-utils

This repo contains binaries that built from libraries in cli-runtime.
Apache License 2.0
155 stars 78 forks source link

Bug: implicit namespace resolution in apply-time-mutation causes dependency sorting failure #481

Open karlkfi opened 2 years ago

karlkfi commented 2 years ago

Resources:

Output (from kpt using cli-utils master):

$ kpt live apply
namespace/test unchanged
1 resource(s) applied. 0 created, 1 unchanged, 0 configured, 0 failed
pod/pod-a created
pod/pod-b apply failed: failed to mutate "test_pod-b__Pod" with "ApplyTimeMutator": failed to read field ($.status.podIP) from source resource (/namespaces/test/Pod/pod-a): expected 1 match, but found 0)
2 resource(s) applied. 1 created, 0 unchanged, 0 configured, 1 failed
error: 1 resources failed

Problem:

karlkfi commented 2 years ago

Explicit dep with depends-on works, so it's just apply-time-mutation that doesn't (weird!).

$ kpt live apply
namespace/test unchanged
1 resource(s) applied. 0 created, 1 unchanged, 0 configured, 0 failed
pod/pod-a created
1 resource(s) applied. 1 created, 0 unchanged, 0 configured, 0 failed
pod/pod-b created
1 resource(s) applied. 1 created, 0 unchanged, 0 configured, 0 failed
karlkfi commented 2 years ago

apply-time-mutation works if the source namespace is specified. So it's just the implicit namespace resolution that breaks graph sorting.

karlkfi commented 2 years ago

Root cause: graph.Sort only uses mutation.ReadAnnotation, which returns an empty Namespace, if not specified by the user. The implicit namespace resolution is performed in ApplyTimeMutator.Mutate.

Unfortunately, the easy solution of moving implicit namespace resolution into mutation.ReadAnnotation doesn't work, because it needs to know whether the Source and Target are both namespaced. This would require using the mapper to resolve the resource schemas. If we try to use the mapper in graph.Sort it would fail for resources that haven't had their CRD applied yet.

So this is another issue with mapping lookups happening up front instead of lazily as-needed. Changing that, to resolve sort order after every apply, would be a significant change...

Also unfortunate is that we can't reject SourceRefs without an explicit namespace without knowing if the resource is namespaced or not...

karlkfi commented 2 years ago

Possible workarounds:

  1. Resolve mapping up front, if possible, and error at apply time, if not
    • Add a ResourceReference field to specify if the resource is cluster or namespace scoped.
    • Perform implicit namespace resolution before sorting (to avoid needing to inject the mapper into graph.SortObjs).
    • If there's no schema and the namespace is empty and the resource is namespace scoped, it will still error at apply time, but all other cases will work as expected.
  2. Disable implicit namespace resolution
    • Error at apply time if namespace is empty and resource is namespace scoped (makes the apply error more actionable, but requires namespace to always be specified)
  3. Change sorting to only sort one stage at a time (dynamic task scheduling).
    • Add a SortTask after every Apply+Wait and Delete+Wait. This new task would determine the next tasks to execute and add them to the task queue.
    • Replace graph.SortObjs with a graph.Next (or similar) that just returns the next UnstructuredSet to apply/delete.
  4. Resolve implicit namespace by checking against the UnstructuredSet being sorted,
    • Lookup the resource both without the namespace and with the namespace, to see if there is a match.
    • If there's no match, skip adding the graph edge and log a warning
    • This only works for dependencies in the resource set (not external deps: https://github.com/kubernetes-sigs/cli-utils/pull/412)
haiyanmeng commented 2 years ago
  1. Disable implicit namespace resolution

Today, all the namespaced objects sent from Config Sync to cli-utils have the metadata.namespace field explicitly set. Is it reasonable to expect this from other users of cli-utils ?

karlkfi commented 2 years ago

Today, all the namespaced objects sent from Config Sync to cli-utils have the metadata.namespace field explicitly set. Is it reasonable to expect this from other users of cli-utils ?

cli-utils already requires this.

This problem here is with apply-time-mutation annotations which were initially designed to support inferring the namespace of dependencies from the namespace of the dependent object. However, it does look like disabling implicit namespace resolution for apply-time-mutation is the easier path forward, and would align with the requirement on the objects themselves.

haiyanmeng commented 2 years ago

cli-utils already requires this.

I don't think this is true. I tested with kapply built from the master, which doesn't require the metadata.namespace field of a Deployment object to be explicitly set.

karlkfi commented 2 years ago

The Applier runs the Valdator which runs the validateNamespace method, passing in the list of CRDs in the set and the mapper.

The valdiator should error if it's namespace-scoped with no namespace or cluster-scoped with a namespace.

https://github.com/kubernetes-sigs/cli-utils/blob/master/pkg/object/validate.go#L124

haiyanmeng commented 2 years ago

Accurately speaking, Applier.Run requires the metadata.namespace field of a namespaced object to be explicitly set.

kapply does not require this, since it sets the metadata.namespace field of a namespaced object if it is missing.

karlkfi commented 2 years ago

Thanks for clarifying. I don't know if having kapply do that really helps us test, but it might be trying to duplicate kpt and kubectl behavior.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

karlkfi commented 2 years ago

/remove-lifecycle rotten

karlkfi commented 2 years ago

Short term fix is to disable implicit namespaces, since they don't work as-is: https://github.com/kubernetes-sigs/cli-utils/pull/482

Long term fix is unknown.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

karlkfi commented 2 years ago

/remove-lifecycle rotten /lifecycle frozen