kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Consider defaulting logic in pod.Spec.SecurityContext.FSGroupChangePolicy #95587

Closed: gnufied closed this issue 2 years ago

gnufied commented 3 years ago

When we introduced this field in alpha we documented this as:

" // Valid values are "OnRootMismatch" and "Always". If not specified defaults to "Always"."

We aren't doing this, strictly speaking. But if we did set the default to "Always", then it has implications for controllers who are hashing pod spec, because a k8s upgrade will cause all of these controllers to roll out a pod update now, which is not very desirable.
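For concreteness, a minimal sketch of what defaulting to "Always" would mean in code if we chose to do it (illustrative only; the function and package names are not real code in the tree):

package v1defaults // hypothetical location, for illustration only

import v1 "k8s.io/api/core/v1"

// setDefaultFSGroupChangePolicy forces the documented default when the field is
// unset. This is NOT what ships today, which is the point of this issue.
func setDefaultFSGroupChangePolicy(sc *v1.PodSecurityContext) {
	if sc != nil && sc.FSGroupChangePolicy == nil {
		always := v1.FSGroupChangeAlways
		sc.FSGroupChangePolicy = &always
	}
}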

The discussion from Slack:

thockin: @liggitt wrt defaulting fields under Pod - are we hard-banning this? The whole API defaulting-per-version mechanism will kind of come apart if we do
liggitt: there are two problematic consumers of pod specs: 1) the kubelet mechanism for determining if a container needs to be restarted, 2) controller mechanisms for determining if a new rollout is required
liggitt: if those were resolved, we could allow it again
liggitt: defaulting-per-version doesn't impact pods until we have pods v2, right?
thockin: Well, we have PRs that are adding policy fields that transitively end up under pod
thockin: so would trigger the above
thockin: The question here is really - are we on crack thinking that we can allow new fields with default values which formalize the previously unstated behavior? Or do we need to say that all such changes must have "not-specified" mean "whatever used to be implied".
thockin: so versioned->internal would convert nil -> "ExplicitValue" and the same in reverse
thockin: for v1
msau42: is this problem specific to pods, or for any resource that is templated, for example, statefulset pvc templates?
thockin: It's particular to pods only because there are things that hash pod
thockin: but it's not actually particular
thockin: in general, how can anyone know that 2 reads of the same object are the same when a new field shows up? Short of open-coding JUST the fields you care about (which will end up badly, too) you can't
(gnufied joined #api-reviews.)
thockin: So maybe the right answer is to leave it nil in existing versions, make it explicit on conversion to internal, and nil on the way back to versioned. Adding a v2 AFTER a field has been introduced can add an explicit default
thockin: This would be a big change in guidance overall, but seems the only way to ensure the property that 2 reads of the same resource always produce the same non-status value
msau42: what is the purpose of making it explicit in internal if the versioned read is nil?
thockin: Because 2 different API versions could have different defaults.
msau42: so if v1 default was X and v2 default was Y, then if a consumer reads a v1 object that was created with v1, they get back nil. But if they read v1 of an object that was created with v2, they get back Y?
thockin: Yes.
thockin: in v1, not-specified means X, but Y means Y.
thockin: we already have a rule that if you add a field, it must have the same default in all EXISTING versions
thockin: so if v2 default is Y, that implies v2 was added AFTER the field, meaning X and Y would be explicit.
thockin: as I said - this is significantly different than existing guidance, so I will give @liggitt time to think and circle back to this. I dislike forcing clients to encode implicit defaults, but that's the tradeoff.
liggitt: sorry, got sidetracked (edited)
thockin: unless we make even deeper machinery changes
thockin: Sure, it's not like you're busy or anything :rolling_on_the_floor_laughing:
liggitt: need to stew on it, will copy this into a more permanent location for stewing
gnufied: for some context @liggitt https://github.com/kubernetes/kubernetes/pull/88488#discussion_r387292464 - we were trying to decide if it makes sense to add defaulting logic to the alpha field being introduced in pod.spec.securitycontext
thockin: ACK @liggitt
thockin: I will stew on it too, see if we can think up something more palatable.
gnufied: is there any in-the-meanwhile guidance for this PR? I have the commit that adds defaulting logic.
thockin: for alpha, I guess you can skip defaulting, then. If you are stuck, just step past this
thockin: but if we decided to solve this, we should solve it across the board :+1:

(the next day)

thockin: FWIW I have been stewing on this API defaulting thing since yesterday, and I still don't see any better solution. I'm hoping to write it up in the form of a PR against API docs for debate.
thockin: Oh crikey - the implications of this are awful. The guidance on pluralization, for example, is now wrong.
thockin: Uggh, pluralizing is fugly - don't know if client originally specified both or not.
thockin: This whole doc needs a rewrite
thockin: @liggitt WIP to illustrate some of the ugly: https://github.com/kubernetes/community/pull/4571 (edited)

Also https://github.com/kubernetes/community/pull/4571

k8s-ci-robot commented 3 years ago

@gnufied: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
gnufied commented 3 years ago

cc @kubernetes/sig-storage-misc /sig storage

jingxu97 commented 3 years ago

Since there are only two values in this new field, would it be better to make it a bool value, e.g. *enableFSGroupOnRootMismatch, defaulting to nil?
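For reference, the two shapes under discussion; the enum below follows what exists in k8s.io/api/core/v1 today, while the bool field is only the hypothetical alternative suggested above:

package example // illustration only; the enum declarations follow k8s.io/api/core/v1

// PodFSGroupChangePolicy holds policies for applying fsGroup ownership changes to a volume.
type PodFSGroupChangePolicy string

const (
	// FSGroupChangeOnRootMismatch changes ownership/permissions only when the root
	// of the volume does not already match the expected fsGroup.
	FSGroupChangeOnRootMismatch PodFSGroupChangePolicy = "OnRootMismatch"
	// FSGroupChangeAlways always changes ownership/permissions of the volume.
	FSGroupChangeAlways PodFSGroupChangePolicy = "Always"
)

// Existing field on PodSecurityContext (pointer to the enum; nil = unspecified):
//     FSGroupChangePolicy *PodFSGroupChangePolicy `json:"fsGroupChangePolicy,omitempty"`
//
// The suggested alternative (hypothetical field name, never added):
//     EnableFSGroupOnRootMismatch *bool `json:"enableFSGroupOnRootMismatch,omitempty"` // nil = unspecified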

thockin commented 3 years ago

I just re-read this and the context and I still hate it all. The only good answer is to not hash blindly. I don't know where that code is or whether it could be easily fixed (or fixed at all). I guess we could try to not call defaults on embedded PodSpecs (or at least minimize them).

@liggitt @lavalamp who else can we drag in on this? I think we are now (or will soon be) blocking PRs and features.

liggitt commented 3 years ago

The kubelet hashing is described in https://github.com/kubernetes/kubernetes/issues/63814

Controller hashing/comparison mechanisms are pretty unique to each controller. I know I swept them all a year or so ago and summarized the current state (some hash, some deepequal, some compare specific fields), but I can't find the summary now.

Limiting defaulting to Pods (as opposed to PodSpecs) would limit the issues to the ones described in https://github.com/kubernetes/kubernetes/issues/63814, which are probably more tractable to solve, since the kubelet mechanism can be changed/improved on a minor version.

gnufied commented 3 years ago

@liggitt https://github.com/kubernetes/kubernetes/issues/63814 looks more about the container spec, but this change does not affect that; ideally, even if defaults are applied, it won't cause a container restart.

For now - in my PR https://github.com/kubernetes/kubernetes/pull/96376/files#diff-19b27e79a2dcb337f58b0f7f3477a29d29498b5d135fbb67e9f858cf1dae222cR2968 I have avoided calling out the default enum value explicitly.

@jingxu97 I would prefer to keep the enum for this because there is a chance it will be expanded, and given that the kubelet and API server can handle new values, it might be okay.

lavalamp commented 3 years ago

"Always", then it has implications for controllers who are hashing pod spec

I feel like a broken record, but controllers simply should not be doing this. It's not even good enough to only hash fields you care about, because mutating admission webhooks.
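To make the failure mode concrete, a toy illustration (not code from any real controller or the kubelet) of how whole-spec hashing reacts when a read starts returning a newly defaulted field:

package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// hashSpec is the fragile pattern under discussion: hash the entire serialized spec.
func hashSpec(spec *v1.PodSpec) string {
	b, _ := json.Marshal(spec)
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	spec := v1.PodSpec{Containers: []v1.Container{{Name: "app", Image: "nginx"}}}
	before := hashSpec(&spec)

	// After an API server upgrade (or a mutating admission webhook), a read of the
	// same object can come back with a field populated that the user never set.
	policy := v1.FSGroupChangeAlways
	spec.SecurityContext = &v1.PodSecurityContext{FSGroupChangePolicy: &policy}
	after := hashSpec(&spec)

	fmt.Println(before == after) // false: the controller concludes intent changed and rolls out
}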

thockin commented 3 years ago

I looked quickly at defaulting fields under Pod only when they are actually under a Pod and, of course, it's not easy. Some types are used in other places (e.g. PVs), so simply merging the defaulting function into the pod function will not work. Adding a typedef fails for other reasons (API rule violation: list_type_missing...)

We can't have a permanent moratorium on this

lavalamp commented 3 years ago

We can't have a permanent moratorium on this

On what?

Improving this by changing defaults just isn't going to work due to mutating admission webhooks being able to make arbitrary changes.

Controllers have to tolerate unexpected defaults.

thockin commented 3 years ago

I meant that we can't have a moratorium on any new fields with default values that are transitively under Pod. This is effectively what we have today.

lavalamp commented 3 years ago

We should not have that moratorium today; controllers broken by that will (probably) also be broken by a mutating admission webhook.

liggitt commented 3 years ago

There are two scenarios that have to be fixed before this can be relaxed:

  1. API server upgrade triggers new rollout of a workload with no changes/writes to the workload object, due to a delta between the read object and the snapshotted controller revision. An example was the procMount defaulting which caused a daemonset or deployment to replace all its pods on API upgrade.

  2. API server upgrade triggers restart of containers due to the kubelet detecting a delta between the serialized v1.Container read from the API and the one used to start the local container because of a new default, with no changes/writes to the pod. I think this is way worse, because I think it means an API server upgrade could make all containers in all replicas simultaneously restart.

Note that neither of those scenarios involves problematic assumptions that admission won't modify fields, etc. (I agree any controller assuming that is already broken.)

lavalamp commented 3 years ago

Sure, defaults can "modify" immutable fields (you can also get that behavior with multi-version CRDs plus conversion webhooks, I think).

API server upgrade triggers new rollout of a workload with no changes/writes to the workload object, due to a delta between the read object and the snapshotted controller revision. An example was the procMount defaulting which caused a daemonset or deployment to replace all its pods on API upgrade.

Whatever the controller is doing to detect "a delta between the read object and the snapshotted" is broken for my case also, I agree it needs to be fixed. I believe an admission webhook + no-op write (e.g., triggered by a storage migrator) could trigger this, not just a default change.

kubelet detecting a delta between the serialized v1.Container read from the API and the one used to start the local container because of a new default

Containers should be immutable except for specific fields, so it's strange that we managed to get this one wrong.

Fundamentally controllers and controller-like things have to do diffing at a semantic level and not a representational level. My claim is that these things are already broken and are ticking time bombs (at least the first one). I think Jordan's claim is that we shouldn't punish users for mistakes made by controllers, since we can avoid triggering those controller bugs. I personally take the iterated view, i.e. the only way to really long term not hurt users is to get those controllers fixed. (Sadly in this case I think tickling this bug will cause a rollback rather than the controller getting fixed, so I agree with Jordan in practice.)

Now that dry-run exists, controllers could use it to canonicalize data for at least some objects. kubelet is on its own, though.
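A hedged sketch of that dry-run canonicalization idea with client-go (modern typed-client signatures; a real controller would also handle conflicts and RBAC):

package canonicalize

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dryRunCanonicalize round-trips a deployment through the apiserver with DryRun,
// so defaulting and mutating admission are applied without persisting anything.
// Comparing against this canonical form avoids false diffs from new defaults.
func dryRunCanonicalize(ctx context.Context, cs kubernetes.Interface, d *appsv1.Deployment) (*appsv1.Deployment, error) {
	return cs.AppsV1().Deployments(d.Namespace).Update(ctx, d.DeepCopy(), metav1.UpdateOptions{
		DryRun: []string{metav1.DryRunAll},
	})
}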

thockin commented 3 years ago

Containers should be immutable except for specific fields, so it's strange that we managed to get this one wrong.

Think of a new field that makes some existing behavior modifiable. The default value codifies "how it's always been done" but a read of an old object will suddenly start showing this field. We have always allowed such API changes, modulo gating and things like that.

Fundamentally controllers and controller-like things have to do diffing at a semantic level and not a representational level.

So I am looking for how we can ACTUALLY make progress here. Whose P0 is it to fix this (might be multiple) and are there any cross-cutting things we can do to make this easier?

liggitt commented 3 years ago

Containers should be immutable except for specific fields, so it's strange that we managed to get this one wrong.

The kubelet limiting its container comparison to specific fields it thinks are mutable (currently just image?) would certainly help the "simultaneous uncoordinated restart of all deployment containers in the cluster" scenario a lot.

It is interesting to consider what would happen when a kubelet's view of which container fields are mutable does not match the API server's. There's a current proposal for relaxing immutability restrictions on resource quantities... I think a kubelet which did not expect those to change would just ignore the change and leave the container running with the old resources. That seems like a better outcome than an uncontrolled restart of the container.
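In sketch form, that field-limited comparison might look like the following (a hypothetical helper, not the kubelet's actual code path, which hashes the whole container today):

package kubeletsketch

import v1 "k8s.io/api/core/v1"

// containerNeedsRestart compares only the fields the kubelet treats as mutable
// (essentially just the image), so a newly defaulted field elsewhere in the
// container spec cannot force a restart on API server upgrade.
func containerNeedsRestart(running, desired *v1.Container) bool {
	return running.Image != desired.Image
}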

So I am looking for how we can ACTUALLY make progress here. Whose P0 is it to fix this (might be multiple) and are there any cross-cutting things we can do to make this easier?

One test that could flush out controller issues in the presence of new defaults would be to:

  1. create a workload object
  2. let the controller actuate it and stabilize
  3. mess with whatever the controller used to record the "last actuated" pod template (e.g. a ControllerRevision) and intentionally drop a defaulted field from it
  4. touch a non-spec field in the workload object (e.g. an annotation) to force a controller sync
  5. verify no new rollout is done

That seems like it could be done in a table test for all in-tree controllers (daemonset, deployment, replicaset, replication controller, statefulset, job, cronjob)
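One possible shape for that test; everything named here is hypothetical scaffolding rather than existing test code, and each controller would plug in its own fixture:

package rollouttest

import "testing"

// workloadFixture is a hypothetical per-controller harness that each in-tree
// workload controller (deployment, daemonset, statefulset, ...) would implement.
type workloadFixture interface {
	CreateWorkload(t *testing.T)                 // 1. create the workload object
	WaitForStable(t *testing.T)                  // 2. let the controller actuate it and stabilize
	DropDefaultedFieldFromRevision(t *testing.T) // 3. drop a defaulted field from the recorded template (e.g. ControllerRevision)
	TouchAnnotation(t *testing.T)                // 4. touch a non-spec field to force a controller sync
	RolloutHappened(t *testing.T) bool           // 5. observe whether a new rollout occurred
}

// runNoRolloutOnNewDefaultedField is the shared body each controller's table test would call.
func runNoRolloutOnNewDefaultedField(t *testing.T, fx workloadFixture) {
	fx.CreateWorkload(t)
	fx.WaitForStable(t)
	fx.DropDefaultedFieldFromRevision(t)
	fx.TouchAnnotation(t)
	if fx.RolloutHappened(t) {
		t.Fatalf("rollout was triggered by a defaulted-field delta with no user change")
	}
}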

lavalamp commented 3 years ago

Containers should be immutable except for specific fields, so it's strange that we managed to get this one wrong.

Think of a new field that makes some existing behavior modifiable. The default value codifies "how it's always been done" but a read of an old object will suddenly start showing this field.

Yes, so such a field certainly should not be considered semantically meaningful by a component that won't act differently for the particular values!

We have always allowed such API changes, modulo gating and things like that.

Yes, they are safe if controllers behave correctly.

lavalamp commented 3 years ago

So I am looking for how we can ACTUALLY make progress here. Whose P0 is it to fix this (might be multiple)

Indeed, it is everyone's, so it is no one's. Hence my theory of repeatedly breaking it until it gets fixed in the right place. (which works better with slightly lower stakes)

I think Jordan's test is hard to write generically because we don't generically understand the input/output of controllers. We could insert a fake new defaulted field in the schema to generalize part of it, rather than finding a defaulted field to drop (which isn't a great test since you could fix the problem for single fields without fixing it for fields you haven't seen before).

Maybe the most useful thing is actually a unit test framework for adjusting the schema of input objects. I really think individual controller authors are going to have to test, because everyone's side effects are undesired in their own specific way.

thockin commented 3 years ago

It is interesting to consider what would happen when a kubelet's view of which container fields are mutable does not match the API server's. There's a current proposal for relaxing immutability restrictions on resource quantities... I think a kubelet which did not expect those to change would just ignore the change and leave the container running with the old resources. That seems like a better outcome than an uncontrolled restart of the container.

I'm not sure about that - it's a single pod, so any restart is equivalent to any other, more or less. But I am pressing for a more robust ACK protocol in this specific case, so ignoring changes to resources should be correct, and while it is a bit opaque as to whether/when the kubelet will actuate that change (it won't if it is back-rev), it will be unambiguous whether it DID accept the change.

thockin commented 3 years ago

Are there open bugs against sig-node and sig-apps that they are broken and are now impeding progress in the project?

liggitt commented 3 years ago

Are there open bugs against sig-node and sig-apps that they are broken

updated title/description of https://github.com/kubernetes/kubernetes/issues/63814 for sig-node, changed kind to bug, and increased priority

kow3ns commented 3 years ago

The commonality between all workloads controllers is that they retrieve the declared user intent from the API server and attempt to reconcile it with what they observe (believe) the state of the system to be. We can't look directly at Pods for the state of workload configuration because of defaulting and mutating admission control (i.e. when I create a Pod, the Pod observed in the API machinery will look nothing like the one specified in the .Spec.Template). So instead, we try to be sensitive to modifications to the .Spec.Template of the workload resource (i.e. we look for modifications to the user's intent).

If this too can arbitrarily change, at any point in time, without a modification from the user with respect to their declared, desired intent, and if SemanticDeepEquals can't be trusted to report that the user's declared intent has been modified in the API machinery, how are controller authors, for in-tree or out-of-tree controllers, supposed to reason about change detection? One approach would be for each controller to register interest, via some TBD mechanism or library, at a per-field level that is more granular than .Spec.Template, and trigger reconciliation only on modifications to that "of interest" set (i.e. all controllers attempt to determine not just whether the declared specification has changed, but also whether that change is relevant).

If we take that approach, then imo we should have a uniform mechanism that enables this throughout the codebase, as opposed to ad hoc approaches in-tree and bespoke implementations out of tree. Again, biasing for workloads controllers, the number of "interesting fields" is large and generally common across all workloads APIs (and probably more broadly?). When a new field is added, I think ensuring that we don't have to update every controller, and that we can have a reasonable expectation of uniform behavior without explicit code modifications, should be a goal.
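A hedged sketch of the per-field "interest" mechanism described above (no such library exists in-tree; every name here is illustrative):

package interest

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// FieldExtractor pulls one value of interest out of a pod template.
type FieldExtractor func(*v1.PodTemplateSpec) interface{}

// InterestSet lists the template fields a controller actually acts on.
type InterestSet []FieldExtractor

// Changed reports whether any field of interest differs between the recorded
// intent and the freshly observed template. Fields outside the set - including
// newly defaulted ones - are ignored, so they cannot trigger a spurious rollout.
func (s InterestSet) Changed(recorded, observed *v1.PodTemplateSpec) bool {
	for _, extract := range s {
		if !reflect.DeepEqual(extract(recorded), extract(observed)) {
			return true
		}
	}
	return false
}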

thockin commented 3 years ago

Fundamentally this comes down to how we do defaulting and re-use types. If Deployment didn't use PodSpec, it would not get the defaults for PodSpec. But then we'd have to track 2 different podSpec types that have all the same fields.

On a lark I did this:

$ git diff staging/src/k8s.io/api/core/v1/types.go
diff --git a/staging/src/k8s.io/api/core/v1/types.go b/staging/src/k8s.io/api/core/v1/types.go
index 6465fc53959..bfed73f500a 100644
--- a/staging/src/k8s.io/api/core/v1/types.go
+++ b/staging/src/k8s.io/api/core/v1/types.go
@@ -3707,9 +3707,11 @@ type PodTemplateSpec struct {
        // Specification of the desired behavior of the pod.
        // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
        // +optional
-       Spec PodSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
+       Spec TemplateSpecHack `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
 }
 
+type TemplateSpecHack PodSpec
+
 // +genclient
 // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

And then I re-made generated files. Aside from a bunch of "API rule violation" errors which seem to be excepted for pod spec (we should fix those?) a couple things emerged.

First, all of the autogenerated calls to corev1.SetDefaults_PodSpec(&in.Spec.Template.Spec) went away. That's close to what we want.

Second, conversion generation kind of lost its mind and generated Convert_v1_TemplateSpecHack_To_v1_TemplateSpecHack() which does a field-by-field copy. I am not sure why, but I am sure it could be fixed.

Third, deepcopy also lost its mind.

Changing the internal type to the typedef generated more reasonable output, but missed the opportunity to reuse the existing functions.

To the first point, I think it's likely that there are some defaults we REALLY DO want to apply to every PodSpec regardless of where it is used, and some defaults we want to apply only on "leaf" definitions. If I define a defaulter func for the new type, it will be called where the PodSpec defaulter used to be called. We could build our own sort of "inheritance tree", manually - maybe even automatically. I poked at the defaulter but a) Underlying was not set and b) it has changed a lot since I last touched it. ISTR that Go doesn't track intermediate typedefs.

I have no idea what else would break in doing this. My point was that we have a mess of our own making, and even if the "right" fix is for controllers to be more judicious in what they hash, we could maybe attack it this way, too.

lavalamp commented 3 years ago

type TemplateSpecHack PodSpec

I think we should not use a type alias. It breaks the tooling but doesn't get us the other thing that might be useful: distinct documentation between pod spec and template pod spec.

Instead, I think we should just literally copy the type. We (well, at least @bgrant0607) have talked about this for ages and just never done it.

I don't think it'd be too bad, I think the conversion generator can generate "conversion" functions if we can invoke it the right way. We can write tests to ensure that the fields stay in lock step between the pod types and the template types.

Note we can do this somewhat at our leisure, copying only the top level type, and then proceeding to copy over intermediate and leaf types as we want to change their defaults or documentation.
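The lock-step check mentioned above could be a small reflection test along these lines (a sketch; the copied type does not exist yet, so a locally defined stand-in plays its part):

package apitest

import (
	"reflect"
	"testing"

	v1 "k8s.io/api/core/v1"
)

// podTemplatePodSpec stands in for the proposed copied type; in the real change
// it would be a literal copy of PodSpec maintained next to PodTemplateSpec.
type podTemplatePodSpec v1.PodSpec

// fieldSet captures name -> type+tag for every field of a struct type.
func fieldSet(t reflect.Type) map[string]string {
	out := map[string]string{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		out[f.Name] = f.Type.String() + " " + string(f.Tag)
	}
	return out
}

// TestTemplateSpecInLockStepWithPodSpec fails if the two types ever drift apart.
func TestTemplateSpecInLockStepWithPodSpec(t *testing.T) {
	got := fieldSet(reflect.TypeOf(podTemplatePodSpec{}))
	want := fieldSet(reflect.TypeOf(v1.PodSpec{}))
	if !reflect.DeepEqual(got, want) {
		t.Errorf("template pod spec fields have drifted from PodSpec")
	}
}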

thockin commented 3 years ago

I agree overall. It seems obvious for small types. For "big" types like this, I worry about drift between them. Maybe some metafile + a test to list which fields are allowed to differ between the types. E.g.

baseType: PodSpec
derivedTypes:
  - name: PodTemplatePodSpec
    diffs:
      - add: Foo
      - remove: Bar
  - name: OtherUsePodSpec
    diffs:
      - remove: RestartPolicy

kow3ns commented 3 years ago

To clarify: are we suggesting adding one additional type that aliases or aggregates PodSpec, which all controllers will use? We are not suggesting a unique type (e.g. ReplicaSetPodTemplate, StatefulSetPodTemplate, etc.) for each resource that presently aggregates PodSpec into its .Spec.Template to represent the user's declared desired state?

smarterclayton commented 3 years ago

I don't think it'd be too bad, I think the conversion generator can generate "conversion" functions if we can invoke it the right way. We can write tests to ensure that the fields stay in lock step between the pod types and the template types.

Conversion of equivalent types is now done automatically by go (once they fixed the "structs with identical fields but different tags are still identical" spec issue).
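The Go behavior being referenced, in toy form (placeholder types; in the real change the two sides would be PodSpec and its copy):

package main

import "fmt"

// Two structurally identical struct types whose field tags differ. Since Go 1.8,
// tags are ignored for conversion, so one can be converted to the other directly.
type podish struct {
	Name  string `json:"name"`
	Image string `json:"image,omitempty"`
}

type templatePodish struct {
	Name  string `json:"name"`
	Image string `json:"image"` // different tag is fine for conversion
}

func main() {
	tpl := templatePodish{Name: "app", Image: "nginx"}
	p := podish(tpl) // direct conversion; no field-by-field copy generated or written
	fmt.Println(p.Name, p.Image)
}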

Helpers are going to need an interface (like ObjectMeta, probably) of some form in order to work. In general our helper functions are already ugly ducklings.

lavalamp commented 3 years ago

Yes, we should make a test. I don't think it's a hard test as long as we require that the fields be 1:1.

If we ever permit the fields to be different in the template and the pod spec, then we basically need to write a constructor and test that.

not suggesting a unique type (e.g. ReplicaSetPodTempate, StatefulSetPodTemplate, etc)

I was thinking of a single PodSpecTemplate type. There would be (can be) no wire difference. I'd need to see a really compelling reason to let different controllers have different template types, that sounds like a ton of overhead.

Note that we cannot change the defaulting for existing fields as this would be a behavior change. This mechanism can only be used to make a newly added field have different behavior between pod & template.

This is more immediately useful for documentation differences.

liggitt commented 3 years ago

I don't think you can easily get away with a shallow type clone... the defaulting methods are associated with deeply nested subtypes (e.g. container security context).

These are the areas we'd see impact:

lavalamp commented 3 years ago

Ugh,

smarterclayton commented 3 years ago

Validation is already a bit of a disaster w.r.t. copy and paste; I don't want to make it worse (no one will keep the copies in sync unless we automate it, and if we automate it, we shouldn't need to duplicate).

Type transformation, if the types are identical, should be cheap in the Go compiler, but we might need to test for it.

There's an argument here w.r.t. docs that our current approach for docs on Go types is a good start but not sufficient. Having to clone the types to get the docs right seems like the tail wagging the dog. Maybe variations in type docs could be approached differently. For instance, maybe there's some merit to the parts of the IDL proposal that hoist certain data off Go types into something above this level.

thockin commented 3 years ago

If we ever permit the fields to be different in the template and the pod spec, then we basically need to write a constructor and test that.

This pattern will apply in other places (I'm thinking about probes, for example) and there is value in saying "this is supposed to be the same except missing one field".

I was thinking of a single PodSpecTemplate type. There would be (can be) no wire difference. I'd need to see a really compelling reason to let different controllers have different template types

Agree, but it seems like a slippery slope. Once precedent is set, I expect new uses of this pattern will be found.

Note that we cannot change the defaulting for existing fields as this would be a behavior change.

Pedantic - we cannot change user-visible behavior. We could, for instance, change the internals if it makes our structure easier.

I don't think you can easily get away with a shallow type clone... the defaulting methods are associated with deeply nested subtypes

We could re-parent the defaulting logic into SetDefaults...() for the top-most spec types. It is less "clever" but preserves user-facing behavior and breaks the deep type coupling.

Excellent points on the rest - these types are LARGE. I don't feel excited about cloning the validation logic (for example). Have some third definition that is the common core, which can be trivially converted to for validation/helpers?

https://play.golang.org/p/g5tIp0DEDmS

For instance, maybe there's some merit to the parts of the IDL proposal that hoist certain data off go types into something above this level.

Yes, my mind went here, too. Embedded types with per-field doc overrides?

lavalamp commented 3 years ago

Type transform if the types are identical should be cheap in go compiler

Relying on type transform would still require a bunch of casts in many places, I think? Not sure it makes things significantly different.

Pedantic - we cannot change user-visible behavior. We could, for instance, change the internals if it make our structure easier.

In theory yes, but in practice it's extremely hard to be sure that an internal change won't cause an obscure behavior change.

kow3ns commented 3 years ago

If we are doing this for PodTemplateSpec, we should dig ourselves out of the hole early w.r.t. StatefulSet .Spec.VolumeClaimTemplates as well.

kow3ns commented 3 years ago

Validation: all the workload types reuse the super-twisty podspec validation... would you envision those validation methods converting from PodTemplateSpec to PodSpec and then continuing to use the PodSpec validation?

If we have drift between what a valid PodTemplateSpec and a valid PodSpec are, we will end up with wedged controllers that have a declared desired state that is "valid" per admission control but that are unable to create Pod resources and make progress. It will be very difficult (imo) to surface the right error conditions to users to help them unwedge their clusters under those circumstances. I.e., a valid PodTemplateSpec should never create an invalid Pod resource.

thockin commented 3 years ago

A valid PodTemplateSpec should never create an invalid Pod resource

I think that's a given. What isn't as clear to me is that every field that is configurable in a Pod must be exposed in a pod template. In fact, we know that there are validations that are different between them, fields that are N/A or must have particular values.

We can't retroactively change those (I think? maybe some), but we should not add more.

smarterclayton commented 3 years ago

A valid PodTemplateSpec should never create an invalid Pod resource.

I think this is impossible, because you can have people implementing workload controllers that mature at different rates and have different pod specs, kubelets, and pod template spec levels. If you mean just in Kube: it's trivial (and allowed) to create a webhook that breaks this assumption, so it's not as if the controllers can bypass all the logic that communicates well to users when pods are rejected. There are plenty of valid but non-working pod templates; I don't want to imply we should loosen validation, but I don't think we can say with certainty that a valid pod template ensures a valid pod.

(edit: i may be steelmanning your argument in my head though :))

lavalamp commented 3 years ago

A valid PodTemplateSpec should never create an invalid Pod resource.

I think this is impossible

It is clearly impossible in theory, even without splitting the types!

In practice, however, it might be possible. Splitting the types would slightly increase the chance of a template that can't be instantiated triggering a hot-loop retry bug in a controller.

IMO the way controllers should address this is via a webhook admission controller: when the template object is created, attempt a dry-run POST of the instantiated template to catch problems with it as early as possible. This isn't 100% certain to find all problems, but I think it's going to catch all ordinary causes of problems in steady state. (The alternative way to handle this is to notice when the template isn't working at all, set a status condition, and retry only very slowly. That's slightly less user friendly.)

smarterclayton commented 3 years ago

The alternative way to handle this is to notice when the template isn't working at all and set a status condition and retry only very slowly. That's slightly less user friendly.)

Slightly less user friendly, but FAR more broadly applicable, since that can happen for other reasons already (bad webhook, pod security, quota, etc etc etc) and getting really good at this closes off a whole class of problems (worse is better). Although I think we're pretty good at this today because if we aren't it almost always brings down the controller manager (some workloads lack conditions though and should get them).

lavalamp commented 3 years ago

I mean, why not both? :)

kow3ns commented 3 years ago

IMO the way controllers should address this is via webhook admission controller: when the template object is created, attempt a dry-run POST of the instantiated template, to catch problems with it as early as possible.

Now that we have server-side dry run at stable (since 1.18 iirc), it seems reasonable to do a dry run prior to creation.

Although I think we're pretty good at this today because if we aren't it almost always brings down the controller manager (some workloads lack conditions though and should get them)

We added conditions to all the workloads API Status types (for built-ins) prior to v1 promotion, but we don't universally populate them. We could (should) use them to communicate invalid desired state specification.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

thockin commented 3 years ago

If we made most of the validation declarative, then maybe deep-cloning (near-complete unsharing) isn't so bad?

E.g. let's look at probes (since it is smaller). We find this in Container:

    LivenessProbe *Probe `json:"livenessProbe,omitempty" protobuf:"bytes,10,opt,name=livenessProbe"`

    ReadinessProbe *Probe `json:"readinessProbe,omitempty" protobuf:"bytes,11,opt,name=readinessProbe"`

    StartupProbe *Probe `json:"startupProbe,omitempty" protobuf:"bytes,22,opt,name=startupProbe"`

Then Probe is defined as:

type Probe struct {
    // The action taken to determine the health of a container
    Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"`
    // Number of seconds after the container has started before liveness probes are initiated.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty" protobuf:"varint,2,opt,name=initialDelaySeconds"`
    // Number of seconds after which the probe times out.
    // Defaults to 1 second. Minimum value is 1.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    TimeoutSeconds int32 `json:"timeoutSeconds,omitempty" protobuf:"varint,3,opt,name=timeoutSeconds"`
    // How often (in seconds) to perform the probe.
    // Default to 10 seconds. Minimum value is 1.
    // +optional
    PeriodSeconds int32 `json:"periodSeconds,omitempty" protobuf:"varint,4,opt,name=periodSeconds"`
    // Minimum consecutive successes for the probe to be considered successful after having failed.
    // Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
    // +optional
    SuccessThreshold int32 `json:"successThreshold,omitempty" protobuf:"varint,5,opt,name=successThreshold"`
    // Minimum consecutive failures for the probe to be considered failed after having succeeded.
    // Defaults to 3. Minimum value is 1.
    // +optional
    FailureThreshold int32 `json:"failureThreshold,omitempty" protobuf:"varint,6,opt,name=failureThreshold"`
}

I want to emphasize:

Must be 1 for liveness and startup.

Suppose we changed it to (whitespace added):

type ReadinessProbe struct {
    // The action taken to determine the health of a container
    Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"`

    // Number of seconds after the container has started before liveness probes are initiated.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=0
    // +validate=NonNegative
    InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty" protobuf:"varint,2,opt,name=initialDelaySeconds"`

    // Number of seconds after which the probe times out.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=1 second
    // +validate=GreaterEqual(1)
    TimeoutSeconds int32 `json:"timeoutSeconds,omitempty" protobuf:"varint,3,opt,name=timeoutSeconds"`

    // How often (in seconds) to perform the probe.
    // +optional
    // +default=10
    // +validate=GreaterEqual(1)
    PeriodSeconds int32 `json:"periodSeconds,omitempty" protobuf:"varint,4,opt,name=periodSeconds"`

    // Minimum consecutive successes for the probe to be considered successful after having failed.
    // +optional
    // +default=1
    // +validate=GreaterEqual(1)
    SuccessThreshold int32 `json:"successThreshold,omitempty" protobuf:"varint,5,opt,name=successThreshold"`

    // Minimum consecutive failures for the probe to be considered failed after having succeeded.
    // +optional
    // +default=3
    // +validate=GreaterEqual(1)
    FailureThreshold int32 `json:"failureThreshold,omitempty" protobuf:"varint,6,opt,name=failureThreshold"`
}

type LivenessProbe struct {
    // The action taken to determine the health of a container
    Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"`

    // Number of seconds after the container has started before liveness probes are initiated.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=0
    // +validate=NonNegative
    InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty" protobuf:"varint,2,opt,name=initialDelaySeconds"`

    // Number of seconds after which the probe times out.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=1 second
    // +validate=GreaterEqual(1)
    TimeoutSeconds int32 `json:"timeoutSeconds,omitempty" protobuf:"varint,3,opt,name=timeoutSeconds"`

    // How often (in seconds) to perform the probe.
    // +optional
    // +default=10
    // +validate=GreaterEqual(1)
    PeriodSeconds int32 `json:"periodSeconds,omitempty" protobuf:"varint,4,opt,name=periodSeconds"`

    // Minimum consecutive failures for the probe to be considered failed after having succeeded.
    // +optional
    // +default=3
    // +validate=GreaterEqual(1)
    FailureThreshold int32 `json:"failureThreshold,omitempty" protobuf:"varint,6,opt,name=failureThreshold"`
}

type StartupProbe struct {
    // The action taken to determine the health of a container
    Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"`

    // Number of seconds after the container has started before liveness probes are initiated.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=0
    // +validate=NonNegative
    InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty" protobuf:"varint,2,opt,name=initialDelaySeconds"`

    // Number of seconds after which the probe times out.
    // More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
    // +optional
    // +default=1 second
    // +validate=GreaterEqual(1)
    TimeoutSeconds int32 `json:"timeoutSeconds,omitempty" protobuf:"varint,3,opt,name=timeoutSeconds"`

    // How often (in seconds) to perform the probe.
    // +optional
    // +default=10
    // +validate=GreaterEqual(1)
    PeriodSeconds int32 `json:"periodSeconds,omitempty" protobuf:"varint,4,opt,name=periodSeconds"`

    // Minimum consecutive failures for the probe to be considered failed after having succeeded.
    // +optional
    // +default=3
    // +validate=GreaterEqual(1)
    FailureThreshold int32 `json:"failureThreshold,omitempty" protobuf:"varint,6,opt,name=failureThreshold"`
}

and

    LivenessProbe *LivenessProbe `json:"livenessProbe,omitempty" protobuf:"bytes,10,opt,name=livenessProbe"`

    ReadinessProbe *ReadinessProbe `json:"readinessProbe,omitempty" protobuf:"bytes,11,opt,name=readinessProbe"`

    StartupProbe *StartupProbe `json:"startupProbe,omitempty" protobuf:"bytes,22,opt,name=startupProbe"`

We can generate 100% of the validation code for this case (probably less for PodSpec, but a good chunk) or maybe even do it at run-time (instead of codegen).

Copy-paste is more likely to be successful here.
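As a rough illustration of the run-time option, a toy validator driven by struct tags (the tag name and grammar here are invented for illustration; nothing like this exists in-tree today):

package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// validate walks a struct's fields and enforces simple rules declared in a
// "validate" tag, in the spirit of the "+validate=GreaterEqual(1)" markers above.
func validate(obj interface{}) []error {
	var errs []error
	v := reflect.ValueOf(obj)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		rule := t.Field(i).Tag.Get("validate")
		if rule == "" {
			continue
		}
		val := v.Field(i).Int()
		switch {
		case rule == "NonNegative" && val < 0:
			errs = append(errs, fmt.Errorf("%s must be >= 0", t.Field(i).Name))
		case strings.HasPrefix(rule, "GreaterEqual("):
			min, _ := strconv.ParseInt(strings.TrimSuffix(strings.TrimPrefix(rule, "GreaterEqual("), ")"), 10, 64)
			if val < min {
				errs = append(errs, fmt.Errorf("%s must be >= %d", t.Field(i).Name, min))
			}
		}
	}
	return errs
}

type readinessProbe struct {
	InitialDelaySeconds int32 `validate:"NonNegative"`
	TimeoutSeconds      int32 `validate:"GreaterEqual(1)"`
	PeriodSeconds       int32 `validate:"GreaterEqual(1)"`
}

func main() {
	// TimeoutSeconds violates its rule, so one error is reported.
	fmt.Println(validate(readinessProbe{TimeoutSeconds: 0, PeriodSeconds: 10}))
}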

Alternatively - what if we didn't apply defaulting to types when they are used under some "barrier" declaration? E.g.:

// PodTemplateSpec describes the data a pod should have when created from a template
type PodTemplateSpec struct {
    // Standard object's metadata.
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
    // +optional
    metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

    // Specification of the desired behavior of the pod.
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
    // +optional
    // +nodefaults
    Spec PodSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
}

?

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/95587#issuecomment-1019751670):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.