kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Federation] Federation should fully support clusters with a previous API version. #44160

Closed marun closed 6 years ago

marun commented 7 years ago

As per a thread on the mailing list, it appears that federation is only possible with member clusters whose API version is >= the API version of the federation control plane. This suggests that the control plane should explicitly check the API version reported by a member cluster and refuse to federate to a cluster with an older API than the one in use by the control plane.

cc: @kubernetes/sig-federation-bugs @kubernetes/sig-api-machinery-misc

marun commented 7 years ago

A naive attempt to support backwards compatible federation: #44131

ghost commented 7 years ago

Maru, it's not entirely clear to me that this is not approximately "working as intended" from the client's point of view (other than the e2e tests failing of course, but that's a separate matter. And presumably this leads to an infinite update loop in the controller, and that should be fixed, but that's not very visible to the client). If this is the case, I'd suggest renaming this issue to something like:

"Federation should fully support clusters with a previous API version."

To make sure that I understand the root problem (and perhaps we should be having this discussion in an issue rather than over email), by means of a concrete example:

  1. ReplicaSet v1.6 is created in the Federation Control Plane (FCP), with or without new optional field values specified (e.g. SchedulerName)
  2. ReplicaSet is propagated by FCP to e.g. a Kubernetes v1.5 cluster (to which the new optional field means nothing, and is hence ignored).
  3. FCP reads the v1.5 ReplicaSet back from the v1.5 cluster, and the optional field (e.g. SchedulerName) is missing (which makes sense, because that field does not exist in v1.5).
  4. FCP naively compares the v1.6 Federated ReplicaSet against the v1.5 ReplicaSet from the cluster, determines them to be non-equivalent (due to the missing SchedulerName field), and hence tries to update the v1.5 ReplicaSet in the cluster, ad infinitum.

So one ends up with an infinite update loop in the Federated ReplicaSet controller, but other than that the client sees approximately correct and expected behavior? i.e. They see a v1.5 ReplicaSet in their v1.5 Cluster consistent with the v1.6 Federated ReplicaSet, except for the fields that don't exist in v1.5?

Off the top of my head, the solution seems to be:

  1. After reading the v1.5 ReplicaSet from the cluster, but before comparing it against the v1.6 ReplicaSet in the FCP, convert the v1.6 Federated ReplicaSet to v1.5 (via the standard mechanism), resulting in a (temporary) v1.5 Federated ReplicaSet (which would not include the new, optional v1.6 SchedulerName field), and compare that against the v1.5 ReplicaSet read from the cluster. These should now match, and there should be no infinite update loop.

Does this make sense?

Q
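The "standard mechanism" referred to above is presumably the scheme-based conversion between registered API versions that the apiserver already performs. A minimal sketch of invoking it from a controller, assuming a scheme with the relevant types registered (the names here are illustrative, and note Maru's follow-up below about kube versions vs. API versions):

    package federationutil

    import (
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/runtime/schema"
    )

    // convertForCluster re-expresses a federated object in the group/version a
    // member cluster is expected to serve, using the same scheme-based conversion
    // machinery the apiserver uses, so that any subsequent comparison is between
    // objects of the same shape.
    func convertForCluster(scheme *runtime.Scheme, obj runtime.Object, gv schema.GroupVersion) (runtime.Object, error) {
        return scheme.ConvertToVersion(obj, gv)
    }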

derekwaynecarr commented 7 years ago

@quinton-hoole -- if the goal of federation is to federate across each member cluster, it seems that at minimum the federation control plane must be <= the version of each member cluster. Let me provide an example. In k8s 1.6, the pod spec has a field "toleration". If I send a pod spec that uses toleration to a federation 1.6 server, it will try to replicate it to each member cluster. If I have member clusters running 1.5, 1.4, etc., that field does not exist there. It could mean that the field is not supplied on the copy being federated to each member cluster. I suspect many users would consider that a bug, and the safer guidance is that your control plane must be <= the version of each member cluster, agree?
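To make the dropped-field behavior concrete, here is a minimal, self-contained Go sketch (the struct names are invented for illustration, not actual Kubernetes types) that simulates what happens when a newer client sends a field that an older server's types do not define: the field is silently dropped on the round trip.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // PodSpecV16 stands in for a 1.6-era spec that knows about tolerations.
    type PodSpecV16 struct {
        Containers  []string `json:"containers"`
        Tolerations []string `json:"tolerations,omitempty"`
    }

    // PodSpecV15 stands in for a 1.5-era spec that has no tolerations field.
    type PodSpecV15 struct {
        Containers []string `json:"containers"`
    }

    func main() {
        sent := PodSpecV16{
            Containers:  []string{"app"},
            Tolerations: []string{"key=value:NoSchedule"},
        }

        // The older server decodes the request into its own types, so the
        // unknown field is silently discarded...
        raw, _ := json.Marshal(sent)
        var stored PodSpecV15
        _ = json.Unmarshal(raw, &stored)

        // ...and what it hands back no longer carries the field.
        returned, _ := json.Marshal(stored)
        fmt.Println(string(returned)) // {"containers":["app"]}
    }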

ghost commented 7 years ago

From: Maru Newby marun@redhat.com
Date: Thu, Apr 6, 2017 at 11:12 AM
Subject: Re: [k8s API machinery] Re: Backwards compatible federation
To: Quinton Hoole quinton@hoole.biz
Cc: Derek Carr decarr@redhat.com, Jonathan MacMillan dvorak@google.com, K8s API Machinery SIG kubernetes-sig-api-machinery@googlegroups.com, kubernetes-sig-federation kubernetes-sig-federation@googlegroups.com

On Thu, Apr 6, 2017 at 10:32 AM, Quinton Hoole quinton@hoole.biz wrote:

Derek, I'm not aware of any such documentation regarding upgrades. That's what Maru's working towards, I believe.

Maru, it's not entirely clear to me that this is not approximately "working as intended" from the client's point of view (other than the e2e tests failing of course, but that's a separate matter. And presumably this leads to an infinite update loop in the controller, and that should be fixed, but that's not very visible to the client).

To make sure that I understand the root problem (and perhaps we should be having this discussion in an issue rather than over email), by means of a concrete example:

  1. ReplicaSet v1.6 is created in the Federation Control Plane (FCP), with or without new optional field values specified (e.g. SchedulerName)
  2. ReplicaSet is propagated by FCP to e.g. a Kubernetes v1.5 cluster (to which the new optional field means nothing, and is hence ignored).
  3. FCP reads the v1.5 ReplicaSet back from the v1.5 cluster, and the optional field (e.g. SchedulerName) is missing (which makes sense, because that field does not exist in v1.5).
  4. FCP naively compares the v1.6 Federated ReplicaSet against the v1.5 ReplicaSet from the cluster, determines them to be non-equivalent (due to the missing SchedulerName field), and hence tries to update the v1.5 ReplicaSet in the cluster, ad infinitum.

So one ends up with an infinite update loop in the Federated ReplicaSet controller, but other than that the client sees approximately correct and expected behavior? i.e. They see a v1.5 ReplicaSet in their v1.5 Cluster consistent with the v1.6 Federated ReplicaSet, except for the fields that don't exist in v1.5?

Correct, the main problem is that the controllers would be constantly updating objects since comparison would always fail.

Off the top of my head, the solution seems to be:

  1. After reading the v1.5 ReplicaSet from the cluster, but before comparing it against the v1.6 ReplicaSet in the FCP, convert the v1.6 Federated ReplicaSet to v1.5 (via the standard mechanism), resulting in a (temporary) v1.5 Federated ReplicaSet (which would not include the new, optional v1.6 SchedulerName field), and compare that against the v1.5 ReplicaSet read from the cluster. These should now match, and there should be no infinite update loop.

Does this make sense?

Your intent is clear, but I don't think what you are proposing is possible today. There is no such thing as a '1.5' or a '1.6' resource. The way backwards compatibility appears to work for the kube API is that a field addition is backwards compatible if it is optional, so that older clients (which don't know about the field) still work. There is nothing in the API machinery that I can see that allows fine-grained determination of which changes were made in a given kube version, which would allow conversion to the set of fields supported by a different kube version. I'd be happy to be proven wrong.

ghost commented 7 years ago

@marun I'm no K8s API versioning expert but @nikhiljindal is, and can hopefully confirm.

Continuing our v1.5 vs v1.6 example, if a v1.5 client connects to a v1.6 cluster, all API responses are converted to v1.5 (internally by the API server) on the return path. That is the "standard mechanism" I referred to above.

Some details in https://github.com/kubernetes/kubernetes/issues/4874

@nikhiljindal , @derekwaynecarr , @caesarxuchao or others more familiar with API versioning should be able to clarify further if required.

derekwaynecarr commented 7 years ago

@quinton-hoole -- my takeaway is that federation supports skew in that member clusters can be later versions, but federation must be <=. I see no other way we can provide any other level of support without losing data.

ghost commented 7 years ago

@derekwaynecarr in response to your comment/question above :

Yes, I agree, as general guidance that might be sensible, although I suspect that the situation is a little more nuanced than that in general, e.g.

  1. If a user uses a client version <= the lowest version of the federated clusters, everything should work hunky-dory, irrespective of what version their Federation is at, and whether or not it is ahead of any of the federated clusters. This, for example, allows the federation control plane to be upgraded ahead of, and independently of, the clusters and Federation clients.
  2. If a user uses a client version ahead of some of the federated clusters, but understands that the federated clusters can only support features commensurate with their individual versions, things should behave as expected. I think that this is roughly equivalent to how other parts of Kubernetes work, for example, creating a LoadBalancer Service in a cluster that does not have load balancers installed results in a Service without a load balancer, not an API error. Same goes for e.g. Ingress.

Q

ghost commented 7 years ago

@derekwaynecarr @marun One potentially useful optional feature that we could consider supporting would be preventing (in the FCP) federation of later versioned objects into earlier versioned clusters. So for example we could support a flag on the scheduler (or an optional field/annotation on each object) specifying whether e.g. v1.6 ReplicaSets should be scheduled/federated to earlier versioned clusters or not. That way users and/or administrators could decide whether they prefer stricter API semantics or more scheduling flexibility.

This would be morally equivalent (using the example above) to allowing users/administrators to choose whether API calls to create e.g. LoadBalancer Services should fail in clusters that do not have external load balancers. That's not supported in Kubernetes today either.

marun commented 7 years ago

@quinton-hoole I think you may be confusing 'kube version' and 'api version'.

So kube 1.6 can add a new optional field to a v1 resource, and a kube 1.5 client has no automatic way of knowing that the response it receives is lossy (i.e. missing data for the field added in 1.6). Similarly, a kube 1.5 server can return a v1 resource to a 1.6 client, and the client has no automatic way of determining that optional fields are returned empty because they aren't present in 1.5 vs haven't been explicitly set to the default value (for the type, not the field).

I certainly agree that finer grained api versioning would be useful, but it looks like either federation is going to have to track changes across kube versions manually or forgo the possibility of allowing backwards compatible federation.

marun commented 7 years ago

As per @smarterclayton's recent comment on the ml thread, it may be possible to automate detection of field changes across kube versions either by diff'ing the fields sent vs the fields returned for a given request or diff'ing swagger docs.

ghost commented 7 years ago

From: Clayton Coleman ccoleman@redhat.com
Date: Thu, Apr 6, 2017 at 12:53 PM
Subject: Re: [k8s API machinery] Re: Backwards compatible federation
To: Maru Newby marun@redhat.com
Cc: Quinton Hoole quinton@hoole.biz, Derek Carr decarr@redhat.com, Jonathan MacMillan dvorak@google.com, K8s API Machinery SIG kubernetes-sig-api-machinery@googlegroups.com, kubernetes-sig-federation kubernetes-sig-federation@googlegroups.com

Fine grained version indications have been proposed, but have a long chunk of work ahead of them.

What are the concrete affordances we could add today? Certainly whenever a controller updates a cluster it could track what fields it sent that were not returned. That could give us a leading indicator even on very old clusters. The swagger doc is also accurate for the most part - the validation code could be used to do a delta comparison and then put a filter / adapter in front of the diff code.
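A rough sketch of the "fields sent but not returned" idea, assuming the sent and returned objects are available in unstructured (map) form; the helper name is invented for illustration:

    package fielddiff

    // DroppedFields walks the fields that were sent to a member cluster and
    // returns the paths of any that are absent from the object the cluster
    // returned. Nested maps are compared recursively; leaf values only need to
    // be present, not equal.
    func DroppedFields(sent, returned map[string]interface{}, prefix string) []string {
        var dropped []string
        for key, sentVal := range sent {
            path := prefix + key
            returnedVal, ok := returned[key]
            if !ok {
                dropped = append(dropped, path)
                continue
            }
            sentMap, sentIsMap := sentVal.(map[string]interface{})
            returnedMap, returnedIsMap := returnedVal.(map[string]interface{})
            if sentIsMap && returnedIsMap {
                dropped = append(dropped, DroppedFields(sentMap, returnedMap, path+".")...)
            }
        }
        return dropped
    }

Calling DroppedFields(sent, returned, "") after the first update against a 1.5 cluster might, for example, yield spec.template.spec.tolerations.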

ghost commented 7 years ago

From: Maru Newby marun@redhat.com
Date: Thu, Apr 6, 2017 at 2:06 PM
Subject: Re: [k8s API machinery] Re: Backwards compatible federation
To: Clayton Coleman ccoleman@redhat.com
Cc: Quinton Hoole quinton@hoole.biz, Derek Carr decarr@redhat.com, Jonathan MacMillan dvorak@google.com, K8s API Machinery SIG kubernetes-sig-api-machinery@googlegroups.com, kubernetes-sig-federation kubernetes-sig-federation@googlegroups.com

Both options sound promising. Is there any existing machinery that would enable diffing of fields across requests?

The approach of only comparing fields actually returned in a request seems simpler than the swagger option. Are there any advantages to the swagger approach that I might be missing?

ghost commented 7 years ago

---------- Forwarded message ----------
From: Clayton Coleman ccoleman@redhat.com
Date: Thu, Apr 6, 2017 at 2:33 PM
Subject: Re: [k8s API machinery] Re: Backwards compatible federation
To: Maru Newby marun@redhat.com
Cc: Quinton Hoole quinton@hoole.biz, Derek Carr decarr@redhat.com, Jonathan MacMillan dvorak@google.com, K8s API Machinery SIG kubernetes-sig-api-machinery@googlegroups.com, kubernetes-sig-federation kubernetes-sig-federation@googlegroups.com

On Apr 6, 2017, at 5:06 PM, Maru Newby marun@redhat.com wrote:

Both options sound promising. Is there any existing machinery that would enable diffing of fields across requests?

The approach of only comparing fields actually returned in a request seems simpler than the swagger option. Are there any advantages to the swagger approach that I might be missing?

Probably that you can do it up front and potentially prevent end users from setting those / being surprised. The problem with the "per-request" approach is that you don't have the info at the time the user submits.

On the other hand, clusters are going to be upgraded underneath you, and so at any point in an app's lifecycle it might degrade. We probably should expect clusters to come and go over time. Having a continuous feedback loop is valuable for sure.

ghost commented 7 years ago

An alternative idea I had was to do the version conversion automatically in the typed clientset, e.g.

import clientset_v1_5 "k8s.io/kubernetes/pkg/client/clientset_generated/release_1_5"

Then, when a version 1.5 cluster is being synced, the FCP uses the object fetched from the FCP API via the above clientset to compare against the object retrieved from that cluster, so that the object versions match and the comparison is automatically correct.

Of course this means retrieving (and/or caching) one object per cluster version from the FCP API (in the FCP controller, in the FederatedInformer). But given that the number of cluster versions in a given federation should be relatively small (~3), that overhead (both in terms of CPU and cache memory) should be negligible. And it also automatically addresses @smarterclayton's concern above regarding clusters being upgraded under the FCP (as these would be automagically re-registered as their newer versions, and the above caching would adapt automatically).
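A rough sketch of that idea, reusing the release_1_5 clientset import above against both the FCP API and the member cluster so that both objects come back in the same (1.5) shape; the exact accessor and method names on the generated clientset are assumptions here, not verified API:

    package sync

    import (
        "reflect"

        clientset_v1_5 "k8s.io/kubernetes/pkg/client/clientset_generated/release_1_5"
    )

    // objectsMatch fetches the "same" ReplicaSet from the FCP API and from a 1.5
    // member cluster through the same 1.5-generated clientset, so that a plain
    // comparison no longer trips over fields the older version does not define.
    func objectsMatch(fcp, cluster clientset_v1_5.Interface, namespace, name string) (bool, error) {
        desired, err := fcp.Extensions().ReplicaSets(namespace).Get(name)
        if err != nil {
            return false, err
        }
        actual, err := cluster.Extensions().ReplicaSets(namespace).Get(name)
        if err != nil {
            return false, err
        }
        // Both specs are now expressed in the 1.5 types, so fields missing in
        // 1.5 cannot cause a spurious difference.
        return reflect.DeepEqual(desired.Spec, actual.Spec), nil
    }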

ghost commented 7 years ago

I updated the title. Let me know if you disagree @marun

marun commented 7 years ago

@smarterclayton

On Thu, Apr 6, 2017 at 2:33 PM, Clayton Coleman ccoleman@redhat.com wrote:

Probably that you can do it up front and potentially prevent end users from setting those / being surprised. The problem with the "per-request" approach is that you don't have the info at the time the user submits.

I'm not sure what you mean since the discussion was about propagating resources to member clusters. How would user submission come into play?

On the other hand, clusters are going to be upgraded underneath you, and so at any point in an app's lifecycle it might degrade. We probably should expect clusters to come and go over time. Having a continuous feedback loop is valuable for sure.

The 'per-request' approach could presumably be simplified to cache field differences for a cluster after the first update had been performed against that cluster for a given type. The cache could be invalidated if the watch for the cluster for that type was interrupted for any reason to ensure the differences would be recomputed on the next update. That would cover the case of cluster upgrade. And it would be relatively cheap - if field differences existed between the federation control plane and a cluster, a single (potentially unnecessary) update would be enough to get field differences to cache.
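A minimal sketch of the per-cluster, per-type cache described above (all names invented for illustration): the dropped-field paths learned from the first update are stored, and invalidated whenever the watch for that cluster/type is interrupted so they get recomputed after an upgrade.

    package fieldcache

    import "sync"

    type key struct {
        cluster string
        kind    string // e.g. "extensions/v1beta1, Kind=ReplicaSet"
    }

    // UnsupportedFieldCache remembers, per member cluster and resource type, which
    // field paths that cluster was observed to drop.
    type UnsupportedFieldCache struct {
        mu      sync.Mutex
        entries map[key][]string
    }

    func New() *UnsupportedFieldCache {
        return &UnsupportedFieldCache{entries: map[key][]string{}}
    }

    // Record stores the dropped-field paths learned from the first update.
    func (c *UnsupportedFieldCache) Record(cluster, kind string, dropped []string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.entries[key{cluster, kind}] = dropped
    }

    // Lookup returns the cached differences, if any.
    func (c *UnsupportedFieldCache) Lookup(cluster, kind string) ([]string, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        d, ok := c.entries[key{cluster, kind}]
        return d, ok
    }

    // Invalidate is called when the watch for cluster/kind is interrupted, so the
    // differences are recomputed on the next update (covering cluster upgrades).
    func (c *UnsupportedFieldCache) Invalidate(cluster, kind string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.entries, key{cluster, kind})
    }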

marun commented 7 years ago

@quinton-hoole Are you suggesting having an informer for each of a fixed set of versions for any watched federated type? Or having the set of versions be configurable, or determined dynamically from the observed versions of member clusters?

How would the version of member clusters be determined? Is there something at the client layer that can provide that detail?

Is there a potential for client version skew? Kube ensures backwards compatibility of the api by requiring added fields to be optional. Is there any policy around field addition in minor version releases? For example, if the federation control plane was upgraded to 1.6.1 which included a field addition, how would the control plane be able to compare resources against a member cluster still running 1.6.0?

lavalamp commented 7 years ago

(copy from email thread)

This is an interesting problem. Note that our go client has the exact same issue, where it drops fields it fails to understand.

I strongly recommend you do NOT do member-by-member comparison to check for equality, that is very brittle, even non-semantic formatting changes with the same wire format can break it (happened to daemonset, e.g. 0 length vs nil slices).

One approach is to record the metadata.generation returned after a POST or PUT, and assume you do not need to sync again unless (a) the source object changes or (b) the destination generation changes. (a) itself should be detected by changes to the source object's metadata.generation. In fact I'm not sure that there are any reasonable alternatives to this approach.
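A minimal sketch of that bookkeeping, with invented names: after a successful POST or PUT the controller records both generations, and it only resyncs when either side's metadata.generation moves.

    package generationtrack

    // observed records what the controller saw the last time it wrote an object
    // into a member cluster.
    type observed struct {
        sourceGeneration int64 // metadata.generation of the federated object
        destGeneration   int64 // metadata.generation returned by the member cluster
    }

    type tracker struct {
        // keyed by cluster name plus object key, e.g. "cluster-a/ns/name"
        last map[string]observed
    }

    // RecordSync is called after a successful POST or PUT to a member cluster.
    func (t *tracker) RecordSync(key string, srcGen, destGen int64) {
        if t.last == nil {
            t.last = map[string]observed{}
        }
        t.last[key] = observed{sourceGeneration: srcGen, destGeneration: destGen}
    }

    // NeedsSync is true if either the federated object or the cluster copy has a
    // generation we have not seen, i.e. one of them changed since the last write.
    func (t *tracker) NeedsSync(key string, srcGen, destGen int64) bool {
        prev, ok := t.last[key]
        if !ok {
            return true
        }
        return prev.sourceGeneration != srcGen || prev.destGeneration != destGen
    }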

bgrant0607 commented 7 years ago

We have similar problems even within a single cluster, particularly due to apiserver-kubelet skew: #4855.

lavalamp commented 7 years ago

For example, if the federation control plane was upgraded to 1.6.1 which included a field addition, how would the control plane be able to compare resources against a member cluster still running 1.6.0?

We shouldn't be adding fields in patch releases like that, but if we did, it would be extremely difficult to handle with this proposal, since you'd have to link in 1.6.0 client AND a 1.6.1 client.

For single clusters, the update rule is that you have to update the control plane before the nodes, and you have to update apiserver before the rest of the control plane. We (back in the early days) decided that testing both back AND forwards compat was too much. I think federation needs a similar rule. Maybe 0 forward versions on delegate clusters isn't the right rule, but you at least need to bound this & test it.

bgrant0607 commented 7 years ago

Hmm. The appropriate sig-api-machinery teams don't exist, so I can't notify them.

Before any new API mechanism is developed, it should be discussed in SIG API machinery.

You may be interested in the field gating proposal:

https://docs.google.com/document/d/1wuoSqHkeT51mQQ7dIFhUKrdi3-1wbKrNWeIL4cKb9zU/edit#

lavalamp commented 7 years ago

Oh, I thought @czahedi was going to make those.

bgrant0607 commented 7 years ago

@lavalamp FYI, when I last checked, not all resources set metadata.generation.

bgrant0607 commented 7 years ago

@janetkuo, since it's similar-ish to the Deployment<->ReplicaSet sync'ing problem.

czahedi commented 7 years ago

I am making the API Machinery email lists, GitHub users, etc. now.

bgrant0607 commented 7 years ago

FYI, metadata.generation is not set for every resource yet.

#7328

nikhiljindal commented 7 years ago

From @quinton-hoole

@derekwaynecarr @marun One potentially useful optional feature that we could consider supporting would be preventing (in the FCP) federation of later versioned objects into earlier versioned clusters. So for example we could support a flag on the scheduler (or an optional field/annotation on each object) specifying whether e.g. v1.6 ReplicaSets should be scheduled/federated to earlier versioned clusters or not. That way users and/or administrators could decide whether they prefer stricter API semantics or more scheduling flexibility.

This would be morally equivalent (using the example above) to allowing users/administrators to choose whether API calls to create e.g. LoadBalancer Services should fail in clusters that do not have external load balancers. That's not supported in Kubernetes today either.

Yes, I agree. Ideally, users should be able to indicate what should happen if the FCP cannot guarantee all the fields of a ReplicaSet in all clusters. Is it fine for the FCP to make a best effort (i.e. to create a ReplicaSet with dropped fields if some clusters do not support a field set by the user, which is what we do today), or should it generate an error? Is it even possible for the FCP to guarantee this, since a cluster can be downgraded later?

nikhiljindal commented 7 years ago

From @marun @smarterclayton

Probably that you can do it up front and potentially prevent end users from setting those / being surprised. The problem with the "per-request" approach is that you don't have the info at the time the user submits.

I'm not sure what you mean since the discussion was about propagating resources to member clusters. How would user submission come into play?

I assume @smarterclayton meant that with the swagger spec the FCP can know up front which fields are supported by each cluster. So if a user submits a ReplicaSet with a field that one of the clusters does not support, the FCP can potentially reject that ReplicaSet before even trying to propagate it to the underlying clusters.

On the other hand, clusters are going to be upgraded underneath you, and so at any point in an app's lifecycle it might degrade. We probably should expect clusters to come and go over time. Having a continuous feedback loop is valuable for sure.

This can be solved by running the "figure out supported fields using the swagger spec" logic every time a cluster goes from Ready to NotReady. Or we can require admins to unjoin and rejoin a cluster if it is upgraded/downgraded.

The 'per-request' approach could presumably be simplified to cache field differences for a cluster after the first update had been performed against that cluster for a given type. The cache could be invalidated if the watch for the cluster for that type was interrupted for any reason to ensure the differences would be recomputed on the next update. That would cover the case of cluster upgrade. And it would be relatively cheap - if field differences existed between the federation control plane and a cluster, a single (potentially unnecessary) update would be enough to get field differences to cache.

For this to work, the first update that tries to figure out which fields are supported by the underlying cluster will have to have all fields set (so that it can find out which fields were dropped).
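A rough sketch of the up-front check described above, with all names invented: given a per-cluster set of supported field paths (however derived, e.g. from the swagger spec), the FCP can reject an object whose set fields are not representable everywhere, instead of silently dropping them.

    package admit

    import "fmt"

    // supportedFields maps a member cluster name to the set of field paths its
    // API version can represent for a given resource (e.g. derived from that
    // cluster's swagger spec).
    type supportedFields map[string]map[string]bool

    // checkPropagation returns an error naming the first cluster that cannot
    // represent one of the fields the user actually set on the object, so the
    // FCP can reject the object up front instead of silently dropping data.
    func checkPropagation(setFields []string, clusters supportedFields) error {
        for cluster, supported := range clusters {
            for _, field := range setFields {
                if !supported[field] {
                    return fmt.Errorf("cluster %s does not support field %s; refusing to propagate", cluster, field)
                }
            }
        }
        return nil
    }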

nikhiljindal commented 7 years ago

In summary, we are discussing 4 issues here:

1. How to determine what fields are supported in each cluster

Proposed solutions: use the swagger spec, or send an update request and compare the request and response objects. Using the swagger spec seems better to me.

2. What to do if a field is not supported in a cluster

Generate an error, or make a best effort (create the object with the unsupported fields dropped). Best effort is what we do today. I agree that the user should be able to specify the behavior they want.

3. How to compare objects.

Today we do a field-by-field comparison of the desired state and current state of a resource in a cluster. The proposal is to change that to send an update and verify that metadata.generation was updated. This requires us to finish https://github.com/kubernetes/kubernetes/issues/7328 first.

4. How much version skew can we support between FCP and clusters

I think if we do all of the above, i.e. we are able to detect when a cluster does not support a field set by the user and we give the user the choice of what to do in that case, then we would not have to limit the FCP version to the minimum of all cluster versions.

Let me know if I missed anything.

ghost commented 7 years ago

@nikhiljindal Yup, I think you got most of it. Two minor comments:

Re 4: besides that, I don't think it's at all feasible to mandate that the FCP always gets updated before its member clusters, because member clusters are administered by different people/organizations than the FCP. So clusters can be upgraded or downgraded at any time. Clusters and their administrators might be completely unaware that their cluster has been added to a federation, and that is the right thing to do, given the intentional decoupling of the FCP and the clusters.

Regarding 1 above, I also suggested an option of FCP controllers using a correctly versioned client library for interacting with each cluster, and the same versioned client library to interact with the FCP API. I'm not sure, but I suspect that this is morally equivalent to using the swagger spec.

marun commented 7 years ago

@nikhiljindal How would the swagger spec be consumed by the FCP? Would a member cluster's api automatically serve it?

Issue 2 suggests yet another use case for enhanced status reporting. Currently logs provide the only indication of a resource's propagation status, and logs are unlikely to be accessible to a non-admin user.

nikhiljindal commented 7 years ago

from @quinton-hoole

Re 4: besides that, I don't think it's at all feasible to mandate that the FCP always gets updated before its member clusters, because member clusters are administered by different people/organizations than the FCP. So clusters can be upgraded or downgraded at any time. Clusters and their administrators might be completely unaware that their cluster has been added to a federation, and that is the right thing to do, given the intentional decoupling of the FCP and the clusters.

That's very well said and I agree. We should be able to detect version skew and define what happens in that case (point 2 in my summary).

from @marun

@nikhiljindal How would the swagger spec be consumed by the FCP? Would a member cluster's api automatically serve it?

It already does. kubectl also uses the swagger spec served by the API server to do client-side validation. You can access it at <masterIP>/swaggerapi.
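For illustration, a minimal sketch of pulling that spec directly over HTTP (auth/TLS and error handling kept to a minimum; treat the exact path and response shape as assumptions about the 1.x-era endpoint):

    package swaggercheck

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // fetchSwagger pulls the raw swagger document for one API group/version from
    // a member cluster's apiserver, e.g.
    //   fetchSwagger("https://<masterIP>", "/swaggerapi/api/v1")
    // Auth and TLS setup are omitted; a real client would go through the
    // cluster's rest.Config rather than a bare http.Get.
    func fetchSwagger(masterURL, path string) (map[string]interface{}, error) {
        resp, err := http.Get(masterURL + path)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("unexpected status %s fetching %s", resp.Status, path)
        }
        var doc map[string]interface{}
        if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
            return nil, err
        }
        // The document's "models" section describes each type's fields, which is
        // what a per-cluster supported-field check like the one above would consume.
        return doc, nil
    }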

marun commented 7 years ago

It already does. kubectl also uses the swagger spec served by the API server to do client-side validation. You can access it at <masterIP>/swaggerapi.

Does that mean that using the swagger spec would enable discovery of field differences for any api version? And can the kubernetes client also configure itself using swagger such that the benefits of @quinton-hoole's proposal could be realized without having to compile each specific client version into the controller manager?

liggitt commented 7 years ago

It already does. kubectl also uses the swagger spec served by the API server to do client-side validation. You can access it at <masterIP>/swaggerapi.

This validation is error-prone and we're trying to remove it - https://github.com/kubernetes/kubernetes/issues/39434

nikhiljindal commented 7 years ago

from @marun

Does that mean that using the swagger spec would enable discovery of field differences for any api version?

Yes. We will be able to detect what resources and fields are supported by each cluster using its swagger spec.

And can the kubernetes client also configure itself using swagger

I am not sure what you mean by the kubernetes client "configuring itself" using swagger, but I don't see how the typed client that we use can change its codec or any version conversion logic. Is that what you meant?

marun commented 7 years ago

@nikhiljindal

I am not sure what you mean by the kubernetes client "configuring itself" using swagger, but I don't see how the typed client that we use can change its codec or any version conversion logic. Is that what you meant?

That's what I was asking, yes. It would seem useful for the client to avoid hard-coding conversion logic and instead be able to configure itself to work with any given server by consuming swagger.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

irfanurrehman commented 6 years ago

This issue was labelled only for sig/multicluster and is thus moved over to kubernetes/federation#209. If this does not seem to be right, please reopen this and notify us @kubernetes/sig-multicluster-misc. /close