kubernetes / kubectl

Issue tracker and mirror of kubectl code

kubectl rollout undo should warn about undefined behaviour with kubectl apply #1301

Open q42jaap opened 4 years ago

q42jaap commented 4 years ago

What would you like to be added: Make kubectl rollout undo and kubectl apply work together. kubectl rollout undo doesn't revert the kubectl.kubernetes.io/last-applied-configuration annotation, which leads to problems if named objects in arrays are intended to be removed.

Why is this needed: I ran into the issue that I had an old container still being present after applying a rename of a sidecar. Symptoms look like these issues: https://github.com/kubernetes/kubernetes/issues/63982 and https://github.com/kubernetes/kubernetes/issues/25390.

What happened to me is as follows:

The rollout undo command doesn't roll back kubectl.kubernetes.io/last-applied-configuration, which doesn't make sense imho.

I reported this to Google Support, with a reproduction, their answer was this:

> We do not typically recommend using "rollout" to be honest; in general we would recommend re-applying the old config.

There are a lot of cases where the old config is no longer available to the user, especially since rollout undo is typically only used when something has gone wrong. At these times you want kubectl to be your friend who has your back…
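
To make the mismatch concrete, the recorded state can be inspected directly. A minimal sketch, with my-app as a placeholder Deployment name (substitute your own object managed with kubectl apply):

```sh
# Configuration recorded by the most recent kubectl apply
# (the kubectl.kubernetes.io/last-applied-configuration annotation)
kubectl apply view-last-applied deployment/my-app

# Containers actually present in the live pod template; after a
# rollout undo these two views can disagree, because the annotation
# is not rolled back along with the spec
kubectl get deployment/my-app -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'
```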

q42jaap commented 4 years ago

/sig cli

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubectl/issues/1301):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

pacoxu commented 3 years ago

[root@daocloud ~]# kubectl rollout history deployment/mysql
deployment.apps/mysql
REVISION  CHANGE-CAUSE
2         <none>
3         custom message revision 3
5         custom message revision 5
6         custom message revision 6

[root@daocloud ~]# kubectl rollout  undo deployment/mysql
deployment.apps/mysql rolled back
[root@daocloud ~]# kubectl rollout history deployment/mysql
deployment.apps/mysql
REVISION  CHANGE-CAUSE
2         <none>
3         custom message revision 3
6         custom message revision 6
7         custom message revision 5

I think this is expected. When you do rollout undo, the previous revision is rolled out again as a new revision, and it keeps that revision's change-cause.

Not sure how we can improve the current behavior here.

If you want to change the change-cause after rollout undo, you can update it as below:

Annotate the Deployment with `kubectl annotate deployment.v1.apps/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"`
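
For example, a sketch of that workflow reusing the nginx-deployment name from the kubectl docs (substitute your own Deployment):

```sh
# Roll back, then record why in the change-cause annotation
kubectl rollout undo deployment/nginx-deployment
kubectl annotate deployment.v1.apps/nginx-deployment \
  kubernetes.io/change-cause="rolled back to previous image" --overwrite

# The newest revision now shows the updated change-cause
kubectl rollout history deployment/nginx-deployment
```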

johngmyers commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@johngmyers: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubectl/issues/1301):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 3 years ago

@q42jaap: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

johngmyers commented 3 years ago

/remove-lifecycle rotten

I also ran into this. A service had changed the names of containers and volumes and that apply had gotten rolled back. Upon rolling forward again, the old containers/volumes were not removed.

johngmyers commented 3 years ago

Reproduction procedure:

cat >sample.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  labels:
    app: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: foo
          image: gcr.io/google_containers/pause-amd64:3.0
EOF
cat >sample-new.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  labels:
    app: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: bar
          image: gcr.io/google_containers/pause-amd64:3.0
EOF
kubectl apply -f sample.yaml
kubectl apply -f sample-new.yaml
kubectl rollout undo deployment/test-app
kubectl apply -f sample-new.yaml

Expected result: a Deployment containing one container named `bar`.
Actual result: a Deployment containing two containers, `foo` and `bar`.
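
One way to check the outcome of the last step (a sketch; this just lists the container names in the pod template):

```sh
# With the bug, this prints "foo bar" after the final apply
# instead of only "bar"
kubectl get deployment/test-app -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'
```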

johngmyers commented 3 years ago

@pacoxu your comment does not address the issue as described. This is about kubectl.kubernetes.io/last-applied-configuration, not kubernetes.io/change-cause.

johngmyers commented 3 years ago

/kind bug
/remove-kind feature

pacoxu commented 3 years ago

I still consider this expected behavior.

kubectl.kubernetes.io/last-applied-configuration should only be set when the user runs kubectl apply; kubectl rollout is not supposed to change that annotation.

/cc @eddiezane

johngmyers commented 3 years ago

In the current behavior, after a kubectl rollout undo, the kubectl.kubernetes.io/last-applied-configuration annotation refers to a configuration that is no longer the last applied.

As I have demonstrated, the current behavior causes subsequent kubectl apply commands to sometimes put the resource into an unexpected state.

adamdougal commented 3 years ago

This is the cause of this issue https://github.com/kubernetes/kubernetes/issues/77058

KnVerey commented 2 years ago

We discussed this at last week's SIG meeting and will follow up again next week after taking some time to reflect. You can listen to the conversation starting at https://youtu.be/Wh6eU1AfHBY?t=294. Notes follow.

This is happening because the implementation of kubectl rollout undo essentially copies the spec of the targeted ReplicaSet into the Deployment, without touching the last-applied annotation. That means last-applied continues to reflect the spec from the apply operation that created the newer ReplicaSet we moved off of, not the older one whose spec is now back in the Deployment. Because of this, the strategic merge calculation performed by the next kubectl apply will be making an illogical and potentially invalid comparison. In the case where the set of fields under management differ among the three points in time in question, apply may draw conclusions about field ownership that are perfectly correct given the inputs, but incorrect given the real history / actual user intent.

In @johngmyers' example, when the rollback happens, the Deployment ends up in a state where the spec has container "foo", and the last-applied annotation has container "bar". When apply subsequently runs with container "bar" added again, it sees only "bar" in the annotation. It therefore assumes that a different actor added "foo" and it should be left alone.

This is impossible to fix with code alone: we'd need to start keeping more historical data than we do today. Specifically, we'd need to preserve the last-applied information by copying it to the ReplicaSet. It is deliberately not being copied over today, because doing so would cause apply to think it manages the ReplicaSet directly (it doesn't). However, we could copy the data over into a different annotation, and then perform the reverse translation when copying back during a rollback. Something similar would need to be done for the newer metadata.managedFields used by server-side apply as well. Overall, it would be a significant increase in the size of all ReplicaSet objects created from Deployments managed by apply, for all users. We need to carefully consider whether this bug fix is worth that non-trivial cost.
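
As an aside, for anyone hitting this today: if the manifest matching the rolled-back revision is still at hand, the annotation can be re-recorded manually before the next apply. A sketch using the reproduction's files above; treat it as an assumption rather than an endorsed procedure:

```sh
# After `kubectl rollout undo deployment/test-app`, the live spec matches
# sample.yaml again, but the last-applied annotation still reflects sample-new.yaml.
# Re-record the rolled-back state so the next apply computes the correct diff:
kubectl apply set-last-applied -f sample.yaml

# Rolling forward now removes the old container as expected
kubectl apply -f sample-new.yaml
```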

SIG-CLI often faces challenges related to the interaction between the imperative and declarative commands kubectl provides. For example, making changes with kubectl edit between apply operations is capable of causing a very similar problem. My personal opinion is that rollout undo falls in a grey area between the two types of command. The user's intent in this case isn't to make a manual change; it's to roll back the entire Deployment object to its previous state. Where the apply data is stored is an implementation detail they may have no awareness of, and there's nothing that signals to them that the rollback will be incomplete in an important way. Also, unlike the CRUD imperative commands, rollout does not strike me as something we can call intended for experimental rather than production use. On the contrary, it sounds like something you should be able to reach for in an emergency.

Points to reflect on and come back to at next week's meeting:

cc @seans3 @eddiezane @ant31

KnVerey commented 2 years ago

One more note: the core k8s docs promote rollout undo as the way to roll back changes to a Deployment or DaemonSet, no caveats provided. In fact, the DaemonSet version says:

> kubectl rollout undo is equivalent to updating DaemonSet template to a previous revision through other commands, such as kubectl edit or kubectl apply.

cc @janetkuo for your thoughts as the original author of this feature

janetkuo commented 2 years ago

kubectl rollout undo is an imperative command, similar to kubectl edit, and it was not intended to be used with kubectl apply in its original design. For kubectl apply users, modifying configs (e.g. reverting changes) and then applying them through kubectl apply is the declarative way to do a rollback.

KnVerey commented 2 years ago

As promised, we discussed this again at today's SIG-CLI meeting. You can watch the conversation starting at https://youtu.be/l2plzJ9MRlk?t=2517.

The conclusion was that the distinction between imperative and declarative workflows is important and justifies the related stance that this command is behaving as expected and should not be changed. That said, we're not doing a good enough job of communicating that distinction to users and helping them avoid problems with the interaction between the two strategies. Here are the potential action items that came out of the conversation:

johngmyers commented 2 years ago

If kubectl rollout undo is not intended to work with kubectl apply then it should remove kubectl apply's annotation, not leave it in an incorrect state.

mishamo commented 2 years ago

Would there be appetite for the "expected" behaviour under a different command that could be marked as "safe for use with the declarative workflow"? Perhaps something like kubectl apply rollback?

johngmyers commented 2 years ago

How would an imperative workflow configure a resource that could then safely be targeted by rollout undo, if not with apply?

KnVerey commented 2 years ago

> If kubectl rollout undo is not intended to work with kubectl apply then it should remove kubectl apply's annotation, not leave it in an incorrect state.

Unfortunately that would exacerbate the problem for the users this affects, since the result would be apply not owning any of the fields it previously did as of the rollback.

> Would there be appetite for the "expected" behaviour under a different command that could be marked as "safe for use with the declarative workflow"? Perhaps something like kubectl apply rollback?

No, because the declarative way to roll back is to re-apply the previous state, as @soltysh reiterated in the meetings.

> How would an imperative workflow configure a resource that could then safely be targeted by rollout undo, if not with apply?

Apply by definition is not imperative. If you choose imperative configuration management as your strategy (create/edit/patch/replace/set etc.), then this bug report is irrelevant to you. The problem only happens if you are already using apply, then use rollout undo, then use apply again. The docs contrasting these techniques are here: https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management.
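
For illustration, the declarative rollback amounts to re-applying the manifests from whatever source of truth holds the previous state. A sketch, assuming the manifests are versioned in git under a manifests/ directory:

```sh
# <known-good-ref> is a placeholder for the commit or tag to return to.
# Restore the manifests as they were at that revision, then re-apply them.
git checkout <known-good-ref> -- manifests/
kubectl apply -f manifests/
```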

mishamo commented 2 years ago

> No, because the declarative way to roll back is to re-apply the previous state, as @soltysh reiterated in the meetings.

That makes sense if the previous state is readily available. If it isn't available from some other source (e.g. git history or a past Helm chart, etc.) and it is available through the last-applied-configuration annotation, it feels like an invaluable convenience to be able to roll back in this manner.

KnVerey commented 2 years ago

> That makes sense if the previous state is readily available. If it isn't available from some other source (e.g. git history or a past Helm chart, etc.) and it is available through the last-applied-configuration annotation, it feels like an invaluable convenience to be able to roll back in this manner.

It isn't ever available in the last-applied annotation. That only contains the state that was applied in creating the current version. No information is stored about the state applied to create any previous versions. That's why this incompatibility with apply exists and could not be resolved without a massive increase in storage usage, as explained in my first comment.

johngmyers commented 2 years ago

> Unfortunately that would exacerbate the problem for the users this affects, since the result would be apply not owning any of the fields it previously did as of the rollback.

When a procedure is invalid, having it fail frequently and noticeably is better than having it rarely fail. Procedures that reliably fail teach people to not perform them. Things that rarely fail, in subtle ways, are what lead to disasters.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubectl/issues/1301):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sftim commented 2 years ago

I'm going to reopen this so I can transfer it to the kubectl repo.

sftim commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@sftim: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubectl/issues/1301):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sftim commented 2 years ago

/transfer kubectl

sftim commented 2 years ago

/remove-lifecycle rotten
/lifecycle stale

eddiezane commented 2 years ago

/assign @seans3

@seans3 will be adding a warning about a possible conflict when using rollout undo while a last-applied annotation is present.

seans3 commented 2 years ago

/remove-lifecycle stale
/lifecycle frozen

seans3 commented 2 years ago

/triage accepted

NOTE: As @eddiezane said, this effort will not attempt to make kubectl rollout (an imperative command) interact cleanly with kubectl apply (a declarative command); it will only add a warning message advising users not to mix the two commands.

KnVerey commented 2 years ago

/retitle kubectl rollout undo should warn about undefined behaviour with kubectl apply

(I think it is misleading to accept this as previously titled; alternatively, we can revert the title and the triage and create a separate issue for the warning.)

Gikkman commented 2 years ago

Down the line, it might be a good idea to consider adding a similar warning when you mix commands from different groups. You could, for example, annotate the deployment with something like command-group: declarative or command-type: imperative and warn based on that, or use some other existing metadata if there is any.

With that said, rollout undo is probably one of the more common and/or important cases, so it makes sense to start there.
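
Purely as a hypothetical illustration of that idea (the annotation key below is invented, not an existing kubectl convention), the marker could be set like this:

```sh
# Hypothetical: record which management style last touched the object,
# so commands from the other style could warn before making changes.
kubectl annotate deployment/test-app example.com/command-group=declarative --overwrite
```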

sftim commented 2 years ago

This issue is now about issuing a specific warning. Feel free to log separate issues for any different improvements folks would like to suggest.

k8s-triage-robot commented 10 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted