moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0
3.35k stars 612 forks source link

Rolling update causes "out of sequence" errors for the client. #1379

Open tonistiigi opened 8 years ago

tonistiigi commented 8 years ago

Rolling update internally changes properties for the service, causing the version to increment. This means that user can randomly get "out of sequence" errors. The solution should be to have a separate version for the spec part only so that it only covers all the properties of the service that a user can change.

This issue is causing TestSwarmPublishAdd to fail in Docker CI.

cc @aaronlehmann

aaronlehmann commented 8 years ago

ping @aluzzardi

stevvooe commented 8 years ago

This is a feature. Clients that use zk or etcd have to deal with this, as well.

This allows the user to ensure that decisions are made on up to date data.

We can probably add retries if we don't like this behavior.

tonistiigi commented 8 years ago

@stevvooe "up to date data" is an arbitrary concept. Some users may want to make a decision based on the number of tasks running, the network used by service changing etc. I don't see why we should pick some random ones (UpdateStatus in this case) and provide the user a way to have a dependency on that field, while these fields are no way even visible to our client during update.

The versioning is very useful for avoiding collisions between user requests and making sure that the update matches user's intention. I'm just suggesting that we would limit that only to the fields that user can actually control.

stevvooe commented 8 years ago

@tonistiigi Anything that is observable by the user is a possible change in the data dependency, meaning that the user may have made a decision on it.

Breaking this guarantee will have untold effects on the system. If we want there to be different versioning semantics, we need to have separate objects. That is the data model. When we break the data model, we break the abstraction and make the system harder to reason about.

If we were building a consistent key value store, this wouldn't even be a question. It is wrong to apply updates to an object that are made with out of date data and usually APIs require something to get around those style of guarantees.

If we bypass this, we may as well get rid of raft, as it is a waste.

aaronlehmann commented 8 years ago

@stevvooe: I really don't understand your perspective. The user can only change the spec. Why would changes to the spec be informed by real-time progress of a rolling update? How would that even be possible with the current CLI design, since there's no check/set primitive? (docker service update doesn't give you a way to only update if the rolling update is in a certain state, and I don't see us adding that).

You're saying it's okay for updates to separate objects to have different versioning semantics, but why do you draw the line there instead of drawing it between the spec and the observed quantities? The ServiceUpdate control API only takes a service spec, not a whole service. It feels strange to me that the versioning for the broader service object comes into play for this.

stevvooe commented 8 years ago

We basically already have a check/set primitive with versioning.

The user modified fields are in the Spec and the result of that is in the surrounding object, which is versioned as a unit. This is the data model. We don't version sub-components of an object. Introducing this to services will just create an inconsistency in the API that we'll never return from.

And let's not think about the "current design". Let's actually think about the future and avoid introducing more and more micro-complexities. Just because we're not doing something today, doesn't mean we should sacrifice the properties of our data model. This solution just introduces more technical debt that others have to code around in the future.

If we want to version a separate component, let's make a separate component. I don't see what is so hard about that.

sylvainmouquet commented 7 years ago

The error occures today with docker version 17.07.0-ce

$ docker version

Client:
 Version:      17.07.0-ce
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:42:53 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.07.0-ce
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:41:43 2017
 OS/Arch:      linux/amd64
 Experimental: false
vce-xx commented 6 years ago

@stevvooe What do you mean ? Shouldn't this be fixed ?

This is a feature.

bbotte commented 4 years ago

I use swarm, docker Server Version: 19.03.6 , when update service docker service update --with-registry-auth service_name --image images_http_address tips: Error response from daemon: rpc error: code = Unknown desc = update out of sequence

will repair?

matodrobec commented 4 years ago

Hello,

I use swarm and Docker version 19.03.8. When I update stack

docker stack deploy --with-registry-auth -c ./web.yml api-dev

I am getting:

 Updating service api-dev_webserver (id: zz5veiexnvsgleroomr2snecy)
 Updating service api-dev_data (id: v4hoc42ajz5jv17rsf7uwlcgj)
 failed to update service api-dev_data: Error response from daemon: rpc error: code = Unknown desc = update out of sequence
mhemrg commented 4 years ago

I'm using Docker Swarm's API to update services and sometimes I'm facing this issue. Is it safe to retry the operation with a new index key when I get an update out of sequence error?

trajano commented 4 years ago

I am wondering if K8S has the same issue https://stackoverflow.com/questions/61737609/does-kubernetes-suffer-from-the-update-out-of-sequence-errors-that-docker-swar

trajano commented 4 years ago

I think this sort of synchronization issues would've been solved by traditional database systems or even Kafka. I wonder if a traditional database or Kafka for scale can be used for the state backend. That would at least defer the responsibility out of Swarm Kit.