moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

[Proposal] Resolving the IP address contention issue #2407

Closed nishanttotla closed 6 years ago

nishanttotla commented 7 years ago

What is the Issue?

When tasks are removed from SwarmKit, they are immediately deleted from the object store, along with all the resources allocated to them (including IP addresses). But the actual task containers can take much longer to shut down as they finish their shutdown procedures, so those IP addresses may still be in use on a node even though the allocator considers them free. If new tasks are created in the meantime, they may be allocated the same IP addresses, leading to a conflict, or at least to an inconsistent state until the old tasks finish shutting down.

Existing Mitigations

Two fixes were recently added to alleviate this problem:

  1. Recently freed IP addresses are put at the end of the allocation queue, so that it's less likely that a just-released address is immediately allocated to a new task. For deployments with rapid creation of new tasks, or with small subnets, this doesn't help much.
  2. On the overlay network, if the same IP ends up being used twice for some reason, the conflict is handled so that the configuration in the kernel stays consistent across nodes. This helps when the subnet is almost exhausted, but it hides the fact that addresses are being reused, which might make the network harder to debug if issues do arise.

While these mitigations are reasonable, they don't fix the root cause, and don't guarantee that the issue will not come up.

Long Term Fix

The long term fix involves not freeing up resources immediately when the task gets removed. This is a rough draft of the fix, which might evolve as it becomes clearer:

When a task is removed,

  1. Update desired state for the task to REMOVED (this is a new task status), instead of removing it from the object store
  2. Dispatcher tells the agent that desired state has been updated to removed, and the agent is responsible for going through task shutdown
  3. The task reaper only removes the task when desired state is REMOVED and actual state > RUNNING (the task reaper is the only place where a task should be removed)
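
To make step 3 concrete, here is a minimal, self-contained sketch of the proposed reaper check. All names and numeric values are placeholders (swarmkit's real task states live in its api package and are ordered numerically); TaskStateRemoved stands in for the new status proposed above.

```go
package sketch

// Hypothetical sketch of the proposed reaper rule; names and values are
// illustrative only, not swarmkit's real api package definitions.
type TaskState int

const (
	TaskStateRunning  TaskState = 300 // illustrative values only
	TaskStateShutdown TaskState = 400
	TaskStateRemoved  TaskState = 500 // the new desired state from this proposal
)

type Task struct {
	DesiredState TaskState // set by the orchestrator
	ActualState  TaskState // reported by the agent
}

// shouldReap returns true only when the task has been marked for removal and
// the agent has already moved it past RUNNING, i.e. it has actually stopped.
// Per the proposal, the task reaper is the only component that deletes tasks.
func shouldReap(t *Task) bool {
	return t.DesiredState == TaskStateRemoved && t.ActualState > TaskStateRunning
}
```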

We may or may not revert the mitigations once the long term fix is in, but we should test without the mitigations. Additionally, we will need to decide how the API deals with REMOVED tasks (whether they are returned by default or not). But we can get to these questions later.

anshulpundir commented 7 years ago

Spoke with @nishanttotla about this offline, but capturing here too: we should look at the different states available and see how it all works together with the newly proposed REMOVED state. Maybe there is an opportunity to consolidate behavior, or maybe this can be named something more generic like CLEANUP, which is an indication to the task reaper to clean this task up.

Also:

Dispatcher tells the agent that desired state has been updated to removed

Does the dispatcher tell the agent, or is this done indirectly through a watch?

nishanttotla commented 7 years ago

Based on https://github.com/docker/swarmkit/pull/2408#issuecomment-336647568, I'm also wondering if we can somehow utilize the ORPHANED state to free up resources, instead of defining a new state like @anshulpundir pointed out above.

cc @aaronlehmann

anshulpundir commented 7 years ago

Maybe this could work. We could potentially rename ORPHANED to something more generic, e.g. CLEANUP, and use it for all the cases where a cleanup is required.

BTW, when nodes are removed, why are they not kept around as part of the history? I'm trying to understand swarmkit's history policy. @nishanttotla @aaronlehmann

aaronlehmann commented 7 years ago

Aside from some corner cases like deleting a node, we don't remove tasks before they shut down. The current general flow is that the orchestrator sets the desired state to SHUTDOWN, the agent shuts the task down and reports the new state, and the task reaper later deletes old tasks according to the task history retention limit.

The "orphaned" state only comes into play when the node is unresponsive, and doesn't report that the task has shut down. To avoid holding onto the resources for ever, they get "timed out", I think after 24 hours.

I vaguely remember hearing about a race on the worker side, where it would report that the task had shut down before it truly freed the network resources associated with it. It had something to do with libnetwork doing this resource deallocation asynchronously. So that may be the real source of the problem. I don't think we need any additional states to make the resource deallocation robust. We just need to make sure that tasks are not deleted prematurely, and that resources are not still in use on the worker side when the worker reports that a task has shut down.
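
A rough sketch of the ordering described above, under the assumption that the fix is simply to finish network deallocation before reporting shutdown; stopContainer, releaseNetwork, and reportShutdown are placeholder callbacks, not real swarmkit or libnetwork APIs.

```go
package sketch

import "context"

// Hypothetical illustration of the ordering discussed above: the worker should
// finish releasing network resources before it reports the task as shut down,
// instead of letting that cleanup happen asynchronously after the report.
func shutDownTask(
	ctx context.Context,
	stopContainer func(context.Context) error,
	releaseNetwork func(context.Context) error,
	reportShutdown func(context.Context) error,
) error {
	if err := stopContainer(ctx); err != nil {
		return err
	}
	// Wait for deallocation to complete; reporting SHUTDOWN before this point
	// is the race described above, where the manager may hand the IP address
	// to a new task while it is still configured on this node.
	if err := releaseNetwork(ctx); err != nil {
		return err
	}
	return reportShutdown(ctx)
}
```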

aaronlehmann commented 7 years ago

Another source of problems could be scale-down. We delete tasks immediately in this case. That was done because otherwise the no-longer-alive tasks would clutter the docker service ps output indefinitely. It might be better to mark those tasks as scaled-down in some way, and have the docker service ps command ignore them.

anshulpundir commented 7 years ago

Clarified the use-case with @aluzzardi just now. There are two, both for replicated services: 1. service delete, and 2. scale-down.

  1. Service deletes: I see why we delete tasks right away in this case: if we only mark the tasks for shutdown, there is no way for the reaper to clean them up, so they are removed directly. Although I am not sure how adding a new state would help here.

  2. Scale-down, per @aaronlehmann's explanation above. I agree with the proposal: decluttering docker service ps output is not a strong enough motivation to design the system this way, and it should be handled another way, likely by letting tasks shut down before cleanup.

@nishanttotla

anshulpundir commented 7 years ago

Possible solution for service deletes: we delete the service object and set the desired state for its tasks to shutdown. Then, when the current state changes to shutdown, we mark them orphaned (since the service is already gone).

For scale-down, we also set the desired state to shutdown. TBD if/how to handle docker service ps output.

aaronlehmann commented 7 years ago

Overall that sounds good to me. I'm not sure about marking the tasks orphaned though. Orphaned is used to override the current state when it doesn't update in time. If it does update to shutdown, why do you need to mark it orphaned?

anshulpundir commented 7 years ago

If it does update to shutdown, why do you need to mark it orphaned?

In the case of a service delete, my understanding is that otherwise these tasks would never be deleted, since the task reaper can only look at existing services. Even for the scale-down case, the task reaper will not clean up a slot that has been scaled down. Please correct me if I am wrong. @aaronlehmann

aaronlehmann commented 7 years ago

I don't think setting a service task to orphaned will cause it to get deleted either. The task reaper would need to deal with this case where the service doesn't exist anymore.


anshulpundir commented 7 years ago

The task reaper would need to deal with this case where the service doesn't exist anymore.

Completely agree. This would be the clean way of doing it. Setting the task to orphaned in this case is hacky.

However, I see that the task reaper listens for task update events. So, when a task is set to orphaned, I believe it should be cleaned up by the task reaper.

aaronlehmann commented 7 years ago

However, I see that the task reaper listens for task update events. So, when a task is set to orphaned, I believe it should be cleaned up by the task reaper.

I don't agree, because tasks shouldn't disappear from task history just because the node where they were running went down for a long period of time (which is the original purpose of the orphaned state).

I don't think scale down or service delete has anything to do with orphaned tasks.

anshulpundir commented 7 years ago

I don't agree, because tasks shouldn't disappear from task history just because the node where they were running went down for a long period of time (which is the original purpose of the orphaned state).

OK yea, I see your point.

nishanttotla commented 7 years ago

Thanks for the comments, @aaronlehmann and @anshulpundir. So to summarize, what we need to do here is:

  1. For service deletion, we delete the service and set the desired state for its tasks to SHUTDOWN. Then, the task reaper needs to be updated so that it can handle tasks for which the service (I'm assuming the tasks still have a reference to the service) doesn't exist anymore.

  2. For service scale-down, we need to be able to mark tasks as scaled down (should this be done through a new task state, or is there a better way?). We don't want to do what we're doing for service deletion, because the service actually still exists, and we also want to avoid the ORPHANED state based on the above discussion. Alternatively, does it make sense to change how this works, by keeping tasks around and having the task reaper clean them up based on the task retention limit? We could do this and then change the API to ignore these tasks, but even then we need some way of distinguishing these tasks so they don't clutter the output.

I still see the usefulness of having a new state, even though it's not strictly necessary. A task that is in the process of shutting down but needs to be removed afterwards is a unique situation not really captured by any of the existing states.

anshulpundir commented 7 years ago

For service scale-down, we need to be able to mark tasks as scaled down (should this be done through a new task state, or is there a better way?). We don't want to do what we're doing for service deletion, because the service actually still exists, and we also want to avoid the ORPHANED state based on the above discussion. Alternatively, does it make sense to change how this works, by keeping tasks around and having the task reaper clean them up based on the task retention limit? We could do this and then change the API to ignore these tasks, but even then we need some way of distinguishing these tasks so they don't clutter the output.

We can also set the desired state to shutdown here. When the task reaper receives the task update event for this task and the current state is shutdown, it can decide to remove this task if that slot is no longer active with the service (is there a way to check this?). This only works if slots are guaranteed not to be reused.
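
A minimal sketch of that check, assuming a hypothetical activeSlots lookup derived from the service's replica count; none of these helpers exist in swarmkit today.

```go
package sketch

// Hypothetical sketch of the idea above: on a task update event, remove the
// task once it has reached SHUTDOWN and its slot is no longer one of the
// service's active slots. This only works if slot numbers are never reused.
func maybeReapOnUpdate(
	slot uint64,
	reachedShutdown bool,
	activeSlots map[uint64]bool, // would be derived from the service's replica count
	removeTask func(),
) {
	if reachedShutdown && !activeSlots[slot] {
		removeTask()
	}
}
```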

nishanttotla commented 7 years ago

@anshulpundir it's worth looking at how to tell which slots are in use. I suspect it isn't hard, but I don't know. That said, I don't think this is a very robust way either, because if there is an immediate scale up after a scale down, then a slot might actually get reused quickly, leading to the scaled down task remaining there for a long time.

anshulpundir commented 7 years ago

That said, I don't think this is a very robust way either, because if there is an immediate scale up after a scale down, then a slot might actually get reused quickly, leading to the scaled down task remaining there for a long time.

Yes, like I said, it only works if slot IDs are guaranteed not to be reused.

anshulpundir commented 7 years ago

BTW, we should also look at making the task reaper robust enough to clean out tasks for a slot that is no longer in use, which, to me, seems like a more robust approach. @nishanttotla

nishanttotla commented 7 years ago

Based on an offline discussion, it seems reasonable to first tackle the case of service deletion since that is more straightforward, and involves a task reaper change that is even otherwise useful. Meanwhile, we'll continue to discuss the design for the case of service scale downs.

cc @marcusmartins

aaronlehmann commented 7 years ago

We can also set the desired state to shutdown here. When the task reaper receives the task update event for this task and the current state is shutdown, it can decide to remove this task if that slot is no longer active with the service (is there a way to check this?).

That makes sense to me. I can't think of an existing way to check if a task was sacrificed during a scale-down operation and therefore should be removed once it finishes shutting down, so something like this would have to be designed. It could be as simple as a flag on the Task structure. Maybe check with @aluzzardi - I'm sure he has thoughts on this.

I was thinking of suggesting the alternative of leaving the tasks to go through the normal task reaper flow, and hiding the scaled-down ones in the UI, but thinking about it some more, if there's no reliable way to find scaled-down tasks in the task reaper, there wouldn't be a reliable way to filter them in the UI, either. So either way it seems we'd need a new mechanism to identify these tasks, and then we might as well handle it internally in the task reaper because otherwise it would push out more responsibilities to API users.

aaronlehmann commented 7 years ago

Another possible idea is introducing a new state, after shutdown and orphaned, called TaskStateRemove. The orchestrator would set the desired state to TaskStateRemove, meaning that the task should proceed to shut down, and then once it does, it should be removed. It would be very easy to change the task reaper so that if it sees tasks with desired state Remove and current state >= Shutdown, it removes them right away.

I think this is what @nishanttotla was suggesting earlier when he mentioned adding a new state.

I like this idea because it fits neatly with the task lifecycle model and the idea of desired state.

aluzzardi commented 7 years ago

Agreed with @aaronlehmann - that was the original proposal I made offline: introducing a new desired state “remove”.

1) This should pretty much work out of the box on the agent side: it’s >RUNNING, so it should shut it down and report back, with the additional caveat that the engine will delete it.
2) Not many changes; we just need to make sure no one ever deletes a task directly (e.g. scale down) but instead changes the state to remove.
3) The reaper (or something else, but probably the reaper) will take care of removing tasks where desired == remove && current > running.
4) If a node is down and the agent doesn’t report back, tasks will be marked as orphaned (which is still greater than running, so the above condition is still true).

I believe it is the only option. We can’t use “shutdown” because we do not want to remove tasks that are shut down. Shutdown and Remove must have a different behavior, so I don’t believe they can be merged into a single state.

anshulpundir commented 7 years ago

I believe it is the only option. We can’t use “shutdown” because we do not want to remove tasks that are shut down. Shutdown and Remove must have a different behavior, so I don’t believe they can be merged into a single state.

From an orchestrator point of view, the behavior between 'shutdown' and the new proposed state 'remove' is not different. Since the orchestrator and only the orchestrator assigns states, I think it makes sense to design states from the orchestrator point of view.

aluzzardi commented 7 years ago

From an orchestrator point of view, the behavior between 'shutdown' and the new proposed state 'remove' is not different.

I'm not sure I understand.

Let's take service foo. It has 2 instances.

1) Slot 1 has 2 tasks; that's due to a service update going from one version of the image to another. Therefore, the newer task has a desired state of running while the older task has a desired state of shutdown.
2) Slot 2 has one task.

Service ps looks like this:

ID                  NAME                IMAGE               NODE                    DESIRED STATE       CURRENT STATE             ERROR               PORTS
ozuja9t0x19u        foo.1               nginx:1             linuxkit-025000000001   Running             Running 18 seconds ago
jl8e8tnq7arq         \_ foo.1           nginx:latest        linuxkit-025000000001   Shutdown            Shutdown 19 seconds ago
c8xfyqwotc8r        foo.2               nginx:1             linuxkit-025000000001   Running             Running 5 seconds ago

Now let's assume we scale back to 1 replica (therefore we want to remove the task in the second slot).

If we mark that task (foo.2) as shutdown like so:

ID                  NAME                IMAGE               NODE                    DESIRED STATE       CURRENT STATE             ERROR               PORTS
ozuja9t0x19u        foo.1               nginx:1             linuxkit-025000000001   Running             Running 18 seconds ago
jl8e8tnq7arq         \_ foo.1           nginx:latest        linuxkit-025000000001   Shutdown            Shutdown 19 seconds ago
c8xfyqwotc8r        foo.2               nginx:1             linuxkit-025000000001   Shutdown             Running 5 seconds ago

How's the reaper going to remove foo.2 from the system while not removing foo.1 (jl8e8tnq7arq) as well?

What's the difference between the two tasks?

anshulpundir commented 7 years ago

Chatted with @aluzzardi offline. While the task reaper behavior is different, I was assuming that, from the orchestrator point of view, both 'shutdown' and 'remove' would have the same effect, which is to shut down the task through the agent. I wasn't aware that the agent behavior is also different: 'shutdown' doesn't remove the container, whereas 'remove' should. So I see that adding a new state for scale-down makes sense.

stevvooe commented 6 years ago

So I see that adding a new state for scale-down makes sense.

A new state is not needed here. There is already a shutdown state. Remove is implicit by removing from the assignment set. When the task is removed from the assignment set, Remove gets called. This ensures that whether or not the orchestrator knows about a task, the agent removes its resources.

Since the orchestrator and only the orchestrator assigns states, I think it makes sense to design states from the orchestrator point of view.

This is not true, at all. Actually, task state is assigned and observed by the agent once the state is greater than ASSIGNED (you can see my talk on swarmkit for details on field ownership).

We actually had a REMOVE state originally, but it caused problems with the assignment set, since it is essentially the same thing. The issue here is that the orchestrator can't wait forever while an agent may or may not remove a task. This can't be reconciled cleanly without set-based semantics.

It seems like the tasks are being removed too aggressively in the orchestrator, so we should add functionality to ensure they progress through the dealloc lifecycle before being removed from storage.

nishanttotla commented 6 years ago

Remove is implicit by removing from the assignment set. When the task is removed from the assignment set, Remove gets called. This ensures that whether or not the orchestrator knows about a task, the agent removes its resources.

Can you elaborate on this? Who removes the task from the assignment set?

stevvooe commented 6 years ago

Can you elaborate on this? Who removes the task from the assignment set?

The dispatcher, on behalf of the orchestrator. When the agent finds that it has a task outside the current assignment set, it will remove the task from the local node. This ensures that the agent can collect tasks that the orchestrator no longer has knowledge of.
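
A simplified sketch of that reconciliation, with placeholder parameters standing in for swarmkit's real dispatcher/agent types; only the shape of the logic is meant to match the description above.

```go
package sketch

import "context"

// Hypothetical sketch of the assignment-set reconciliation described above:
// the dispatcher sends the agent its current set of task assignments, and the
// agent removes any task it is running locally that is no longer in that set,
// even if the orchestrator has no remaining knowledge of it.
func reconcileAssignments(
	ctx context.Context,
	assigned map[string]bool, // task IDs in the current assignment set
	localTaskIDs []string, // task IDs the agent is currently managing
	removeLocally func(ctx context.Context, taskID string) error,
) error {
	for _, id := range localTaskIDs {
		if assigned[id] {
			continue
		}
		// Task fell out of the assignment set: shut it down and free its
		// resources (container, network endpoints, etc.).
		if err := removeLocally(ctx, id); err != nil {
			return err
		}
	}
	return nil
}
```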

nishanttotla commented 6 years ago

Quoting from @stevvooe's comment https://github.com/docker/swarmkit/pull/2446#issuecomment-345102994 here so that we have full context in this issue:

As I stated in the other issue, adding a new state for removal doesn't make a whole lot of sense. There is already the functionality of removing the task from the assignment set that activates the removal workflow. We can tell that this state is unnecessary, because it is just a noop on the agent and no actual removal is happening. States are meant to set a target for a state and then the agent achieves that target.

The main problem that is addressed here is that the orchestrator needs to confirm shutdown (some sort of ack) before releasing the ip address to another container. If a task is reported as COMPLETED or SHUTDOWN, the ip address should be released. However, it seems we remove tasks before confirming their final state and that is likely the root cause of the bug.

This is correct.

What I am not understanding with this PR is why you think you need a removal state. From the changes to the agent, it is clear that the shutdown (or completed, if it is already down) state is sufficient.

The primary need for a new state (also discussed in some comments above) seems to be coming from the case of service scale down, where we wish to remove tasks, but also distinguish them from shutdown tasks that would otherwise be kept around in accordance with the task retention limit. For the case of service removal, I agree that we can work with the existing task model and have the task reaper handle this problem.

The correct solution here is to remove these tasks from the assignment set, so they are not sent down to the agent. The agent will proceed with the remove. You can see part of this logic here. If you want acknowledgement of cleanup, you don't need to do this. Simply set the target state to shutdown, and the task will get shutdown and release resources (like ip address). You may also be able to use the existing ORPHANED or DEAD desired state to signal the task reaper to actually remove the task.

I am going over the assignment set code and will likely have more questions about this. But at a high level, what I do not understand is how the orchestrator can indicate to the task reaper that a task should be removed once it undergoes proper shutdown, as distinguished from a task that is intended to be kept around even after shutdown.

Before proceeding with a PR, it would be good to fully complete the design discussion so that we can find a proper solution.

Ping @anshulpundir @aluzzardi @aaronlehmann @dperny. Also here is the design we agreed on before: https://github.com/docker/swarmkit/issues/2407#issuecomment-340672951

stevvooe commented 6 years ago

Ping @anshulpundir @aluzzardi @aaronlehmann @dperny. Also here is the design we agreed on before: #2407 (comment)

Great, but after reviewing, I brought forth several concerns.

The trouble I see here is that the primitives are already available in the agent. Task state is about the orchestrator-agent communication, but here we are, using it to ascertain state in the orchestrator. Something seems wrong about that change, and I have concerns about possible side effects. In fact, as pointed out in the PR, the agent can never actually act on the REMOVE state, as it never removes the task until it is removed from the assignment set.

Why can't we verify with the network allocator that tasks in shutdown are first freed of their resources before being considered for reaping? Effectively, it seems like we should only consider tasks for reaping once they have their resources fully released. This may require first-class state to ensure they are reaped, but that seems like the right place to look.
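
A minimal sketch of that gating, assuming a hypothetical resourcesReleased query against the allocator's state; no such API exists in swarmkit today.

```go
package sketch

// Hypothetical sketch of the suggestion above: a task in SHUTDOWN only becomes
// eligible for reaping once the network allocator confirms that its resources
// (IP addresses, endpoints) have been fully released.
func eligibleForReaping(taskID string, shutDown bool, resourcesReleased func(taskID string) bool) bool {
	return shutDown && resourcesReleased(taskID)
}
```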

friism commented 6 years ago

https://github.com/docker/swarmkit/pull/2461 (carry of https://github.com/docker/swarmkit/pull/2446) has been merged - does that mean this issue has been addressed and can be closed?

nishanttotla commented 6 years ago

@friism we have addressed the root cause of the issue. We can close this issue now (there will still be some more work on testing).