moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

Swap allocator & scheduler prior to dispatcher #1477

Open mavenugo opened 8 years ago

mavenugo commented 8 years ago

In a few cases (such as the VLAN drivers), the network resources to allocate (IPAM in particular) depend on the node to which the task will be dispatched. In the current model, the allocator runs before the scheduler, so these network drivers cannot allocate node-level resources. Swapping the allocator and scheduler in the pipeline would let the allocation module use the scheduling decision and let the IPAM plugin control the allocation decision.

aaronlehmann commented 8 years ago

While this feature was temporarily disabled for 1.12, there is support for the scheduler to make scheduling decisions based on which network plugins are required by the network allocation.

For this to work properly in the proposed model, we might have to run tasks through the allocator twice (before and after scheduling).
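
As a rough illustration of what running a task through the allocator twice might look like, here is a minimal Go sketch. Every type and function in it (`Task`, `Node`, `preAllocate`, `schedule`, `postAllocate`, `runPipeline`) is hypothetical and much simpler than the real SwarmKit code: the first pass resolves node-independent requirements (which driver a network needs), the scheduler filters on that, and the second pass allocates a node-local address.

```go
package main

import "fmt"

// Task is a hypothetical, simplified stand-in for a SwarmKit task.
type Task struct {
	ID      string
	Network string // network the task attaches to
	Driver  string // filled in by the pre-scheduling pass
	NodeID  string // filled in by the scheduler
	Address string // filled in by the post-scheduling pass
}

// Node is a hypothetical node with a set of network drivers and a toy address pool.
type Node struct {
	ID      string
	Drivers map[string]bool
	nextIP  int
}

// preAllocate resolves what does not depend on the node: the driver the network needs.
func preAllocate(t *Task, driverFor map[string]string) {
	t.Driver = driverFor[t.Network]
}

// schedule picks the first node that offers the required driver.
func schedule(t *Task, nodes []*Node) error {
	for _, n := range nodes {
		if n.Drivers[t.Driver] {
			t.NodeID = n.ID
			return nil
		}
	}
	return fmt.Errorf("no node supports driver %q", t.Driver)
}

// postAllocate performs node-level allocation once the node is known.
func postAllocate(t *Task, nodes []*Node) error {
	for _, n := range nodes {
		if n.ID == t.NodeID {
			n.nextIP++
			t.Address = fmt.Sprintf("%s/10.0.0.%d", n.ID, n.nextIP)
			return nil
		}
	}
	return fmt.Errorf("task %s has no assigned node", t.ID)
}

// runPipeline chains the two allocation passes around the scheduling step.
func runPipeline(t *Task, nodes []*Node, driverFor map[string]string) error {
	preAllocate(t, driverFor)
	if err := schedule(t, nodes); err != nil {
		return err
	}
	return postAllocate(t, nodes)
}

func main() {
	nodes := []*Node{
		{ID: "node-a", Drivers: map[string]bool{"overlay": true}},
		{ID: "node-b", Drivers: map[string]bool{"overlay": true, "macvlan": true}},
	}
	task := &Task{ID: "t1", Network: "vlan10"}
	if err := runPipeline(task, nodes, map[string]string{"vlan10": "macvlan"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(task.NodeID, task.Address)
}
```

The sketch ignores persistence, retries, and task state transitions, which is where the real difficulty of a second allocation pass would lie.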

dongluochen commented 8 years ago

The allocator moves tasks to Allocated and the scheduler moves them to Assigned. Assigned tasks are visible to agents right away. This proposal would have an impact on this model. Can IPAM (or other services) be called by the scheduler to preserve the state machine?

stevvooe commented 8 years ago

There may also be situations involving volumes that require both PRE and POST scheduling allocation. I'm wondering if we've made a misstep by representing allocation as a pipeline step, when it really should be part of a lifecycle hook.

                    +---------------------------------+
                    |                                 |
                    |            Allocator            |
                    |                                 |
                    |      +-------------------+      |
                    |      |                   |      |
                    |      |                   |      |
+----------------+  |      |  +-------------+  |      |  +--------------+
|                |  |      |  |             |  |      |  |              |
|                |  |      |  |             |  |      |  |              |
|  Orchestrator  +------------>  Scheduler  +------------>  Dispatcher  |
|                |  |      |  |             |  |      |  |              |
|                |  |      |  |             |  |      |  |              |
+----------------+  |      |  +-------------+  |      |  +--------------+
                    |      |                   |      |
                    |      |                   |      |
                    |      |                   |      |
                    +------+                   +------+

> The allocator moves tasks to Allocated and the scheduler moves them to Assigned. Assigned tasks are visible to agents right away. This proposal would have an impact on this model. Can IPAM (or other services) be called by the scheduler to preserve the state machine?

This can be mitigated through desired state control. If the state is assigned and the desired state is assigned, the agent won't do anything with the task.
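
A minimal Go sketch of that desired-state check, under the assumption that states are ordered and that an agent only advances a task whose observed state is behind its desired state. The `TaskState` constants, the `Task` struct, and `shouldAdvance` are hypothetical simplifications, not the SwarmKit API.

```go
package main

import "fmt"

// TaskState is a hypothetical ordered subset of task states.
type TaskState int

const (
	StateNew TaskState = iota
	StatePending
	StateAssigned
	StateAccepted
	StatePreparing
	StateReady
	StateStarting
	StateRunning
)

// Task carries both the observed state and the state the manager wants.
type Task struct {
	ID           string
	State        TaskState
	DesiredState TaskState
}

// shouldAdvance reports whether the agent should move the task forward.
// When State has already reached DesiredState, the agent leaves it alone.
func shouldAdvance(t Task) bool {
	return t.State < t.DesiredState
}

func main() {
	held := Task{ID: "t1", State: StateAssigned, DesiredState: StateAssigned}
	released := Task{ID: "t1", State: StateAssigned, DesiredState: StateRunning}

	fmt.Println("held task advances:", shouldAdvance(held))
	fmt.Println("released task advances:", shouldAdvance(released))
}
```

Under this assumption, post-scheduling allocation could finish while the task sits at Assigned, and the manager would raise the desired state only afterwards.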

mbdas commented 8 years ago

In general, can we have hooks to decorate the Docker container runtime options applied on the node itself, with safeguards so they do not conflict with certain resource options? SwarmKit has a concept of executors similar to Mesos, and it would be nice to have some extension points in the context of Docker to get some flexibility.

ghost commented 8 years ago

+1

Being able to support MACVLAN and IPVLAN in SwarmKit would help enterprise users adopt SwarmKit-based orchestration systems. Agnostic on the specific implementation, i.e., @stevvooe's pre/post hooks vs @mavenugo's proposal to swap the order of the two operations.

stevvooe commented 8 years ago

> proposal to swap the order of the two operations.

I think the statically pipelined allocator model will always break. We need to look at allocation as something that can happen in the course of regular operations. The pipelined allocation model creates odd dependency loops that lead to weird solutions.

Adirio commented 6 years ago

Status?

I think that if allocation could happen after scheduling, we could have a resource allocator that tracks the resources reserved by containers and uses that information to improve the scheduling algorithm, so that tasks whose resource requirements exceed a node's available resources are not assigned to that node. I don't think the manager considers this today, and it could allow for more optimized orchestration.
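
The check described here could be sketched in Go roughly as follows. `Resources`, `Node`, `fits`, and `candidates` are hypothetical names used only for illustration; this says nothing about what the manager does or does not already implement.

```go
package main

import "fmt"

// Resources is a hypothetical resource vector.
type Resources struct {
	NanoCPUs    int64
	MemoryBytes int64
}

// Node tracks total capacity and what running tasks have already reserved.
type Node struct {
	ID       string
	Total    Resources
	Reserved Resources
}

// available returns the capacity not yet reserved on the node.
func (n Node) available() Resources {
	return Resources{
		NanoCPUs:    n.Total.NanoCPUs - n.Reserved.NanoCPUs,
		MemoryBytes: n.Total.MemoryBytes - n.Reserved.MemoryBytes,
	}
}

// fits reports whether the request can be satisfied by the node's free capacity.
func fits(n Node, req Resources) bool {
	a := n.available()
	return req.NanoCPUs <= a.NanoCPUs && req.MemoryBytes <= a.MemoryBytes
}

// candidates returns the IDs of nodes that can host a task with the given request.
func candidates(nodes []Node, req Resources) []string {
	var ids []string
	for _, n := range nodes {
		if fits(n, req) {
			ids = append(ids, n.ID)
		}
	}
	return ids
}

func main() {
	const gib = int64(1) << 30
	nodes := []Node{
		{ID: "node-a", Total: Resources{4e9, 8 * gib}, Reserved: Resources{3e9, 7 * gib}},
		{ID: "node-b", Total: Resources{4e9, 8 * gib}, Reserved: Resources{1e9, 2 * gib}},
	}
	req := Resources{NanoCPUs: 2e9, MemoryBytes: 4 * gib}
	fmt.Println(candidates(nodes, req))
}
```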

Related to implementation: there are two important states for each allocator.

  1. Start: when the allocator can be executed.
  2. Finish: when the allocator needs to have finished.

In the current network allocator these states are New and Assigned, if I'm not mistaken, since the scheduler does not need the network to be allocated and the allocator does not need the scheduling decision.

So, if allocation ran outside the main pipeline, could parallelization yield an optimization? That is, allocate the network while the scheduler is selecting the target node.
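
A minimal Go sketch of that idea, assuming the allocation does not depend on the chosen node (the case described above, not the VLAN case from the original report). `allocateNetwork`, `pickNode`, and `prepare` are hypothetical placeholders; the only real API used is `sync.WaitGroup`.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical, simplified task record.
type Task struct {
	ID      string
	NodeID  string
	Address string
}

// allocateNetwork stands in for a node-independent network allocation.
func allocateNetwork(t *Task) string {
	return fmt.Sprintf("addr-for-%s", t.ID)
}

// pickNode stands in for the scheduler's node selection.
func pickNode(t *Task) string {
	return fmt.Sprintf("node-for-%s", t.ID)
}

// prepare runs allocation and scheduling concurrently. Both must finish
// (the allocator's "Finish" state) before the task may be handed on.
func prepare(t *Task) {
	var (
		wg   sync.WaitGroup
		addr string
		node string
	)
	wg.Add(2)
	go func() {
		defer wg.Done()
		addr = allocateNetwork(t) // only reads t
	}()
	go func() {
		defer wg.Done()
		node = pickNode(t) // only reads t
	}()
	wg.Wait()

	// Write the results back only after both goroutines are done.
	t.Address, t.NodeID = addr, node
}

func main() {
	task := &Task{ID: "t1"}
	prepare(task)
	fmt.Println(task.NodeID, task.Address)
}
```

Each goroutine writes to its own local variable and the task is updated only after `wg.Wait()`, so the two steps do not race on the task record.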

Old model:

                    MANAGER NODE (LEADER)                           WORKER NODE
_______________ ______________ ______________ ______________     _______________
\              \              \              \              \    \              \
 | Orchestrator |  Allocator   |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/______________/    /______________/
               |              |              |                           |
            /--+--\      /----+----\   /-----+-----\               /-----+-----\
            | New |      | Pending |   | Assigned  |               | Accepted  |
            \-----/      \---------/   \-----------/               | Preparing |
                                                                   | Ready     |
                                                                   | Starting  |
                                                                   | Running   |
                                                                   | Complete  |
                                                                   | Shutdown  |
                                                                   | Failed    |
                                                                   | Rejected  |
                                                                   | Remove    |
                                                                   | Orphaned  |
                                                                   \-----------/

Proposed model:

             MANAGER NODE (LEADER)                   WORKER NODE   
_______________ ______________ ______________     _______________
\              \  (allocator) \              \    \              \
 | Orchestrator |  Scheduler   |  Dispatcher  |    |    Agent     |
/______________/______________/______________/    /______________/
               |              |                           |
            /--+--\     /-----+-----\               /-----+-----\
            | New |     | Scheduled |               | Accepted  |
            \-----/     \-----------/               | Preparing |
                                                    | Ready     |
                                                    | Starting  |
                                                    | Running   |
                                                    | Complete  |
                                                    | Shutdown  |
                                                    | Failed    |
                                                    | Rejected  |
                                                    | Remove    |
                                                    | Orphaned  |
                                                    \-----------/

olljanat commented 6 years ago

@mavenugo as far as I understand, this should already be handled in https://github.com/docker/swarmkit/blob/master/manager/scheduler/filter.go#L139-L183, which checks that the node has the needed plugins.

Or was there some other thing which I forgot?