mavenugo opened this issue 8 years ago
While this feature was temporarily disabled for 1.12, there is support for the scheduler to make scheduling decisions based on which network plugins are required by the network allocation.
For this to work properly in the proposed model, we might have to run tasks through the allocator twice (before and after scheduling).
The allocator moves tasks to Allocated and the scheduler moves them to Assigned. Assigned tasks are visible to agents right away. This proposal would impact that model. Can IPAM (or other services) be called by the scheduler to preserve the state machine?
There may also be situations involving volumes which require PRE and POST scheduling allocation. I'm wondering if we've made a misstep by representing allocation as a pipeline step, when it really should be part of a lifecycle hook.
+---------------------------------+
| |
| Allocator |
| |
| +-------------------+ |
| | | |
| | | |
+----------------+ | | +-------------+ | | +--------------+
| | | | | | | | | |
| | | | | | | | | |
| Orchestrator +------------> Scheduler +------------> Dispatcher |
| | | | | | | | | |
| | | | | | | | | |
+----------------+ | | +-------------+ | | +--------------+
| | | |
| | | |
| | | |
+------+ +------+
The allocator moves tasks to Allocated and the scheduler moves them to Assigned. Assigned tasks are visible to agents right away. This proposal would impact that model. Can IPAM (or other services) be called by the scheduler to preserve the state machine?
This can be mitigated through desired state control. If the state is assigned and the desired state is assigned, the agent won't do anything with the task.
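That desired-state guard can be sketched as below; the state names and the shouldAct helper are made up for illustration and are not the actual swarmkit agent code:

```go
package main

import "fmt"

// TaskState is a pared-down, ordered task state for illustration.
type TaskState int

const (
	StateAssigned TaskState = iota
	StateRunning
)

// Task carries both the observed state and the desired state set by the manager.
type Task struct {
	State        TaskState // where the task is now
	DesiredState TaskState // where the orchestrator wants it
}

// shouldAct reports whether the agent needs to do anything with the task:
// if the task has already reached its desired state, the agent stands by.
func shouldAct(t Task) bool {
	return t.State < t.DesiredState
}

func main() {
	parked := Task{State: StateAssigned, DesiredState: StateAssigned}
	ready := Task{State: StateAssigned, DesiredState: StateRunning}
	fmt.Println(shouldAct(parked), shouldAct(ready)) // the agent ignores parked, acts on ready
}
```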
In general, can we have hooks to decorate the Docker container runtime options executed on the node itself, with safeguards so they don't conflict with certain resource options? SwarmKit has a concept of executors, similar to Mesos, and it would be nice to have some extension points in the context of Docker to get some flexibility.
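One way such a safeguarded decoration hook might look is sketched below. The ContainerSpec fields, the Decorator type, and the choice of "image" as a reserved option are all assumptions for illustration, not an existing swarmkit extension point:

```go
package main

import "fmt"

// ContainerSpec is a minimal stand-in for the runtime options an executor
// passes to the container runtime on the node.
type ContainerSpec struct {
	Image string
	Env   []string
}

// Decorator is a hypothetical node-local hook that may adjust runtime
// options before the container starts.
type Decorator func(spec *ContainerSpec) error

// applyDecorators runs each hook, with a safeguard that reserved options
// (here, the image) cannot be overridden by a decorator.
func applyDecorators(spec *ContainerSpec, decorators []Decorator) error {
	for _, d := range decorators {
		image := spec.Image
		if err := d(spec); err != nil {
			return err
		}
		if spec.Image != image {
			return fmt.Errorf("decorator modified reserved option: image")
		}
	}
	return nil
}

func main() {
	spec := &ContainerSpec{Image: "nginx:latest"}
	addEnv := func(s *ContainerSpec) error {
		s.Env = append(s.Env, "VLAN_ID=100") // benign decoration
		return nil
	}
	overrideImage := func(s *ContainerSpec) error {
		s.Image = "other:latest" // conflicts with a reserved option
		return nil
	}
	fmt.Println(applyDecorators(spec, []Decorator{addEnv}))
	fmt.Println(applyDecorators(spec, []Decorator{overrideImage}))
}
```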
+1
Being able to support MACVLAN and IPVLAN in SwarmKit would help enterprise users adopt SwarmKit-based orchestration systems. Agnostic on the specific implementation, i.e., @stevvooe's pre/post hooks vs @mavenugo's proposal to swap the order of the two operations.
I think the statically pipelined allocator model will always break. We need to look at allocation as something that can happen in the course of regular operations. The pipelined allocation model creates odd dependency loops that lead to weird solutions.
Status?
I think that if allocation could happen after scheduling, we could have a resource allocator that takes into account the resources reserved by containers and uses that information to improve the scheduling algorithm, by not assigning tasks whose resource requirements exceed the resources available on the node. I don't think this is currently considered in the manager, and it could allow for more optimized orchestration.
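The resource-aware filtering being suggested boils down to a fit check like the one below. This is a sketch of the idea only; the Resources fields, Node shape, and first-fit pickNode policy are illustrative assumptions, not the manager's actual scheduler code:

```go
package main

import "fmt"

// Resources is a minimal model of what a task reserves and a node has free.
type Resources struct {
	CPUs     int64
	MemoryMB int64
}

type Node struct {
	Name      string
	Available Resources // remaining after existing reservations
}

// fits reports whether a task's reservation fits in a node's remaining
// resources; a scheduler can use this to skip overcommitted nodes.
func fits(req Resources, n Node) bool {
	return req.CPUs <= n.Available.CPUs && req.MemoryMB <= n.Available.MemoryMB
}

// pickNode returns the first node with enough headroom (first-fit policy).
func pickNode(req Resources, nodes []Node) (Node, bool) {
	for _, n := range nodes {
		if fits(req, n) {
			return n, true
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{"node-a", Resources{CPUs: 1, MemoryMB: 512}},
		{"node-b", Resources{CPUs: 4, MemoryMB: 4096}},
	}
	n, ok := pickNode(Resources{CPUs: 2, MemoryMB: 1024}, nodes)
	fmt.Println(ok, n.Name) // node-a is too small, so node-b is chosen
}
```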
Related to implementation: there are two important states for each allocator. In the current network allocator these states are New and Assigned, if I'm not mistaken, as the scheduler does not need the network to be allocated, nor does the allocator need the scheduling decision.
So, if allocation ran outside the main pipeline, could this allow an optimization through parallelization? That is, allocate the network while the scheduler is selecting the target node.
MANAGER NODE (LEADER) WORKER NODE
_______________ ______________ ______________ ______________ _______________
\ \ \ \ \ \ \
| Orchestrator | Allocator | Scheduler | Dispatcher | | Agent |
/______________/______________/______________/______________/ /______________/
| | | |
/--+--\ /----+----\ /-----+-----\ /-----+-----\
| New | | Pending | | Assigned | | Accepted |
\-----/ \---------/ \-----------/ | Preparing |
| Ready |
| Starting |
| Running |
| Complete |
| Shutdown |
| Failed |
| Rejected |
| Remove |
| Orphaned |
\-----------/
MANAGER NODE (LEADER) WORKER NODE
_______________ ______________ ______________ _______________
\ \ (allocator) \ \ \ \
| Orchestrator | Scheduler | Dispatcher | | Agent |
/______________/______________/______________/ /______________/
| | |
/--+--\ /-----+-----\ /-----+-----\
| New | | Scheduled | | Accepted |
\-----/ \-----------/ | Preparing |
| Ready |
| Starting |
| Running |
| Complete |
| Shutdown |
| Failed |
| Rejected |
| Remove |
| Orphaned |
\-----------/
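Since neither step depends on the other's result in this scenario, the parallelization suggested above is straightforward to express with two goroutines joined before dispatch. The function names and return values below are placeholders, not swarmkit code:

```go
package main

import (
	"fmt"
	"sync"
)

// allocateNetwork and scheduleNode stand in for the allocator and scheduler
// steps; here neither needs the other's output, so they can run concurrently.
func allocateNetwork(taskID string) string { return "10.0.0.2/24" }
func scheduleNode(taskID string) string    { return "node-b" }

func main() {
	var wg sync.WaitGroup
	var ip, node string
	wg.Add(2)
	go func() { defer wg.Done(); ip = allocateNetwork("task1") }()
	go func() { defer wg.Done(); node = scheduleNode("task1") }()
	wg.Wait()
	// The dispatcher needs both results before the task becomes visible to an agent.
	fmt.Println(ip, node)
}
```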
@mavenugo as far as I understand, this one should already be handled by https://github.com/docker/swarmkit/blob/master/manager/scheduler/filter.go#L139-L183, as it checks that the node has the needed plugins.
Or was there some other thing which I forgot?
There are a few cases (such as vlan drivers) in which the network resources (IPAM in particular) that are allocated depend on the node to which the task will be dispatched. In the current model, allocation happens prior to scheduling, and hence these network drivers are not able to allocate node-level resources. By swapping the allocator and scheduler in the pipeline, the allocation module can make use of the scheduling decision and let the IPAM plugin control the allocation decision.
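The node-dependent allocation described here can be sketched as a per-node address pool that is only consulted after the scheduler has picked a node. The pool contents and node names below are made up for illustration; real vlan IPAM would come from the driver:

```go
package main

import "fmt"

// For vlan-style drivers, the usable address range depends on which node the
// task lands on, so allocation has to wait for the scheduling decision.
// These pools are illustrative, not real driver state.
var nodePools = map[string][]string{
	"node-a": {"192.168.10.2", "192.168.10.3"},
	"node-b": {"192.168.20.2", "192.168.20.3"},
}

// allocateIP runs post-scheduling: it draws an address from the pool of the
// node the task was assigned to, which is impossible before scheduling.
func allocateIP(node string) (string, error) {
	pool := nodePools[node]
	if len(pool) == 0 {
		return "", fmt.Errorf("no addresses left on %s", node)
	}
	ip := pool[0]
	nodePools[node] = pool[1:]
	return ip, nil
}

func main() {
	// Suppose the scheduler picked node-b; only now can the IPAM step run.
	ip, err := allocateIP("node-b")
	fmt.Println(ip, err)
}
```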