BoPeng opened this issue 6 years ago.
What's the advantage of specifying multiple queues for each task -- a task is the smallest SoS unit, so how can one task run on multiple hosts?
Sorry, I created this ticket from my cellphone without details. I have updated the ticket.
In terms of comparison between "workers" of sos and ipyparallel, I would emphasize that sos workers are not ipyparallel-style engines that users send jobs to (`apply`, `apply_async` etc). They are more like the concurrent substeps in the same SoS step.

I see, of course. Let me try to interpret the implications. One benefit of implementing the ipyparallel model would be a more balanced workload DAG-wise, not just locally. Another would be a more homogeneous job-controlling mechanism, which would also open the potential to natively support more systems (like nextflow?), although it might be hard to reach the level of sophistication of a system like PBS. Right?
Yes, the "mini-cluster" approach of nextflow means it has some built-in load balancing features and we generally avoids this problem by letting others to manage the execution of large amount of tasks.
Our model currently is to send each task to a single external queue and let that queue manage it. Nextflow, in contrast, submits multi-node jobs on clusters and executes everything inside the resulting mini-cluster, so the heavy computation also happens inside nextflow. It avoids the overhead of submitting and monitoring PBS jobs and can be more efficient. SoS could also do this by allowing our workers to reside on multiple computing nodes and converting them into workers that handle jobs (substeps). We have already moved from local multiprocessing to single-node zmq messaging, so expanding the architecture to remote zmq workers would allow us to do something like that. That is to say, with something like

`mpirun -n 8 sos run script -m`

we could start sos on 8 nodes, with one sos instance acting as master and treating the other sos instances as workers (instead of creating sos workers locally). We could ignore option `-q` (or `task: queue=""`) in this case and submit all tasks as concurrent jobs to the workers. This can potentially be much more efficient, especially for your case (a large number of small tasks/jobs).
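To make the master/worker idea a bit more concrete, here is a minimal sketch of that kind of zmq fan-out, assuming plain pyzmq `PUSH`/`PULL` sockets, made-up port numbers and a placeholder `run_substep` function; it only illustrates the layout, not the actual SoS messaging protocol.

```python
# Toy zmq master/worker layout (hypothetical ports and payloads,
# not the real SoS protocol).
import sys
import zmq

MASTER = "node0"                       # assumed hostname of the master rank
TASK_PORT, RESULT_PORT = 5557, 5558    # made-up ports

def master(substeps):
    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)      # fans tasks out, round-robin, to workers
    sender.bind(f"tcp://*:{TASK_PORT}")
    sink = ctx.socket(zmq.PULL)        # collects results from workers
    sink.bind(f"tcp://*:{RESULT_PORT}")
    for sub in substeps:
        sender.send_json(sub)
    return [sink.recv_json() for _ in substeps]

def run_substep(sub):
    return f"done {sub['id']}"         # placeholder for real substep execution

def worker():
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect(f"tcp://{MASTER}:{TASK_PORT}")
    sender = ctx.socket(zmq.PUSH)
    sender.connect(f"tcp://{MASTER}:{RESULT_PORT}")
    while True:                        # keep pulling substeps until killed
        sub = receiver.recv_json()
        sender.send_json({"id": sub["id"], "result": run_substep(sub)})

if __name__ == "__main__":
    if sys.argv[1] == "master":
        print(master([{"id": i} for i in range(100)]))
    else:
        worker()
```

With `mpirun` one would simply run the master role on rank 0 and the worker role on the other ranks; the `PUSH`/`PULL` pair already round-robins substeps across connected workers, which gives a basic form of load balancing for free.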
A typical work pattern for ipyparallel is to start multiple local or remote workers; ipyparallel then sends multiple tasks to them using `map`, `map_async` etc. and handles task monitoring and load balancing by itself.
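For comparison, the ipyparallel pattern looks roughly like this, assuming engines have already been started with something like `ipcluster start -n 4`:

```python
import ipyparallel as ipp

rc = ipp.Client()                   # connect to the running engines
view = rc.load_balanced_view()      # ipyparallel picks whichever engine is free

# send many small tasks and wait for all of them
ar = view.map_async(lambda x: x ** 2, range(1000))
print(ar.get())
```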
Translating to sos, it would mean something like submitting tasks to several queues/hosts at once and letting sos distribute them. This does not work right now because sos supports one queue for each task: the remote tasks are executed one by one, with the task engine controlling the number of concurrent tasks on each worker (queue).
Basically, ipyparallel is not a workflow system. It is an (interactive) batch system more similar to `rq` and `celery` than to sos. SoS relies on batch systems (PBS, rq etc) to handle multiple jobs. However, it is possible for SoS to handle multiple queues (workers) directly, so that

`sos run -q q1 q2 q3`

would submit tasks to multiple queues directly, without the need for a task queue or PBS system; sos would then need to handle load balancing etc. itself (see the sketch below).

In practice our department has a number of individual servers not managed by PBS, so it is possible to do this, but a PBS system would be much better at managing resources, so I am wondering if there is really a need for SoS to support such a scenario.
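If sos did manage several queues itself, a very naive version of that load balancing could simply track how many tasks have been handed to each queue and always pick the least loaded one. A toy sketch, where `submit_to_queue` is a hypothetical callback rather than an existing sos API:

```python
import heapq

def balance(tasks, queues, submit_to_queue):
    """Dispatch each task to the queue with the fewest submissions so far.

    `submit_to_queue(task, queue)` is a hypothetical hook that hands one
    task to one host/queue; it is not an existing SoS function.
    """
    load = [(0, q) for q in queues]     # (tasks submitted so far, queue name)
    heapq.heapify(load)
    for task in tasks:
        n, q = heapq.heappop(load)      # least-loaded queue
        submit_to_queue(task, q)
        heapq.heappush(load, (n + 1, q))

# e.g. balance(all_tasks, ['q1', 'q2', 'q3'], submit_to_queue)
```

A real implementation would of course also have to track completions and failures per queue, which is exactly the kind of resource management that PBS and the other batch systems already do well.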