radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Can user manually decide how to schedule tasks (weighting factor / completely manual assignment)? #3023

Closed GKNB closed 1 week ago

GKNB commented 1 year ago

I notice that sometimes the scheduler in RCT cannot produce the optimal scheduling scheme. Here is a simple example with two tasks: one with 48 processes and 1 thread per process, and the other with 8 processes, each with 2 CPU threads and 1 GPU:

import radical.pilot as rp
import radical.utils as ru

report = ru.Reporter(name='radical.pilot')
report.title('Getting Started (RP version %s)' % rp.version)

session = rp.Session()

pmgr   = rp.PilotManager(session=session)
tmgr   = rp.TaskManager(session=session)

report.header('submit pilots')
n_nodes = 2
pd_init = {
        'resource'      : 'anl.polaris',
        'runtime'       : 10,  # pilot runtime (min)
        'exit_on_error' : True,
        'project'       : 'RECUP',
        'queue'         : 'debug',
        'access_schema' : 'local',
        'cores'         : 32 * n_nodes,
        'gpus'          : 4 * n_nodes
        }
pdesc = rp.PilotDescription(pd_init)
pilot = pmgr.submit_pilots(pdesc)

report.header('submit tasks')
tmgr.add_pilots(pilot)

td1 = rp.TaskDescription()
td1.cpu_processes    = 48
td1.cpu_threads      = 1
td1.executable       = '/bin/bash'
td1.arguments        = ['/home/twang3/myWork/rct-unit-test/rp_wait_task/date_and_sleep.sh']

td2 = rp.TaskDescription()
td2.cpu_processes    = 8
td2.cpu_threads      = 2
td2.gpu_processes    = 1
td2.gpu_process_type = rp.CUDA
td2.executable       = '/bin/bash'
td2.arguments        = ['/home/twang3/myWork/rct-unit-test/rp_wait_task/date_and_sleep.sh']

# Note: with ways 1 and 2 below, a large number of tasks might not arrive
# at the scheduler at the same time.

# Usual way 1: the tasks do not run in parallel
tmgr.submit_tasks([td1, td2])

# Usual way 2: the tasks do not run in parallel
# tmgr.submit_tasks([td2])
# tmgr.submit_tasks([td1])

# Advanced way Mikhail introduced: the tasks always run in parallel
# tmgr.submit_tasks([td2])
# tmgr.wait_tasks(state=rp.AGENT_SCHEDULING)
# tmgr.submit_tasks([td1])

tmgr.wait_tasks()

report.header('finalize')
session.close(download=True)
report.header()

Polaris has 32 CPU cores and 4 GPUs per node, so the above workload could be executed in parallel: for example, cores 0-7 and GPUs 0-3 on each node for the second task, and the remaining cores 8-31 on both nodes for the first task. However, when RP runs this script, the two tasks execute serially. The reason is that the scheduler first schedules the first task, which takes cores 0-31 on the first node and cores 0-15 on the second node. It then tries to schedule the second task and finds that the second node has no sufficient CPU/GPU combination left, so the second task can only start after the first task has finished.
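
To make the arithmetic explicit, here is a back-of-the-envelope sketch of the per-node accounting (plain Python for illustration, not RP code):

# Hypothetical accounting helper, not RP code: track free cores/GPUs per node.
nodes = [{'cores': 32, 'gpus': 4}, {'cores': 32, 'gpus': 4}]

# task 1: 48 processes x 1 core -> 32 cores on node 0, 16 cores on node 1
nodes[0]['cores'] -= 32
nodes[1]['cores'] -= 16

# task 2 needs 8 x (2 cores + 1 GPU), i.e. 16 cores and 8 GPUs, with each
# GPU co-located with its 2 cores: node 0 has free GPUs but no free cores,
# node 1 has free cores but only 4 GPUs -> task 2 must wait for task 1.
print(nodes)   # [{'cores': 0, 'gpus': 4}, {'cores': 16, 'gpus': 4}]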

I think one solution is to change the weighting factor the scheduler uses for tasks. Currently the scheduler associates each task with a "tuple_size" attribute and first sorts the tasks by a score computed from that tuple_size. If we changed the score to be based on, for example, GPUs per process, the issue above would be solved. To make the scheduler more robust, however, we would probably need a more sophisticated way to compute the score.
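
As a minimal sketch of that idea (plain Python, not the actual RP scheduler; the scoring function and the dict representation are hypothetical), sorting pending tasks by GPU demand first would place td2 before td1:

# Hypothetical score: GPU demand dominates, total core count breaks ties.
def score(task):
    gpus  = task.get('gpu_processes', 0)
    cores = task.get('cpu_processes', 1) * task.get('cpu_threads', 1)
    return (gpus, cores)

pending = [
    {'name': 'td1', 'cpu_processes': 48, 'cpu_threads': 1},
    {'name': 'td2', 'cpu_processes': 8, 'cpu_threads': 2, 'gpu_processes': 1},
]
pending.sort(key=score, reverse=True)
print([t['name'] for t in pending])   # -> ['td2', 'td1']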

Another solution is based on Mikhail's suggestion: first submit task 2, ask the scheduler to wait until task 2 is scheduled, and then submit task 1 (see the commented-out "advanced" block above). In that case the code works fine, but the pattern is currently unsupported in EnTK, which we need to use.

A more serious but less common issue I want to mention is that there are scenarios where, even if we change the weighting factor, the scheduler still cannot produce the optimal scheduling scheme. I believe this is because a greedy algorithm is not optimal for this packing problem. Consider the following example:

import radical.pilot as rp
import radical.entk  as re

t1 = re.Task({
    'executable'    : '/bin/bash',
    'arguments'     : ['/home/twang3/myWork/rct-unit-test/schedule_alg/date_and_sleep.sh'],
    'cpu_reqs'      : {'cpu_processes'  : 2,
                       'cpu_threads'    : 14,
                       'cpu_thread_type': rp.OpenMP},
    })

t2 = re.Task({
    'executable'    : '/bin/bash',
    'arguments'     : ['/home/twang3/myWork/rct-unit-test/schedule_alg/date_and_sleep.sh'],
    'cpu_reqs'      : {'cpu_processes'  : 4,
                       'cpu_threads'    : 9,
                       'cpu_thread_type': rp.OpenMP},
    })

# optimal packing per node : [14 9 9] [14 9 9]  -> both tasks fit (32 cores per node)
# greedy (t2 first)        : [9 9 9]  [9]       -> no room left for t1's 14-thread processes

s = re.Stage()
s.add_tasks([t1, t2])
p = re.Pipeline()
p.add_stages(s)

amgr = re.AppManager()
n_nodes = 2
amgr.resource_desc = {
        'resource'  :   'anl.polaris',
        'project'   :   'RECUP',
        'queue'     :   'debug',
        'cpus'      :   n_nodes * 32,
        'gpus'      :   n_nodes * 4,
        'walltime'  :   5
        }
amgr.workflow = [p]
amgr.run()

In this example, no matter whether the scheduler places the first task or the second task first, the two tasks cannot run in parallel. If it schedules the first task first, the first node uses cores 0-27 for it, leaving no room for the second task. If it schedules the second task first, the first node uses cores 0-26 and the second node cores 0-8 for it, leaving no room for the first task. However, if we schedule them together, they can run in parallel: cores 0-13 on both nodes for the first task, and cores 14-31 on both nodes for the second task.
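
The packing argument can be reproduced with a toy first-fit model (plain Python for illustration, not RP's actual scheduler):

# Toy model: a node has 32 cores; each process of a task must fit entirely
# on one node.  first_fit() returns False if any process cannot be placed.
def first_fit(procs, free):
    for threads in procs:
        for node in range(len(free)):
            if free[node] >= threads:
                free[node] -= threads
                break
        else:
            return False
    return True

t1 = [14, 14]        # 2 processes x 14 threads
t2 = [9, 9, 9, 9]    # 4 processes x  9 threads

print(first_fit(t1 + t2, [32, 32]))   # greedy, t1 first -> False
print(first_fit(t2 + t1, [32, 32]))   # greedy, t2 first -> False
# yet a coordinated packing exists: [14, 9, 9] fills each 32-core node exactly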

Because of that, I am wondering whether it would be possible to let users decide how to schedule tasks in a completely manual fashion. One by-product is that the user, who also knows the NUMA boundaries, could encode them in the manual placement: the scheduler itself would not need to be NUMA-aware, and the scheduling scheme could be optimized by hand.
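
For illustration only, a fully manual assignment could look something like the placement map below (a purely hypothetical data structure, not an existing RP or EnTK API), pinning each process of the first example's tasks to explicit cores and GPUs:

# Purely hypothetical placement map, not an existing RP API: pin the GPU
# task to cores 0-7 / GPUs 0-3 on each node, and the CPU task to the
# remaining cores 8-31 on both nodes (matching the first example above).
manual_slots = {
    'task_gpu': [{'node': n, 'cores': [2 * i, 2 * i + 1], 'gpus': [i]}
                 for n in (0, 1) for i in range(4)],
    'task_cpu': [{'node': n, 'cores': [c], 'gpus': []}
                 for n in (0, 1) for c in range(8, 32)],
}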

andre-merzky commented 2 months ago

This is addressed in #3117 and #3199 now.

andre-merzky commented 1 week ago

Closed via #3117