Running tasks that require multiple nodes

mdorier commented 2 years ago

I just started reading about GPtune and I was wondering if it could handle the use-case that I have:

My "objective function" comes in the form of a workflow that that needs to run on N nodes (let's say 4). It comes in the form of a bash script that takes input parameters, does some setup, and calls a series of mpirun/aprun/srun (whichever MPI launcher is appropriate). The bash script ends up printing the workflow's run time in its standard output, which is what I am interested in optimizing as a function of the input parameters.

Within an allocation of M nodes (say M = 128), I typically spawn M/N instances of the workflow bash script in parallel, and look at their run time.

Can GPtune "drive" the execution of these workflow instances, i.e. continue spawning new workflow instances and decide on what parameters to try next?

If I understand GPtune correctly, it is itself an MPI-based application, optimizing a python objective function in parallel on many compute nodes. But in my case I would need GPtune to run on the MOM node rather than the compute node, to be aware of the number of compute nodes available in the job allocation, and to call the workflow script from the MOM node so that the workflow script can take care of executing the appropriate MPI commands to start the workflow steps.

If GPtune is able to do something like that, do you have an example code that I can try?

liuyangzhuan commented 2 years ago

Hi, yes. I think GPTune can address this situation. Before proceeding, can you explain more about

"Within an allocation of M nodes (say M = 128), I typically spawn M/N instances of the workflow bash script in parallel, and look at their run time." Does each of the M/N instances have its own input parameters, or do they share the same input parameters? Do you want GPTune to decide on which next single instance, or the next M/N instances in one shot?

mdorier commented 2 years ago

Yes, each workflow instance has its own set of parameter values, chosen manually for now, and I would like GPtune to select them based on a configuration space that I can define.

liuyangzhuan commented 2 years ago

Based on your info, here is what gptune can do. In a typical setting, GPTune does Gaussian process modeling, it will ask for "NS" total function evaluations. It will first generate "NS1" random parameter configurations as pilot samples, and then "NS-NS1" sequential samples iteratively. If you are doing single-objective tuning, each iteration generates 1 new sample; if you are doing multi-objective, each iteration can generates more than 1 sample.

I'm assuming you are doing single-objective (correct me if this is not the case)? If so, the NS1 pilot samples can be run in parallel (say calling M/N instances each time), but the sequential samples you have to call one instance at a time. You could perhaps set a large NS1 value such that the you use as much parallelism as possible.

The closest working example is https://github.com/gptune/GPTune/blob/master/examples/SuperLU_DIST_RCI/superlu_MLA_RCI.sh https://github.com/gptune/GPTune/blob/master/examples/SuperLU_DIST_RCI/superlu_MLA_RCI.py

More explanation can be found in 4.18 and 5.3 of https://github.com/gptune/GPTune/blob/master/Doc/GPTune_UsersGuide.pdf

Note that in superlu_MLA_RCI.sh, each iteration of the while loop (will contain multiple iterations for all pilot samples) at line 58 asks for one sample (in your case you launch one instance with N nodes), the application is launched via srun/jsrun/mpirun at lines 99-124 (in your case you call your bash script here). The application takes input parameters from command line and env variables (in your case you just pass them to your bash script), and read output from the runlog, a.out.

Please give it a try and let me know if you have problems understanding these two files.

mdorier commented 2 years ago

Thanks, that's helpful. So if I understand correctly, after the initial NS1 function evaluations, which can be run in parallel, parallelization can only be done across objectives (e.g. 2 evaluations in parallel if I have 2 objectives)?

liuyangzhuan commented 2 years ago

If you have a single objective, the model will predict one sample at a time. But if you have multiple objectives, then the model will predict a few (which can be any number by setting https://github.com/gptune/GPTune/blob/c79f64a7a7286002ad79f3ab42d2b6b963c199f8/GPTune/options.py#L77 ) on the estimated pareto front. The quality of multiple samples in each iteration cannot guaranteed though.

younghyunc commented 2 years ago

Hi @mdorier

Regarding @liuyangzhuan's comment below on parallel sample evaluation,

If so, the NS1 pilot samples can be run in parallel (say calling M/N instances each time), but the sequential samples you have to call one instance at a time. You could perhaps set a large NS1 value such that the you use as much parallelism as possible.

We have checked the feature to evaluate pilot samples in parallel. First, please use our latest version (latest commit: https://github.com/gptune/GPTune/commit/01b122070b35a08c17e39be0105f6f066693cf71)

And, in your GPTune driver, assuming M is the number of allocated nodes and N is the number of nodes to be used by each instance, please set the following options.

options['distributed_memory_parallelism']=False options['shared_memory_parallelism'] = True options['objective_evaluation_parallelism']=True options['objective_multisample_threads']=M/N options['objective_nospawn']=True options['objective_nprocmax']=N

This will allow the pilot samples to be run and evaluated in parallel. Please let us know if there are any further issues. Thank you!

liuyangzhuan commented 2 years ago

Thanks @younghyunc! Just a minor correction, M and N should be physical cores instead of nodes.

mdorier commented 2 years ago

Whether M and N are number of cores or nodes, this won't affect objective_multisample_threads, since it's M/N, but it will affect objective_nprocmax. What is the purpose of objective_nprocmax?

liuyangzhuan commented 2 years ago

objective_nprocmax defines the number of cores required for each function evaluation, you need to set it whenever objective_evaluation_parallelism is used. This will trigger a few internal validation tests inside GPTune. Just to be more clear, suppose your has M compute nodes, m cores per node, and each run requires N nodes, then you need to set: options['distributed_memory_parallelism']=False options['shared_memory_parallelism'] = True options['objective_evaluation_parallelism']=True options['objective_multisample_threads']=M/N options['objective_nospawn']=True options['objective_nprocmax']=N*m also in .gptune/meta.json of your application folder (you can generate the json file by using jq, take https://github.com/gptune/GPTune/blob/da5c4166edfb3558798f90d7522ad70c2dadfbe3/run_env.sh#L407 and https://github.com/gptune/GPTune/blob/da5c4166edfb3558798f90d7522ad70c2dadfbe3/examples/GPTune-Demo/run_examples.sh#L11 for an example), you need to set: "nodes": M, "cores": m

liuyangzhuan commented 2 years ago

I'm closing this issue now. For future reference, I'm adding the following notes: I attached a working example demo.py here. One just needs to put it into examples/GPTune-Demo and run run_examples.sh. Each function evaluation takes 1s, and pay attention to 'time_fun' in the runlog. You can run it again after setting options['objective_evaluation_parallelism'] = False and you can check 'time_fun' again.

demo.py:

import sys import os import logging

sys.path.insert(0, os.path.abspath(file + "/../../../GPTune/")) logging.getLogger('matplotlib.font_manager').disabled = True

from autotune.search import from autotune.space import from autotune.problem import from gptune import # import all

import argparse import numpy as np import time

from callopentuner import OpenTuner from callhpbandster import HpBandSter

def parse_args():

parser = argparse.ArgumentParser()

parser.add_argument('-nodes', type=int, default=1,help='Number of machine nodes')
parser.add_argument('-cores', type=int, default=2,help='Number of cores per machine node')
parser.add_argument('-machine', type=str,default='-1', help='Name of the computer (not hostname)')
parser.add_argument('-optimization', type=str,default='GPTune', help='Optimization algorithm (opentuner, hpbandster, GPTune)')
parser.add_argument('-ntask', type=int, default=1, help='Number of tasks')
parser.add_argument('-nrun', type=int, default=20, help='Number of runs per task')
parser.add_argument('-perfmodel', type=int, default=0, help='Whether to use the performance model')
parser.add_argument('-tvalue', type=float, default=1.0, help='Input task t value')

args = parser.parse_args()

return args

def objectives(point): """ f(t,x) = exp(- (x + 1) ^ (t + 1) cos(2 pi x)) (sin( (t + 2) (2 pi x) ) + sin( (t + 2)^(2) (2 pi x) + sin ( (t + 2)^(3) (2 pi x)))) """ t = point['t'] x = point['x'] a = 2 np.pi b = a t c = a x d = np.exp(- (x + 1) (t + 1)) np.cos(c) e = np.sin((t + 2) c) + np.sin((t + 2)2 * c) + np.sin((t + 2)*3 c) f = d * e + 1

# print('test:',test)
"""
f(t,x) = x^2+t
"""
# t = point['t']
# x = point['x']
# f = 20*x**2+t
time.sleep(1.0)

return [f]

def main():

import matplotlib.pyplot as plt
global nodes
global cores

# Parse command line arguments
args = parse_args()
ntask = args.ntask
nrun = args.nrun
tvalue = args.tvalue
TUNER_NAME = args.optimization
perfmodel = args.perfmodel

(machine, processor, nodes, cores) = GetMachineConfiguration()
print ("machine: " + machine + " processor: " + processor + " num_nodes: " + str(nodes) + " num_cores: " + str(cores))
os.environ['MACHINE_NAME'] = machine
os.environ['TUNER_NAME'] = TUNER_NAME

input_space = Space([Real(0., 10., transform="normalize", name="t")])
parameter_space = Space([Real(0., 1., transform="normalize", name="x")])
# input_space = Space([Real(0., 0.0001, "uniform", "normalize", name="t")])
# parameter_space = Space([Real(-1., 1., "uniform", "normalize", name="x")])

output_space = Space([Real(float('-Inf'), float('Inf'), name="y")])
constraints = {"cst1": "x >= 0. and x <= 1."}
if(perfmodel==1):
    problem = TuningProblem(input_space, parameter_space,output_space, objectives, constraints, models)  # with performance model
else:
    problem = TuningProblem(input_space, parameter_space,output_space, objectives, constraints, None)  # no performance model

computer = Computer(nodes=nodes, cores=cores, hosts=None)
options = Options()

options['model_restarts'] = 1

options['distributed_memory_parallelism'] = True
options['shared_memory_parallelism'] = False

options['objective_evaluation_parallelism'] = True
options['objective_multisample_threads'] = 1
options['objective_multisample_processes'] = 4   # maximum number of function evaluations running in parallel
options['objective_nprocmax'] = 2     # number of cores per function evaluation

options['model_processes'] = 1
# options['model_threads'] = 1
# options['model_restart_processes'] = 1

# options['search_class']=='SearchSciPy'
options['search_multitask_processes'] = 1
# options['search_multitask_threads'] = 1
# options['search_threads'] = 16

# options['mpi_comm'] = None
#options['mpi_comm'] = mpi4py.MPI.COMM_WORLD
options['model_class'] = 'Model_GPy_LCM' #'Model_GPy_LCM'
options['verbose'] = False
# options['sample_algo'] = 'MCS'
# options['sample_class'] = 'SampleLHSMDU'

options.validate(computer=computer)

if ntask == 1:
    giventask = [[round(tvalue,1)]]
elif ntask == 2:
    giventask = [[round(tvalue,1)],[round(tvalue*2.0,1)]]
else:
    giventask = [[round(tvalue*float(i+1),1)] for i in range(ntask)]

NI=len(giventask)
NS=nrun

TUNER_NAME = os.environ['TUNER_NAME']

if(TUNER_NAME=='GPTune'):
    data = Data(problem)
    gt = GPTune(problem, computer=computer, data=data, options=options,driverabspath=os.path.abspath(__file__))
    (data, modeler, stats) = gt.MLA(NS=NS, Igiven=giventask, NI=NI, NS1=NS-1, T_sampleflag=[True]*NI)
    # (data, modeler, stats) = gt.MLA(NS=NS, Igiven=giventask, NI=NI, NS1=NS-1)
    print("stats: ", stats)
    """ Print all input and parameter samples """
    for tid in range(NI):
        print("tid: %d" % (tid))
        print("    t:%f " % (data.I[tid][0]))
        print("    Ps ", data.P[tid])
        print("    Os ", data.O[tid].tolist())
        print('    Popt ', data.P[tid][np.argmin(data.O[tid])], 'Oopt ', min(data.O[tid])[0], 'nth ', np.argmin(data.O[tid]))

if(TUNER_NAME=='opentuner'):
    (data,stats)=OpenTuner(T=giventask, NS=NS, tp=problem, computer=computer, run_id="OpenTuner", niter=1, technique=None)
    print("stats: ", stats)
    """ Print all input and parameter samples """
    for tid in range(NI):
        print("tid: %d" % (tid))
        print("    t:%f " % (data.I[tid][0]))
        print("    Ps ", data.P[tid])
        print("    Os ", data.O[tid].tolist())
        print('    Popt ', data.P[tid][np.argmin(data.O[tid])], 'Oopt ', min(data.O[tid])[0], 'nth ', np.argmin(data.O[tid]))

if(TUNER_NAME=='hpbandster'):
    (data,stats)=HpBandSter(T=giventask, NS=NS, tp=problem, computer=computer, run_id="HpBandSter", niter=1)
    print("stats: ", stats)
    """ Print all input and parameter samples """
    for tid in range(NI):
        print("tid: %d" % (tid))
        print("    t:%f " % (data.I[tid][0]))
        print("    Ps ", data.P[tid])
        print("    Os ", data.O[tid].tolist())
        print('    Popt ', data.P[tid][np.argmin(data.O[tid])], 'Oopt ', min(data.O[tid])[0], 'nth ', np.argmin(data.O[tid]))

if(TUNER_NAME=='cgp'):
    from callcgp import cGP
    options['EXAMPLE_NAME_CGP']='GPTune-Demo'
    options['N_PILOT_CGP']=int(NS/2)
    options['N_SEQUENTIAL_CGP']=NS-options['N_PILOT_CGP']
    (data,stats)=cGP(T=giventask, tp=problem, computer=computer, options=options, run_id="cGP")
    print("stats: ", stats)
    """ Print all input and parameter samples """
    for tid in range(NI):
        print("tid: %d" % (tid))
        print("    t:%f " % (data.I[tid][0]))
        print("    Ps ", data.P[tid])
        print("    Os ", data.O[tid].tolist())
        print('    Popt ', data.P[tid][np.argmin(data.O[tid])], 'Oopt ', min(data.O[tid])[0], 'nth ', np.argmin(data.O[tid]))

if name == "main": main()

gptune / GPTune

Running tasks that require multiple nodes #8