python-adaptive / adaptive

📈 Adaptive: parallel active learning of mathematical functions
http://adaptive.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Inquiry on implementation of parallelism on the cluster #208

Closed (hainingpan closed this issue 5 years ago)

hainingpan commented 5 years ago

I intend to use multiple cores on a cluster, so I read the documentation at https://adaptive.readthedocs.io/en/latest/tutorial/tutorial.parallelism.html#mpi4py-futures-mpipoolexecutor. What confuses me is the number of processes to specify in mpiexec -n. The documentation says mpiexec -n 16 python -m mpi4py.futures run_learner.py, and I am wondering why it is 16 instead of 1. (It says 1 in "On your laptop/desktop you can run this script like: mpiexec -n 1 python run_learner.py"; is that just assuming a laptop/desktop has only one core?)

By specifying 16, will multiple instances be created on different cores, so that the adaptive.Runner() on each core is executed simultaneously? (If so, this doesn't seem to benefit from parallelism, since every core is redundantly doing the same thing.) I would instead expect the following behavior: there are multiple processes, but only one is the master process, and the other processes take jobs from the process pool. So I would guess the code looks like this:

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:  # assume the master process has rank 0
    learner = adaptive.Learner1D(...)
    runner = adaptive.Runner(...)
else:  # workers that take jobs from the parallel pool; I am not sure how to implement this part
    ...

So how do I know in which mode the Runner operates, and whether it actually benefits from parallelism?

hainingpan commented 5 years ago

To make my question clearer, please look at this example code in the documentation:

import adaptive
from mpi4py.futures import MPIPoolExecutor

# `f` and `fname` are assumed to be defined earlier in the script

learner = adaptive.Learner1D(f, bounds=(-1, 1))

# load previously saved data, if any
learner.load(fname)

# run until `goal` is reached with an `MPIPoolExecutor`
runner = adaptive.Runner(
    learner,
    executor=MPIPoolExecutor(),
    shutdown_executor=True,
    goal=lambda l: l.loss() < 0.01,
)

# periodically save the data (in case the job dies)
runner.start_periodic_saving(dict(fname=fname), interval=600)

# block until the runner's goal is reached
runner.ioloop.run_until_complete(runner.task)

This code does not designate any process as master or worker. When all the parallel processes reach the line runner.start_periodic_saving(dict(fname=fname), interval=600), will each process just try to save the data it has? If so, won't that become a mess? On a distributed cluster the cores usually do not share memory, so how can one process know what the others hold in their memory? The process that saves last would overwrite the earlier saves with only its own data. I would expect the ideal scenario to be that only one process does the saving, even though all of them are involved in the Runner. But since no process here is assigned to be the master (the one doing the saving), how is that possible?

basnijholt commented 5 years ago

When you start your process like mpiexec -n 16 python -m mpi4py.futures run_learner.py, it means that 15 cores will go to the MPIPoolExecutor() and 1 core will be used for the local process, the one in which the learner and the runner communicate.
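To make that split visible, here is a minimal sketch (not from this thread; check_split.py and whoami are illustrative names) that reports which ranks actually serve the pool:

from mpi4py import MPI
from mpi4py.futures import MPIPoolExecutor

def whoami(_):
    # executed on a worker process; the master (rank 0) never runs this
    return MPI.COMM_WORLD.Get_rank()

if __name__ == "__main__":
    # run with: mpiexec -n 16 python -m mpi4py.futures check_split.py
    with MPIPoolExecutor() as executor:
        worker_ranks = sorted(set(executor.map(whoami, range(100))))
    print("master rank:", MPI.COMM_WORLD.Get_rank())  # 0
    print("worker ranks:", worker_ranks)              # drawn from 1..15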

I am not entirely sure why on your laptop it works with -n 1; @dalcinl (the author of mpi4py) wrote about that here. Edit: you can actually use export MPI4PY_MAX_WORKERS=15 or -n 16!

More is explained here.

Note that this MPIPoolExecutor is only useful when your code doesn't use MPI already!

dalcinl commented 5 years ago

Just to clarify it again, when you run your code this way on your laptop:

mpiexec -n 1 python run_learner.py  # note: there is no '-m mpi4py.futures' here

then mpi4py.futures will spawn new workers at runtime using an MPI feature called dynamic process management, which has been in the MPI standard since 1998. You can use export MPI4PY_MAX_WORKERS=15 to control the number of workers, or you can request a specific number of workers in your code with MPIPoolExecutor(max_workers=15). If you create many executors, each will have its own pool of workers, each pool isolated from the others.
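As a minimal sketch of this spawning mode (spawn_demo.py and square are illustrative names, not from this thread):

from mpi4py.futures import MPIPoolExecutor

def square(x):
    return x ** 2

if __name__ == "__main__":
    # run with: mpiexec -n 1 python spawn_demo.py
    # workers are spawned at runtime; max_workers (or the
    # MPI4PY_MAX_WORKERS environment variable) caps the pool size
    with MPIPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(square, range(8))))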

Unfortunately, we are in 2019, and clusters/supercomputers still fail to provide support for dynamic process management, or they make it quite cumbersome to set up. So you cannot (or cannot easily) spawn new workers once the MPI execution has started. To alleviate this mess, mpi4py.futures provides an alternative execution mechanism: you start all P processes as part of the MPI execution, and then mpi4py.futures splits COMM_WORLD into one master and P-1 workers. To use this alternative execution mechanism you have to run this way (BTW, it also works on your laptop):

mpiexec -n $P python -m mpi4py.futures run_learner.py # note the `-m mpi4py.futures` flag

A minor annoyance of this execution mode is that if your code creates many executors, all of them will share the same pool of P-1 workers.
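As a minimal sketch of that shared pool (again with illustrative names, shared_pool.py and double, not from this thread):

from mpi4py.futures import MPIPoolExecutor

def double(x):
    return 2 * x

if __name__ == "__main__":
    # run with: mpiexec -n 4 python -m mpi4py.futures shared_pool.py
    # both executors are served by the same pool of P - 1 = 3 workers
    with MPIPoolExecutor() as ex1, MPIPoolExecutor() as ex2:
        print(list(ex1.map(double, range(3))))  # [0, 2, 4]
        print(list(ex2.map(double, range(3))))  # [0, 2, 4]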