root-11 opened this issue 9 months ago

In branch `mp`, I've run the latest example of `test_multi_processing_advanced.py` with pypy310, with pytest installed in the virtual environment (see pypy for details).
The test is here: https://github.com/root-11/maslite/blob/mp/tests/test_multi_processing_advanced.py
This test is written in "old style" so that the clock doesn't progress whilst the agents are negotiating. However, when the agents are idle the clock jumps to the next alarm.
At the bottom of the test, you will find stdout with line 513:
# wall time: 0.047859992999292444 / sim time: 4.75 = 99.25X
In many agent-based systems, communication has scale-free distribution characteristics: a lot of messages are exchanged within clusters of agents, whilst few messages are exchanged between clusters.

For the purpose of simulation, this means that there is a benefit in distributing the simulation across multiple cores and exploiting the principles of multiprocessing.

Why not just keep things simple? Say you want to run a simulation at 100x real-time, but you can't, because there are 8 clusters of agents with sufficiently large overhead, exchanging enough messages, to slow you down to a wall-time/simulated-time relationship of 20x real-time. In other words, your simulation takes 5x as long to run as you would like. However, you also know that none of the clusters individually runs slower than 100x real-time, i.e. each cluster is fast enough on its own.

It would hence be convenient to partition the system such that each cluster runs on its own logical core.
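For illustration, here is the back-of-the-envelope arithmetic (the ratio numbers are those quoted above; nothing else is assumed):

```python
# Speed-up reported by the test's stdout (line 513):
wall_time = 0.047859992999292444   # seconds of wall time
sim_time = 4.75                    # seconds of simulated time
speedup = sim_time / wall_time     # ~99.25x real-time

# The motivating example: 8 clusters sharing one core run at 20x
# real-time, although each cluster alone is capable of 100x.
single_core_rate = 20              # x real-time, all clusters on one core
target_rate = 100                  # x real-time, desired
slowdown = target_rate / single_core_rate  # simulation takes 5x too long

print(f"{speedup:.2f}x real-time, {slowdown:.0f}x slower than desired")
```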
Modern PCs have many cores. I can, for example, log on to Azure and rent a 48-core machine with 1 TB of RAM for under $100 an hour.
For problems where the clusters exceed the capacity of a single machine, I would have to partition the simulation over multiple machines and permit message exchange between the machines using sockets.
As the expansion from single-core to multi-core simulation is the first and most important step, I will in what follows present a simple recipe for such a configuration, which I will explain step by step. The whole script is at the bottom of this post as `mp_test.py`, so you won't need to copy-paste the code.

**A nice API**
The first thing we want is a nice API, but for multiprocessing we can't just launch a python process and split the scheduler: we need a parent-child relationship. In the snippet below, I have chosen `MPmain` as the parent and `MPScheduler` as the child.

The `MPScheduler` is practically just a regular `maslite.Scheduler` with some multiprocessing switches added. `MPmain` is a constructor for each subprocess, so I've added the method `MPmain.new_scheduler()` as a constructor of the `MPScheduler`, so that it is packaged with all the necessary multiprocessing queues for message exchange between the processes.

To give this some context, I am setting up a simple and generic demonstration where `mpmain` controls two schedulers which each have two agents (4 agents in total). We add the agents to either scheduler (= cluster/partition) and ask the first agent to obtain a signature from all other agents in the system. The agent which sees that all other agents have signed the message will abort further message exchange by sending a `contract` to `mpmain`, which stops the simulation.

Here is the script:
And this is the output:

We see how:

- `mpmain` receives the contract and starts to shut the simulation down.
- `mpmain` acknowledges that the schedulers have stopped correctly.

When you run the code, you will notice that it takes about half a second to start and to stop the subprocesses. This is because python needs to copy the loaded variables from the main process to the subprocesses. For simulations with a large number of agents, it could hence be more efficient to add a start-up process for each scheduler that loads what it needs, so as to avoid the overhead of copying objects around. In this snippet, however, I'd rather leave that out and focus on the messaging.
In the code you will find two messages (`ChainMsg` and `Contract`) and a signal (`Stop`). These are trivial.

You will also find a new class, `Link`, which is a container for the interprocess communication between the parent and child processes. It has two attributes: `to_proc` and `to_mp`. These are `multiprocessing.Queue`s which enable the parent (main) process to send messages to the child process (`to_proc`) and the child to send messages back (`to_mp`); the `Link` is hence the only object that is shared.

Sharing any other type of object between processes will lead to synchronisation issues, as the one process will hold python's GIL and lock the object. The good news, however, is that you can send any python object from one process to the other using `queue.put(obj)` on the sender side and retrieve it on the other side using `obj = queue.get_nowait()`.
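A minimal sketch of what such a `Link` could look like (the attribute names `to_proc` and `to_mp` follow the description above; the class body is my assumption):

```python
import multiprocessing
import queue  # get_nowait() raises queue.Empty when there is nothing pending

class Link:
    """The only object shared between the main process and one child."""
    def __init__(self):
        self.to_proc = multiprocessing.Queue()  # main -> child process
        self.to_mp = multiprocessing.Queue()    # child process -> main

if __name__ == "__main__":
    link = Link()
    link.to_proc.put({"sender": 1, "receiver": 2})  # any picklable object
    try:
        msg = link.to_proc.get(timeout=1)  # the schedulers use get_nowait()
    except queue.Empty:
        msg = None
    print(msg)
```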
The class `MPmain` has two methods that you may have come across before: `__enter__` and `__exit__`. These methods enable us to use the class as a context manager, so that it works with the `with` statement, just like file handles, where the file is closed when the with-block is left.

We need the context-manager style to assure that the subprocesses shut down properly when we leave the context, as a context manager's `__exit__` method is always called, even when an exception is raised.

For convenience, `MPmain` has the familiar API `run` to mirror the current scheduler's top-level behaviour.

This brings us to the individual schedulers that host each cluster. Each is very similar to a typical maslite scheduler, with some additional features for interprocess communication:
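Before diving into the scheduler, here is a rough sketch of how `MPmain`'s context-manager protocol and `run` entry point might fit together (the method bodies are my assumptions; the real code is in mp_test.py at the bottom):

```python
class MPmain:
    """Sketch of the parent process that owns the scheduler subprocesses."""
    def __init__(self):
        self.schedulers = []  # one subprocess per cluster

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Called whenever the with-block is left, also on exceptions,
        # which guarantees the subprocesses are shut down.
        self.stop()
        return False  # don't suppress exceptions

    def run(self):
        for proc in self.schedulers:
            proc.start()
        # ... pump messages between schedulers until a Stop arrives ...

    def stop(self):
        for proc in self.schedulers:
            proc.join()

if __name__ == "__main__":
    with MPmain() as main:  # __exit__ runs even if run() raises
        main.run()
```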
The top-level runner has 4 steps:
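The actual listing is in mp_test.py at the bottom; as a structural sketch (step order and method names are inferred from the walkthrough that follows, so treat them as guesses):

```python
class MPSchedulerSketch:
    """Only the loop skeleton; not the real MPScheduler."""
    def __init__(self):
        self._stopped = False
        self.steps_run = []  # records the order of the steps, for illustration

    def run(self):
        while not self._stopped:
            self.process_inter_proc_mail()  # 1. pull messages sent from main
            self.process_mail_queue()       # 2. deliver local mail, forward foreign mail
            self.update_agents()            # 3. let the agents react to new mail
            self.check_stop()               # 4. look for a Stop signal from main

    def process_inter_proc_mail(self): self.steps_run.append(1)
    def process_mail_queue(self): self.steps_run.append(2)
    def update_agents(self): self.steps_run.append(3)
    def check_stop(self):
        self.steps_run.append(4)
        self._stopped = True  # one pass is enough for the sketch
```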
This `run` method is called automatically by `mpmain` when the simulation starts, so you need to do nothing.

Next is the method `process_inter_proc_mail`. This method looks at the message queue to self (`self.mq_to_self`), tries to retrieve any pending messages, and adds them to its main `mail_queue`.
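As a sketch, assuming `mq_to_self` is a `multiprocessing.Queue` and `mail_queue` a plain list (the names follow the text above; the method body is my guess):

```python
import multiprocessing
import queue  # get_nowait() raises queue.Empty when the queue is drained

class MailPumpSketch:
    def __init__(self):
        self.mq_to_self = multiprocessing.Queue()  # filled by the main process
        self.mail_queue = []                       # the scheduler's local mail queue

    def process_inter_proc_mail(self):
        while True:
            try:
                msg = self.mq_to_self.get_nowait()
            except queue.Empty:
                break  # nothing pending; back to the run-loop
            self.mail_queue.append(msg)
```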
Next, the `run`-loop processes the messages, just like the maslite scheduler. This code should appear very familiar. The only new thing is the if-else clause where the scheduler concludes that if `msg.receiver` is not in `self.agents`, then it must be a message for someone outside its cluster, and it puts the message onto the queue to main.

Finally, as the messages have been sorted to each of the agents, the agents are updated with the familiar method `update_agents`:

As you can see, the system is very simple, and there is no cognitive overhead for the developer. The only difference from classical `maslite.Scheduler` usage is that the developer needs to decide in which cluster to put each agent.

This brings me to the final note: inter-process communication is a bit slower than within-process communication, so if the agents communicate a lot outside their cluster, the system will seem slower (!) than running it conventionally. This is expected; the developer just needs to know this.

Here is the whole script: