uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io
Other
1.38k stars 89 forks source link

ParallelPool doesn't use all servers #189

Closed reox closed 4 years ago

reox commented 4 years ago

I found pathos via stackoverflow and so far it looks quite promising. I have a cluster here and I want to test it by running a small function on every node. However, I can not get it to work, not even on a single node...

My current setup: All the nodes have the same python and pathos version. On the node cluster02 I ran ppserver -p 31337 -d. My start script looks like this:

def sleepy_squared(x):
    from time import sleep
    import platform
    print("hello", platform.node())
    sleep(1)
    return x**2

from pathos.pools import ParallelPool as Pool
from pathos.pp import stats
p = Pool(servers=('cluster02:31337', ))

print(stats())

x = list(range(0, 64))
y = p.map(sleepy_squared, x)
print(y)
print(stats())

So far I get the correct result, but I only see the hostname of the host where I run the script. I read in #92 that I have to set the number of cpus on the local node to 0, but where is this function to set the cpus? In the stats I only see

Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
        64 |        100.00 |       0.8065 |     0.012602 | local
Time elapsed since server creation 0.12056708335876465
0 active tasks, 8 cores

So it looks like the cluster node was never allocated? On the node however, I see a lot of messages like this:

2020-05-15 11:54:42,541 - pp - DEBUG - Closing client socket
Exception in thread client_socket:
Traceback (most recent call last):
  File "/programs/shared/anaconda/2019.10/python37/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/programs/shared/anaconda/2019.10/python37/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/reox/.local/bin/ppserver", line 187, in crun
    ctype = ppc.str_(mysocket.receive())
  File "/home/reox/.local/lib/python3.7/site-packages/pp/transport.py", line 252, in receive
    raise RuntimeError("Socket connection is broken")
RuntimeError: Socket connection is broken

2020-05-15 12:39:07,288 - pp - DEBUG - Control message received: STAT
2020-05-15 12:39:07,343 - pp - DEBUG - Exception in crun method (possibly expected)
Traceback (most recent call last):
  File "/home/reox/.local/bin/ppserver", line 198, in crun
    mysocket.receive()
  File "/home/reox/.local/lib/python3.7/site-packages/pp/transport.py", line 248, in receive
    msg = self.socket.recv(e_size-r_size)
ConnectionResetError: [Errno 104] Connection reset by peer

What is wrong here?

And an additional question: Is there a mode where the remote server is started by pathos on its own? For example if I have ssh access, could the script not start the servers on startup?

reox commented 4 years ago

I'm one step ahead. If I specicy Pool(nodes=0, servers=('cluster02:31337', 'cluster03:31337') I can get least get it to run on one of the cluster machines. However, there is no round-robin or something else. I found out that the Server object comes from parallelpython. So do I have to pass extra options to it to make it work?

mmckerns commented 4 years ago

I'd first try a pool of cluster02 only, and cluster03 only. If they both work individually, then that's progress. Putting them both in the same pool will favor whatever the algorithm detects as "faster". I'd have liked to have added more control over how the jobs get partitioned (i.e. N clusters, M/N tasks per cluster versus draw a job when one completes), but that's not implemented yet for distributed computing.

What are you seeing? Is one (e.g. server02) running all the jobs? When you run ppserver there's a -h option to get help printed out. I use -d, -p and sometimes -w and not -a. You can actually establish a ssh (or similar) connection and start up a poserver remotely. There's a script that gets installed that you can use to establish a ssh connection, like a tunnel, and will also launch a ppserver through the tunnel. See the pathos_connect script.

There are some outdated video demos (no sound) here: pp_pathos pp_mystic

API and names of objects have changed since then, but the examples still work. The updated versions of the examples used are in the examples folder for pathos, and the script that starts the tunnel and servers is now called pathos_connect.

reox commented 4 years ago

I'd first try a pool of cluster02 only, and cluster03 only.

If I'm setting one server it runs locally. If I'm specifying nodes=0 it runs on the server (or the last one in the tuple).

I tried to understand the logic in here: https://github.com/uqfoundation/pathos/blob/675beb712b519eee16ab6dc3c3ece6f93792f3b4/pathos/parallel.py#L193-L220 and I was thinking that maybe the allocation does not really work?

Because when I use parallelpython directly like this:

>>> import pp
>>> def fun(x):
...     import time
...     import platform
...     print("hello from {}".format(platform.node()))
...     time.sleep(10)
...     return x**2
...
>>> x = pp.Server(ppservers=('cluster02:31337', 'cluster03:31337'))
>>> for i in range(8+16+16):
...     x.submit(fun, (i, ))
>>> x.print_stats()
Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
         8 |         20.00 |      80.1860 |    10.023256 | local
        16 |         40.00 |     160.1439 |    10.008993 | cluster02:31337
        16 |         40.00 |     160.4629 |    10.028932 | cluster03:31337
Time elapsed since server creation 59.512935400009155
0 active tasks, 8 cores

I can see that all nodes were used (including local). However, that is not the case for pathos, even if more jobs are available than CPUs.

mmckerns commented 4 years ago

Sure. It's possible there's a bug in play with pathos in handling the servers. Just to confirm, you tested with your script you posted initially (going through pathos) and then using the second script (with ppft), and you see different behavior. There was no other differences in what you did. Is that correct?

reox commented 4 years ago

There was no other differences in what you did. Is that correct?

Yes, so to summarize:

p = Pool(servers=('cluster02:31337', 'cluster03:31337'))

runs only locally

p = Pool(nodes=0, servers=('cluster02:31337', 'cluster03:31337'))

starts on the last server in the list

any other argument to ncpus or nodes also only runs locally (so far I have not found a configuration which works) Using pp.Server directly with just ppservers in the argument seems to distribute correctly.

mmckerns commented 4 years ago

So... you found a bug. What happened was that there was a map(fun, *args), and while that was ok for python 2, it needed to be list(map(fun, *args)) for python 3. Fixed in: https://github.com/uqfoundation/pathos/commit/4e2a19adb2afdeff076268d275e1d531707c8048. Thanks for pointing this out.

I'm going to close this, but feel free to post another issue if you run into anything else.

reox commented 4 years ago

:D I would not have expected that :) Thanks for fixing it! I'll test it and see if I have any more problems.