reox closed this issue 4 years ago
I'm one step ahead. If I specify Pool(nodes=0, servers=('cluster02:31337', 'cluster03:31337')),
I can at least get it to run on one of the cluster machines.
However, there is no round-robin or anything similar. I found out that the Server
object comes from parallelpython. Do I have to pass extra options to it to make this work?
I'd first try a pool of cluster02 only, and cluster03 only. If they both work individually, then that's progress. Putting them both in the same pool will favor whatever the algorithm detects as "faster". I'd have liked to add more control over how the jobs get partitioned (i.e. N clusters, M/N tasks per cluster versus drawing a job when one completes), but that's not implemented yet for distributed computing.
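The two partitioning strategies mentioned above can be sketched in plain Python. This is purely illustrative (no pathos or pp involved, and the job durations are made up): static M/N chunking decided up front, versus drawing a job whenever a worker completes one.

```python
# Plain-Python sketch of the two partitioning strategies discussed above.
# The job durations are invented for illustration; no pathos/pp involved.

def static_makespan(job_times, n_workers):
    """Pre-partition jobs into n_workers strided chunks (M/N per worker)."""
    chunks = [job_times[i::n_workers] for i in range(n_workers)]
    return max(sum(chunk) for chunk in chunks)

def dynamic_makespan(job_times, n_workers):
    """Hand the next job to whichever worker frees up first
    ("draw a job when one completes")."""
    finish = [0.0] * n_workers
    for t in job_times:
        w = finish.index(min(finish))  # earliest-free worker draws the job
        finish[w] += t
    return max(finish)

jobs = [10, 1, 1, 1]  # one long job and three short ones
print(static_makespan(jobs, 2))   # -> 11: static chunking leaves one worker idle
print(dynamic_makespan(jobs, 2))  # -> 10.0: dynamic drawing balances the load
```

With uneven job times the dynamic policy finishes sooner, which is why the choice of partitioning matters once real clusters with different speeds are in play.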
What are you seeing? Is one server (e.g. server02) running all the jobs? When you run ppserver there's a -h option to print help. I use -d, -p, and sometimes -w, and not -a. You can also establish an ssh (or similar) connection and start up a ppserver remotely. There's a script that gets installed that establishes an ssh connection, like a tunnel, and will also launch a ppserver through the tunnel. See the pathos_connect script.
There are some outdated video demos (no sound) here: pp_pathos, pp_mystic. The API and names of objects have changed since then, but the examples still work. Updated versions of the examples are in the examples folder for pathos, and the script that starts the tunnel and servers is now called pathos_connect.
I'd first try a pool of cluster02 only, and cluster03 only.
If I set only the servers, it runs locally.
If I specify nodes=0, it runs on a server (but only the last one in the tuple).
I tried to understand the logic here: https://github.com/uqfoundation/pathos/blob/675beb712b519eee16ab6dc3c3ece6f93792f3b4/pathos/parallel.py#L193-L220 and I suspect that the allocation does not really work.
Because when I use parallelpython directly like this:
>>> import pp
>>> def fun(x):
... import time
... import platform
... print("hello from {}".format(platform.node()))
... time.sleep(10)
... return x**2
...
>>> x = pp.Server(ppservers=('cluster02:31337', 'cluster03:31337'))
>>> for i in range(8+16+16):
... x.submit(fun, (i, ))
>>> x.print_stats()
Job execution statistics:
job count | % of all jobs | job time sum | time per job | job server
8 | 20.00 | 80.1860 | 10.023256 | local
16 | 40.00 | 160.1439 | 10.008993 | cluster02:31337
16 | 40.00 | 160.4629 | 10.028932 | cluster03:31337
Time elapsed since server creation 59.512935400009155
0 active tasks, 8 cores
I can see that all nodes were used (including local). However, that is not the case for pathos, even if more jobs are available than CPUs.
Sure. It's possible there's a bug in play with pathos in handling the servers. Just to confirm: you tested with the script you posted initially (going through pathos) and then with the second script (using ppft), and you see different behavior. There were no other differences in what you did. Is that correct?
> There were no other differences in what you did. Is that correct?
Yes, so to summarize:

- p = Pool(servers=('cluster02:31337', 'cluster03:31337')) runs only locally
- p = Pool(nodes=0, servers=('cluster02:31337', 'cluster03:31337')) starts on the last server in the list
- any other value for ncpus or nodes also only runs locally (so far I have not found a configuration that works)
- using pp.Server directly with just ppservers as argument seems to distribute correctly
So... you found a bug. What happened was that there was a map(fun, *args), and while that was fine for Python 2, it needs to be list(map(fun, *args)) for Python 3, where map is lazy. Fixed in: https://github.com/uqfoundation/pathos/commit/4e2a19adb2afdeff076268d275e1d531707c8048. Thanks for pointing this out.
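For context on why that small change matters: in Python 3, map returns a lazy iterator, so if its result is never consumed, the mapped function never runs at all, and no jobs get submitted. A minimal stdlib illustration (submit here is just a stand-in for a job submission, not a pathos or pp API):

```python
calls = []

def submit(x):
    # stand-in for a job submission; records that it actually ran
    calls.append(x)
    return x ** 2

lazy = map(submit, range(3))  # Python 3: nothing has run yet
assert calls == []            # no "jobs" were submitted

results = list(lazy)          # forcing the iterator submits the jobs
assert calls == [0, 1, 2]
assert results == [0, 1, 4]
```

In Python 2, map returned a fully evaluated list, so the same line eagerly submitted everything, which is exactly why the behavior differed between the two versions.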
I'm going to close this, but feel free to post another issue if you run into anything else.
:D I would not have expected that :) Thanks for fixing it! I'll test it and see if I have any more problems.
I found pathos via stackoverflow and so far it looks quite promising. I have a cluster here and I want to test it by running a small function on every node. However, I cannot get it to work, not even on a single node...
My current setup: all the nodes have the same python and pathos versions. On the node cluster02 I ran ppserver -p 31337 -d. My start script looks like this: So far I get the correct result, but I only see the hostname of the host where I run the script. I read in #92 that I have to set the number of CPUs on the local node to 0, but where is this function to set the CPUs? In the stats I only see
So it looks like the cluster node was never allocated? On the node however, I see a lot of messages like this:
What is wrong here?
And an additional question: is there a mode where pathos starts the remote server on its own? For example, if I have ssh access, could the script not start the servers at startup?
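As noted earlier in the thread, the pathos_connect script is the shipped way to launch a ppserver over an ssh tunnel. As a rough manual equivalent, a script can build the ssh command itself. This is a hedged sketch: the host name and the ppserver flags (-p, -d) are taken from this thread, and the actual subprocess call is commented out because it requires real ssh access to the node.

```python
import subprocess

host = "cluster02"  # placeholder node name from this thread
port = 31337

# Build the command that would launch a ppserver on the remote node over
# ssh; -f backgrounds ssh after authentication. Left commented out here
# because it needs a reachable host with ssh access and ppserver installed.
cmd = ["ssh", "-f", host, "ppserver -p {} -d".format(port)]
# subprocess.run(cmd, check=True)

print(" ".join(cmd))  # -> ssh -f cluster02 ppserver -p 31337 -d
```

The pathos_connect script additionally sets up the tunnel so the client can reach the server through the forwarded port, which a bare ssh launch like this does not do.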