uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

Configuration of cluster nodes in ParallelPool #190

Closed · reox closed this issue 4 years ago

reox commented 4 years ago

I have now tested with the latest master and can get it to work; however, ncpus and nodes behave oddly. I'll try to document my findings here.

This code distributes the load ONLY to the cluster nodes; no process is started on the local machine:

def squared(x):
    # import inside the function so the remote workers can resolve it
    import platform
    print("x={} -- hello from: {}".format(x, platform.node()))
    return x**2

from pathos.pools import ParallelPool as Pool
from pathos.pp import stats

# nodes=0: no local workers, so all jobs go to the remote ppservers
p = Pool(nodes=0, servers=('cluster02:31337', 'cluster03:31337'))
x = list(range(0, 64))
y = p.map(squared, x)
print(y)
print(stats())  # per-server job statistics

Also using ncpus=0 in Pool has the same effect.

It looks like setting ncpus or nodes has the same effect, so the two parameters seem redundant to me. If 0 < nodes/ncpus <= ncpus_of_local_node, processes are started ONLY locally. If nodes/ncpus == 0 or nodes/ncpus > ncpus_of_local_node, nodes/ncpus processes are started locally and the remaining jobs are distributed round-robin over the cluster (cluster02, cluster03, cluster02, cluster03, ...). This behaviour does not appear to be deterministic, though: in some test runs I got a "nice pattern", but in others the majority of jobs was spawned locally. The most deterministic option is to set nodes/ncpus to zero, which distributes the jobs roughly evenly over the cluster (in most cases I got a 30/34 or 31/33 split, so it is not perfectly balanced).
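A minimal sketch of the two configurations described above (hostnames and port are the placeholders from my example; the placement is what I observed, not documented behaviour):

from pathos.pools import ParallelPool as Pool

servers = ('cluster02:31337', 'cluster03:31337')

# nodes=0: no local workers, jobs are spread (roughly evenly) over the remote servers
remote_only = Pool(nodes=0, servers=servers)

# 0 < nodes <= local core count: in my tests, all jobs stayed on the local machine
local_only = Pool(nodes=4, servers=servers)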

mmckerns commented 4 years ago

Yes, ncpus and nodes are redundant. One is an artifact of a prior interface.

This is roughly what I'd expect. If you have 4 local nodes and 4 remote nodes pointed at the cluster, then 4 jobs will all run locally; with 6 jobs, 4 run locally and 2 remotely. You might want to try setting up an ssh tunnel to a local port instead of a direct connection to the remote server. That's what I have done more often, so your servers would look like servers=('localhost:1234', 'localhost:5678'). I believe that still has the same effect though, which is to generally prefer local nodes.
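A rough sketch of that setup, using plain ssh for the tunnels (the local ports 1234/5678 and the ppserver port 31337 are placeholders taken from the example above):

import subprocess

# forward a local port to the ppserver port on each remote node
tunnels = [
    subprocess.Popen(["ssh", "-N", "-L", "1234:localhost:31337", "cluster02"]),
    subprocess.Popen(["ssh", "-N", "-L", "5678:localhost:31337", "cluster03"]),
]

from pathos.pools import ParallelPool as Pool
# the pool now talks to the local ends of the tunnels
p = Pool(nodes=0, servers=('localhost:1234', 'localhost:5678'))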

reox commented 4 years ago

Okay, thanks for the confirmation :) Setting it to zero is actually the best approach for me: one node does all the orchestration and the others are just worker nodes.

I played around with the tunnel; however, I had the problem that if the script crashed, it left all the tunnels open and I had to kill the ssh sessions by hand afterwards.
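One way I could avoid stray tunnels when the script raises an exception is to open them in a try/finally, a sketch assuming subprocess-managed ssh tunnels as above (run_jobs is a hypothetical placeholder for the ParallelPool work; a hard kill of the interpreter would still leave the tunnels running):

import subprocess

specs = [("cluster02", 1234), ("cluster03", 5678)]  # placeholder hosts / local ports
tunnels = [subprocess.Popen(["ssh", "-N", "-L", "{}:localhost:31337".format(lport), host])
           for host, lport in specs]
try:
    run_jobs()  # hypothetical: the pool.map(...) work goes here
finally:
    for t in tunnels:
        t.terminate()  # close the ssh sessions even if run_jobs() raises
        t.wait()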

Yet another beginner's question: what would be the best way to do logging? I currently use the Python logger quite a lot, and I also want to collect the log output from the worker nodes. Do I need to set up a logging server and collect the logs there? Just getting some output is easy; I only have to configure a console handler in the called function.
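One option I'm considering (not pathos-specific, just the stdlib) is to attach a SocketHandler inside the worker function and run a small TCP log listener on the orchestrating node; "loghost" below is a placeholder for whatever machine runs that listener:

def squared(x):
    # imports stay inside the function so remote workers can resolve them
    import logging, logging.handlers, platform
    log = logging.getLogger("worker")
    if not log.handlers:
        # ship records to a central collector listening on the default TCP logging port
        log.addHandler(logging.handlers.SocketHandler(
            "loghost", logging.handlers.DEFAULT_TCP_LOGGING_PORT))
        log.setLevel(logging.INFO)
    log.info("x=%s on %s", x, platform.node())
    return x**2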

mmckerns commented 4 years ago

Yes, that can be an issue with using a tunnel... which is why the job id is printed when the tunnel starts, in case you need to kill it by hand. I do have some tools in pathos.core and elsewhere that attempt to do some cleanup when processes terminate unexpectedly, but there's no finally that ensures the tunnels are killed when a script crashes -- primarily because the tunnels are independent of the scripts running through them.

With respect to logging, I'm not certain what you want to do. Are you logging to a file, or to stdout on some remote server...? Maybe best to ask in a separate question. Closing this issue.