reox closed this 4 years ago
Yes, `ncpus` and `nodes` are redundant. One is an artifact of a prior interface.
This is roughly what I'd expect. If you have 4 local nodes and 4 remote nodes pointed at "cluster", then with 4 jobs they will all run locally; with 6 jobs, 4 will run locally and 2 remotely. You might want to try setting up an ssh tunnel to a local port instead of a direct connection to the remote server; that's what I have done more often. Your servers tuple then looks like `servers=('localhost:1234', 'localhost:5678')`. I believe that still has the same effect, though, which is to generally prefer local nodes.
Okay thanks for the confirmation :) Setting it to zero is actually the best method for me, so I have one node which does all the orchestration and the others are just worker nodes.
I played around with the tunnel; however, if the script crashed it would leave all the tunnels open, and I had to kill the ssh sessions by hand afterwards.
Yet another beginner's question: what would be the best method for logging? Currently I use the Python logger quite a lot, and I also want to collect the log output from the worker nodes. Do I need to set up a logging server and collect the logs from there? Just having some output is easy; I only have to configure a console handler in the called function.
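What I have in mind is roughly the stdlib socket-logging pattern, where each worker attaches a `SocketHandler` pointing at a collector on the head node. A self-contained sketch (here `localhost` stands in for the head node's hostname, and both sides run in one process for illustration):

```python
import logging
import logging.handlers
import pickle
import socketserver
import struct
import threading
import time

received = []  # LogRecords collected on the head node

class LogRecordStreamHandler(socketserver.StreamRequestHandler):
    # unpickles length-prefixed LogRecords, as sent by logging's SocketHandler
    def handle(self):
        while True:
            header = self.rfile.read(4)
            if len(header) < 4:
                break
            slen = struct.unpack('>L', header)[0]
            data = self.rfile.read(slen)
            received.append(logging.makeLogRecord(pickle.loads(data)))

# head node: listen on a free port for records from the workers
server = socketserver.ThreadingTCPServer(('localhost', 0), LogRecordStreamHandler)
server.daemon_threads = True
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# worker side: inside the called function, attach a SocketHandler
# that points at the head node
log = logging.getLogger('worker')
log.setLevel(logging.INFO)
handler = logging.handlers.SocketHandler('localhost', port)
log.addHandler(handler)
log.info('hello from worker')
handler.close()

# wait briefly for the record to arrive, then shut down
deadline = time.time() + 5
while not received and time.time() < deadline:
    time.sleep(0.05)
server.shutdown()
print(received[0].getMessage())
```

In a real setup the collector would write the records to a file or forward them to the head node's own logger, instead of appending to a list.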
Yes, that can be an issue with using a tunnel... which is why I have the job id printed when the tunnel starts, just in case you need to kill it by hand. I do have some tools in `pathos.core` and elsewhere which attempt to do some cleanup when processes terminate unexpectedly, but there's no `finally` that ensures the tunnels are killed when a script crashes -- primarily because the tunnels are independent of the scripts running through them.
With respect to logging, I'm not certain what you want to do. Are you logging to a file, or to stdout on some remote server...? Maybe best to ask in a separate question. Closing this issue.
I tested now with the latest master and I can get it to work; however, `ncpus` and `nodes` behave weirdly... I'll try to document my findings here. This code distributes the load ONLY to the cluster nodes, with no process started on the local machine:
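Roughly, the setup looks like this (a sketch with placeholder hostnames and ports; it assumes `ppserver` instances are already running on the cluster nodes):

```python
try:
    from pathos.pools import ParallelPool  # pip install pathos
except ImportError:  # pathos may not be installed; the sketch still shows the call
    ParallelPool = None

def task(x):
    return x * x

def make_remote_only_pool():
    # ncpus=0 -> start no local worker processes; every job is dispatched
    # to the listed ppserver instances (hostnames/ports are placeholders)
    return ParallelPool(ncpus=0, servers=('cluster02:35000', 'cluster03:35000'))

# usage, given running ppservers:
#   pool = make_remote_only_pool()
#   results = pool.map(task, range(64))
```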
Also, using `ncpus=0` in `Pool` has the same effect. It looks like setting `ncpus` or `nodes` has the same effect, so to me the two parameters look redundant. If `0 < nodes/ncpus <= ncpus_of_local_node`, it will ONLY start processes locally. If `nodes/ncpus == 0` or `nodes/ncpus > ncpus_of_local_node`, it will start `nodes/ncpus` processes locally and then start distributing to the cluster (cluster02, cluster03, cluster02, cluster03, ...). But this behaviour is not always deterministic: while testing, I found that in some cases you get a "nice pattern", but in other instances the majority of processes is spawned locally. The most deterministic option is to set `nodes`/`ncpus` to zero, which seems to distribute the jobs evenly across the cluster (though in most cases I would get a 30/34 or 31/33 split, so it is not split perfectly).