lukas-koschmieder opened 3 years ago
The program hangs in the while loop at https://github.com/ipython/ipyparallel/blob/527d7b2264c5ecca012c6d248990dc18c1058834/ipyparallel/cluster/launcher.py#L339 because the config files created in `PROFILE/security` are named `ipcontroller-client.json` and `ipcontroller-engine.json`, whereas the program expects the filenames to include the `cluster_id`, e.g. `ipcontroller-1635416202-ou8n-client.json` and `ipcontroller-1635416202-ou8n-engine.json`. Is this also fixed in #606?
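For illustration, a minimal sketch of the wait described above (not the actual launcher code; the real loop is at the link above, and the names and arguments here are simplified):

```python
import os
import time

# sketch of the polling loop: the launcher waits for connection files whose
# names include the cluster_id, so a plain ipcontroller-client.json never matches
def wait_for_connection_files(security_dir, cluster_id):
    expected = [
        os.path.join(security_dir, f"ipcontroller-{cluster_id}-client.json"),
        os.path.join(security_dir, f"ipcontroller-{cluster_id}-engine.json"),
    ]
    while not all(os.path.exists(path) for path in expected):
        time.sleep(0.1)  # hangs forever if the files were created without cluster_id
```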
This is why I need to get CI tests for all the non-slurm batch launchers (#604)!
I do believe the issue is fixed in dev, but those custom templates will still reintroduce the problem. If you add `--cluster-id={cluster_id}`, it will be fixed. The generic fix is to use `{program_and_args}` instead of `ipengine --profile-dir={profile_dir}`.
I believe these templates will work:
```python
c.PBSControllerLauncher.batch_template = '''
#PBS -N ipcontroller
#PBS -V
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR
conda activate ipp7
{program_and_args}
'''

c.PBSEngineSetLauncher.batch_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -V
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4
cd $PBS_O_WORKDIR
conda activate ipp7
module load intel
mpiexec -n {n} {program_and_args}
'''
```
The next release uses environment variables to pass things like the cluster id, which means you must add `#PBS -V` to your options.
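A quick way to check from inside a submitted job that the submitting environment was actually forwarded (the `IPP_` prefix here is an assumption; inspect `os.environ` yourself to see what your ipyparallel version actually sets):

```python
import os

# list any ipyparallel-related variables that made it into the job environment;
# an empty result suggests '#PBS -V' is missing from the batch template
ipp_vars = {k: v for k, v in os.environ.items() if k.startswith("IPP_")}
print(ipp_vars or "no IPP_* variables found - is '#PBS -V' in your template?")
```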
Thank you for the quick reply! The general method works. 👍
Is it possible to instantiate a `Cluster` without an existing IPython profile and `ipcluster_config.py`, by passing `c.PBSControllerLauncher.batch_template` and `c.PBSEngineSetLauncher.batch_template` somehow directly to the class constructor from a Jupyter Notebook?
Pseudocode:
```python
controller_template = '''
#PBS -N ipcontroller
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
##PBS -q {queue}
cd $PBS_O_WORKDIR
conda activate ipp7
{program_and_args}
'''

engine_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4
##PBS -q {queue}
cd $PBS_O_WORKDIR
conda activate ipp7
module load intel
mpiexec -n {n} {program_and_args}
'''

cluster = ipp.Cluster(
    n=4,
    controller_ip='*',
    profile='pbs-2021-10-28',
    extra_options={
        'c.PBSControllerLauncher.batch_template': controller_template,
        'c.PBSEngineSetLauncher.batch_template': engine_template,
    })
```
Yes! You populate the `cluster.config` object, which is the same as `c` in your `ipcluster_config.py`:
```python
import ipyparallel as ipp

cluster = ipp.Cluster(
    n=4,
    controller_ip='*',
    profile='pbs-2021-10-28',
)
# this is the same config object you would configure in ipcluster_config.py;
# you don't have to call it `c`, but if you do, the rest will look familiar
c = cluster.config
c.PBSControllerLauncher.batch_template = controller_template
c.PBSEngineSetLauncher.batch_template = engine_template
await cluster.start_cluster()
```
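Once the cluster is up, connecting a client from the same notebook might look like this (a sketch assuming the `connect_client` / `wait_for_engines` helpers of ipyparallel 7+):

```python
# connect a client to the running cluster and block until all engines register
rc = await cluster.connect_client()
rc.wait_for_engines(4)
print(rc.ids)  # expected: [0, 1, 2, 3]
```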
Fantastic! Thank you!
Adding lots of examples to my documentation todo list...
I'll make an 8.0 beta tomorrow. It would be great if you could test it out!
Okay, great, I will test it.
I've got another question and potential point for the documentation todo list: how do you configure the controller dynamically in Python / Jupyter Notebook (replacement for `ipcontroller_config.py`)? For instance, how would you set `c.HubFactory.ip = '*'`?
That can be `c.Cluster.controller_ip` via config, or since it's on the Cluster object, it can be a constructor argument: `Cluster(controller_ip="*")`.

`HubFactory` is removed and replaced by `IPController`, so if you do still have an `ipcontroller_config.py`, it would be `c.IPController.ip = '*'`.
The ambiguity is because there are really two things you are configuring: the cluster (launchers) and the controller process itself.
Some common options for configuring the controller itself can be set on the Cluster, but for the most part ipcontroller is configured directly through either `ipcontroller_config.py` or `ControllerLauncher.controller_args`. `Cluster(controller_ip="*")` is really a shortcut for `c.ControllerLauncher.controller_args.append("--ip=*")`.
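In other words, per the above, these two spellings should be equivalent (a sketch):

```python
import ipyparallel as ipp

# the constructor shortcut...
cluster = ipp.Cluster(controller_ip="*")

# ...versus the explicit launcher configuration it expands to
cluster = ipp.Cluster()
cluster.config.ControllerLauncher.controller_args = ["--ip=*"]
```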
I just published `8.0.0b1`, if you could give it a try.
Okay, I'm currently in the middle of something but I will give it a try this afternoon/evening.
No rush! I've only got a few more minutes of work before the weekend. I'll probably aim to do a release around the end of next week.
I've installed the new beta version `8.0.0b1` and everything is looking fine, except for `start_cluster_sync`, which now appears to be significantly slower. The controller starts immediately, but there is some noticeable delay before the engines come up. I haven't tested whether this delay scales with the number of engines. I was using 4 engines in my test.
```python
import time

start = time.time()
cluster.start_cluster_sync()
end = time.time()
print(end - start)
```
8.0.0b1 output:

```
Job submitted with job id: '23909'
Starting 4 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23910'
30.14998745918274
```

7.1.0 (conda-forge) output:

```
Job submitted with job id: '23911'
Starting 6 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23912'
1.15324068069458
```
Edit: If the release is next week, unfortunately I won't be able to participate in additional beta testing because I am on holiday until Nov 8th.
> except for start_cluster_sync, which now appears to be significantly slower.
That makes sense. It's the new `Cluster.send_engines_connection_env` option, which means that by default `start_cluster` waits for the controller to finish starting before starting the engines, because the connection info is passed via environment through the Launcher. To disable this and rely on the connection files on disk (pre-8.0 behavior):
```python
cluster = Cluster(
    send_engines_connection_env=False,
    engines='pbs',
    controller='pbs',
    controller_ip='*',
)
```
then the engine and controller jobs should both be submitted immediately.
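Presumably the same option can also be set on the config object, e.g. in `ipcluster_config.py`:

```python
c.Cluster.send_engines_connection_env = False
```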
@lukas-koschmieder I just published 8.0.0rc1. Can you test and then close here if you think everything is resolved?
I have been using ipyparallel 6 for a while and would like to migrate to ipyparallel 7, mainly due to the fact that the new Cluster API enables you to manage the entire process through a Jupyter Notebook. Unfortunately, I am having difficulty connecting a client to my cluster.

I have created a new IPython profile, adding a custom `ipcluster_config.py`, which is a modified version of my existing/working config for ipp 6 (see below). I can successfully start a cluster spawning two PBS jobs (controller and engine). But if I run the following line, the notebook will only show that the kernel is busy and it will never actually finish. Am I using the API incorrectly? What might be the problem?

`ipcluster_config.py`