PBSControllerLauncher: Unable to connect_client_sync()

lukas-koschmieder commented 3 years ago

I have been using ipyparallel 6 for a while and would like to migrate to ipyparallel 7, mainly because the new Cluster API lets you manage the entire process from a Jupyter Notebook. Unfortunately, I am having trouble connecting a client to my cluster.

I have created a new IPython profile adding a custom ipcluster_config.py, which is a modified version of my existing/working config for ipp 6 (see below).

I can successfully start a cluster spawning two PBS jobs (controller and engine).

import ipyparallel as ipp

cluster=ipp.Cluster(
    n=128, 
    controller_ip='*',
    profile='pbs'
)

# Using existing profile dir: '/network/datamic/home/lukask/.aixvipmap/.ipython/profile_pbs'

await cluster.start_cluster()

# Job submitted with job id: '23777'
# Starting 4 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
# Job submitted with job id: '23778'
# <Cluster(cluster_id='1635407068-xsc7', profile='pbs', controller=<running>, engine_sets=['1635407070'])>

But if I run the following line, the notebook will only show that the kernel is busy and it will never actually finish.

rc = cluster.connect_client_sync()

Am I using the API incorrectly? What might be the problem?

ipcluster_config.py

c.Cluster.engine_launcher_class = 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'

c.Cluster.controller_launcher_class = 'ipyparallel.cluster.launcher.PBSControllerLauncher'

c.PBSControllerLauncher.batch_template = '''
#PBS -N ipcontroller
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

conda activate ipp7

ipcontroller --profile-dir={profile_dir}
'''

c.PBSEngineSetLauncher.batch_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} ipengine --profile-dir={profile_dir}
'''

lukas-koschmieder commented 3 years ago

The program hangs in the while loop at https://github.com/ipython/ipyparallel/blob/527d7b2264c5ecca012c6d248990dc18c1058834/ipyparallel/cluster/launcher.py#L339 because the config files created in PROFILE/security are named ipcontroller-client.json and ipcontroller-engine.json whereas the program expects the filenames to include cluster_id, e.g. ipcontroller-1635416202-ou8n-client.json, ipcontroller-1635416202-ou8n-engine.json. Is this also fixed in #606?
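
One way to confirm this kind of mismatch is to check which of the expected connection files actually exist in PROFILE/security. The sketch below is illustrative only: connection_file_names and missing_connection_files are hypothetical helpers, not part of ipyparallel, and the naming pattern is inferred from the file names quoted above.

```python
import os


def connection_file_names(cluster_id=""):
    """Return the (client, engine) connection file names for a cluster id.

    Hypothetical helper: the pattern is inferred from the names seen in this
    thread, e.g. ipcontroller-<cluster_id>-client.json.
    """
    infix = f"-{cluster_id}" if cluster_id else ""
    return (
        f"ipcontroller{infix}-client.json",
        f"ipcontroller{infix}-engine.json",
    )


def missing_connection_files(security_dir, cluster_id):
    """List the expected connection files that are absent from security_dir."""
    return [
        name
        for name in connection_file_names(cluster_id)
        if not os.path.exists(os.path.join(security_dir, name))
    ]
```

If missing_connection_files(security_dir, cluster_id) returns both names while the files without a cluster id are present, the controller was most likely started without the cluster id.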

minrk commented 3 years ago

This is why I need to get CI tests for all the non-slurm batch launchers (#604)!

I do believe the issue is fixed in dev, but those custom templates will still reintroduce the problem. If you add --cluster-id={cluster_id} it will be fixed. The generic fix is to use {program_and_args} instead of ipengine --profile-dir={profile_dir}.

I believe these templates will work:

c.PBSControllerLauncher.batch_template = '''
#PBS -N ipcontroller
#PBS -V
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

conda activate ipp7

{program_and_args}
'''

c.PBSEngineSetLauncher.batch_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -V
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} {program_and_args}
'''

The next release uses environment variables to pass things like the cluster id, which means you must add #PBS -V to your options.

lukas-koschmieder commented 3 years ago

Thank you for the quick reply! The general method works. 👍

Is it possible to instantiate a Cluster without an existing IPython profile and ipcluster_config.py by passing c.PBSControllerLauncher.batch_template and c.PBSEngineSetLauncher.batch_template somehow directly to the class constructor from a Jupyter Notebook?

Pseudocode:

controller_template='''
#PBS -N ipcontroller
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
##PBS -q {queue}

cd $PBS_O_WORKDIR

conda activate ipp7

{program_and_args}
'''

engine_template = '''
#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=01:00:00
#PBS -l nodes={n//4}:ppn=4
##PBS -q {queue}

cd $PBS_O_WORKDIR

conda activate ipp7

module load intel
mpiexec -n {n} {program_and_args}
'''

cluster=ipp.Cluster(
    n=4, 
    controller_ip='*',
    profile='pbs-2021-10-28',
    extra_options={ 
        'c.PBSControllerLauncher.batch_template':controller_template,
        'c.PBSEngineSetLauncher.batch_template':engine_template
    })

minrk commented 3 years ago

Yes! You populate the cluster.config object, which is the same as c in your ipcluster_config.py:

cluster=ipp.Cluster(
    n=4, 
    controller_ip='*',
    profile='pbs-2021-10-28',
)
# this is the same config object you would configure in ipcluster_config.py
# you don't have to call it `c`, but if you do, the rest will look familiar
c = cluster.config

c.PBSControllerLauncher.batch_template = controller_template
c.PBSEngineSetLauncher.batch_template = engine_template

await cluster.start_cluster()

lukas-koschmieder commented 3 years ago

Fantastic! Thank you!

minrk commented 3 years ago

Adding lots of examples to my documentation todo list...

minrk commented 3 years ago

I'll make an 8.0 beta tomorrow. It would be great if you could test it out!

lukas-koschmieder commented 3 years ago

Okay, great, I will test it.

lukas-koschmieder commented 3 years ago

I've got another question and potential point for the documentation todo list: How do you configure the controller dynamically in Python / Jupyter Notebook (replacement for ipcontroller_config.py)? For instance, how would you set c.HubFactory.ip='*'?

minrk commented 3 years ago

That can be c.Cluster.controller_ip via config, or since it's on the Cluster object, it can be a constructor argument:

Cluster(controller_ip="*")

HubFactory is removed and replaced by IPController, so if you do still have an ipcontroller_config.py, it would be c.IPController.ip = '*'.

The ambiguity is because there are really two things you are configuring:

Cluster (and thereby Launchers) which start processes like ipcontroller, and
ipcontroller, ipengine themselves

Some common options for configuring the controller itself can be set on the Cluster, but for the most part ipcontroller is configured directly through either ipcontroller_config.py or ControllerLauncher.controller_args. Cluster(controller_ip="*") is really a shortcut for c.ControllerLauncher.controller_args.append("--ip=*").

minrk commented 3 years ago

I just published 8.0.0b1 if you could give it a try

lukas-koschmieder commented 3 years ago

Okay, I'm currently in the middle of something but I will give it a try this afternoon/evening.

minrk commented 3 years ago

No rush! I've only got a few more minutes of work before the weekend. I'll probably aim to do a release around the end of next week.

lukas-koschmieder commented 3 years ago

I've installed the new beta version 8.0.0b1 and everything is looking fine - except for start_cluster_sync, which now appears to be significantly slower. The controller starts immediately but there is some noticeable delay before the engines come up. I haven't tested if this delay scales with the number of engines. I was using 4 engines in my test.

import time
start = time.time()
cluster.start_cluster_sync()
end = time.time()
print(end - start)

`8.0.0b1` output

Job submitted with job id: '23909'
Starting 4 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23910'

30.14998745918274

`7.1.0` (conda-forge) output

Job submitted with job id: '23911'
Starting 6 engines with <class 'ipyparallel.cluster.launcher.PBSEngineSetLauncher'>
Job submitted with job id: '23912'

1.15324068069458

Edit: If the release is next week, unfortunately I won't be able to participate in additional beta testing because I am on holiday until Nov 8th.

minrk commented 3 years ago

> except for start_cluster_sync, which now appears to be significantly slower.

That makes sense. It's the new Cluster.send_engines_connection_env option, which means by default start_cluster waits for the controller to finish starting before starting the engines, because the connection info is passed via environment through the Launcher. To disable this and rely on the connection files on disk (pre-8.0 behavior):

cluster = Cluster(send_engines_connection_env=False, engines='pbs', controller='pbs', controller_ip='*')

then the engine and controller jobs should both be submitted immediately.
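
The difference in ordering can be pictured with a toy asyncio model. This is not ipyparallel's implementation: start_controller, start_engines, and start_cluster here are hypothetical stand-ins, and the sleep merely simulates the time the controller job spends queued and starting up.

```python
import asyncio


async def start_controller(events, startup_delay):
    """Simulate submitting the controller job and waiting until it is ready."""
    events.append("controller submitted")
    await asyncio.sleep(startup_delay)  # stands in for queue wait plus startup
    events.append("controller ready")


async def start_engines(events):
    """Simulate submitting the engine job."""
    events.append("engines submitted")


async def start_cluster(send_connection_env, startup_delay=0.05):
    """Return the order of events for the two start-up strategies."""
    events = []
    if send_connection_env:
        # Engines need connection info from the running controller,
        # so they can only be submitted after it is ready.
        await start_controller(events, startup_delay)
        await start_engines(events)
    else:
        # Engines read connection files from disk later,
        # so both jobs can be submitted right away.
        await asyncio.gather(
            start_controller(events, startup_delay),
            start_engines(events),
        )
    return events
```

Running asyncio.run(start_cluster(True)) and asyncio.run(start_cluster(False)) and comparing the returned lists shows that, in the first case, engine submission has to wait for however long the controller takes to come up.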

minrk commented 3 years ago

@lukas-koschmieder I just published 8.0.0rc1. Can you test and then close here if you think everything is resolved?
