radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Running 4 MPI-dependent tasks in parallel on two compute nodes, how? #3237

Open dz24 opened 1 day ago

dz24 commented 1 day ago

Hi, I am interested in using radical.pilot, and it works well for my purposes when running multiple independent Molecular Dynamics (MD) simulations on one node within one HPC job. I have a question and hope this is the right place to ask it: I cannot seem to make the following situation work:

I want to run 4 tasks, e.g. "mpirun -np 32 cp2k", on two nodes with 64 cores each, within one HPC job. The most efficient way to do this would be to run all 4 tasks in parallel, with each node working on two tasks at the same time. However, with different rp configurations, such as using one or two pilots, I get the same result: all four tasks appear to run on a single node (despite two nodes being available), so the performance is much worse than if I ran a job with only two tasks on one node. I cannot find anything in the documentation on how to make the individual tasks be distributed exclusively across the individual nodes.

PBS script:

#!/bin/sh
#PBS -l select=2:ncpus=64:mpiprocs=64:ompthreads=1
#PBS -l walltime=00:10:00

# module version
module -s purge
module -s load cp2k/2024.2

# run radical pilot script
source ~/venvs/p39/bin/activate
python3 rp_script.py

radical pilot script:

import radical.pilot as rp
import time
import os

PWD = os.path.abspath(os.path.dirname(__file__))

# Create a session
session = rp.Session()

# Create a pilot manager and a pilot
pmgr    = rp.PilotManager(session=session)
pd_init = {'resource': 'local.localhost',
           'runtime' : 30,
           'cores'   : 128}
pdesc   = rp.PilotDescription(pd_init)
pilot   = pmgr.submit_pilots(pdesc)

# Create a task manager and describe your tasks
tmgr = rp.TaskManager(session=session)
tmgr.add_pilots(pilot)
tds = list()
for i in range(4):
    td = rp.TaskDescription()
    td.executable     = '/apl/cp2k/2024.2/exe/rccs/cp2k.psmp'
    td.arguments      = ['H2O-64.inp']
    td.ranks          = 32
    td.cores_per_rank = 1
    td.threading_type = rp.OpenMP
    td.input_staging  = [f'{PWD}/H2O-64.inp']
    tds.append(td)
    print('submit', i)

# Submit your tasks for execution and wait for them to complete
tasks = tmgr.submit_tasks(tds)
start = time.perf_counter()
tmgr.wait_tasks()

print(f'time taken: {time.perf_counter() - start:.02f}')
for task in tasks:
    print(f"{task.stdout.strip()} {task.stderr.strip()}")

# Close your session
session.close(cleanup=True)

localhost json file:

{
    "localhost": {
        "description"                 : "Your local machine.",
        "default_queue"               : "H",
        "notes"                       : "To use the ssh schema, make sure that ssh access to localhost is enabled.",
        "default_schema"              : "local",
        "schemas"                     : {
            "ssh"                     : {
                "job_manager_endpoint": "ssh://localhost",
                "filesystem_endpoint" : "sftp://localhost"
            },
            "local"                   : {
                "job_manager_endpoint": "fork://localhost",
                "filesystem_endpoint" : "file://localhost"
            }
        },
        "default_remote_workdir"      : "$HOME",
        "resource_manager"            : "FORK",
        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "launch_methods"              : {
                                         "order" : ["MPIRUN"],
                                         "MPIRUN" : {
                                             "pre_exec_cached": [
                                                 "module load 2024",
                                                 "module -s load cp2k/2024.2",
                                                 "module load cuda/12.2u2"
                                             ]
                                         }
                                        },
        "pre_bootstrap_0"             : ["module -s load cp2k/2024.2",
                                         "module load 2024",
                                         "module load cuda/12.2u2"],
        "rp_version"                  : "installed",
        "virtenv_mode"                : "local",
        "python_dist"                 : "default",
        "cores_per_node"              : 64,
        "gpus_per_node"               : 1,
        "lfs_path_per_node"           : "/tmp",
        "lfs_size_per_node"           : 1024,
        "mem_per_node"                : 504000,
        "fake_resources"              : true,
        "raptor"                      : {
                                         "hb_delay"     : 100,
                                         "hb_timeout"   : 400,
                                         "hb_frequency" : 100}
    }
}

This problem may be "trivial" with SLURM+srun, but I am running on a PBS cluster which only offers mpirun. Is what I am asking for possible with radical.pilot?

edit: The evidence for why I believe all tasks end up running on one node is that, grepping various rp files, every HOSTNAME refers to one specific node, e.g.

pilot.0001/env/lm_mpirun.sh:export HOSTNAME='node1'

and nothing comes up when searching for 'node2'. I imagine the problem could be fixed by modifying the -host input, which currently is

pilot.0001/task.000003/task.000003.launch.sh:  mpirun  -np 32   -host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
andre-merzky commented 21 hours ago

Hi @dz24,

Yes, this is the right place to ask - thanks for opening the ticket!

You are overwriting the configuration for local.localhost. While this is not the source of your problem, I would still suggest creating a new json file (resource_dz24.json or so) instead, or at least adding a new entry (mycluster or so) to resource_local.json.
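For illustration only, a minimal sketch of what such a separate file could start out as - the file name, entry name, and location are placeholders, and the remaining keys from your current entry (resource_manager, launch_methods, rp_version, etc.) would carry over unchanged:

{
    "mycluster": {
        "description"           : "dz24's PBS cluster, 2 x 64-core nodes per job",
        "default_schema"        : "local",
        "schemas"               : {
            "local"             : {
                "job_manager_endpoint": "fork://localhost",
                "filesystem_endpoint" : "file://localhost"
            }
        },
        "default_remote_workdir": "$HOME",
        "cores_per_node"        : 64
    }
}

If I remember the naming convention correctly, a file named resource_dz24.json placed where RP picks up user configs (e.g. $HOME/.radical/pilot/configs/) would make this entry addressable as 'dz24.mycluster' in the PilotDescription's resource field.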

To the problem: RP is able to place your MPI tasks on different nodes as you expect. To do so, however, it needs information about which nodes are available in your specific job allocation. That information is usually provided by the cluster's resource manager (i.e., the batch system), which in your case is PBS. The resource config you pasted above, however, sets the resource_manager entry to FORK - and that resource manager interface only sees the node the RP agent is running on (node1), not the other nodes in your allocation.

So please change the resource_manager entry to PBSPRO, and let us know how that goes.
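For clarity, the relevant line in that config entry would then read (other keys unchanged):

        "resource_manager"            : "PBSPRO",

With PBSPRO, the agent should pick up the node list that PBS provides for the job allocation (typically via $PBS_NODEFILE, if I am not mistaken), so both of your nodes become visible to the agent scheduler and the four 32-rank tasks can be spread across them.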

Best, Andre.