radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Running 4 MPI-dependent tasks in parallel on two compute nodes, how? #3237

Open dz24 opened 1 day ago

dz24 commented 1 day ago

Hi, I am interested in using radical.pilot, and it works well for my purposes when running multiple independent Molecular Dynamics (MD) simulations on one node within one HPC job. I have a question and hope this is the right place to ask it: I cannot seem to make the following situation work:

I want to run 4 tasks, e.g. "mpirun -np 32 cp2k", on two nodes with 64 cores each, within one HPC job. The most efficient way to do this would be to run all 4 tasks in parallel, with each node working on two tasks at the same time. However, with different rp configurations, such as using one or two pilots, I get the same result: all four tasks appear to run on a single node (despite two nodes being available), so the performance is much worse than if I ran a job with only two tasks on one node. I cannot find anything in the documentation on how to make the individual tasks be distributed exclusively across the individual nodes.

PBS script:

#!/bin/sh
#PBS -l select=2:ncpus=64:mpiprocs=64:ompthreads=1
#PBS -l walltime=00:10:00

# module version
module -s purge
module -s load cp2k/2024.2

# run radical pilot script
source ~/venvs/p39/bin/activate
python3 rp_script.py

radical pilot script:

import radical.pilot as rp
import time
import os

PWD = os.path.abspath(os.path.dirname(__file__))

# Create a session
session = rp.Session()

# Create a pilot manager and a pilot
pmgr    = rp.PilotManager(session=session)
pd_init = {'resource': 'local.localhost',
           'runtime' : 30,
           'cores'   : 128}
pdesc   = rp.PilotDescription(pd_init)
pilot   = pmgr.submit_pilots(pdesc)

# Create a task manager and describe your tasks
tmgr = rp.TaskManager(session=session)
tmgr.add_pilots(pilot)
tds = list()
for i in range(4):
    td = rp.TaskDescription()
    td.executable     = '/apl/cp2k/2024.2/exe/rccs/cp2k.psmp'
    td.arguments      = ['H2O-64.inp']
    td.ranks          = 32
    td.cores_per_rank = 1
    td.threading_type = rp.OpenMP
    td.input_staging  = [f'{PWD}/H2O-64.inp']
    tds.append(td)
    print('submit', i)

# Submit your tasks for execution and wait for them to complete
tasks = tmgr.submit_tasks(tds)
start = time.perf_counter()
tmgr.wait_tasks()

print(f'time taken: {time.perf_counter() - start:.02f}')
for task in tasks:
    print(f"{task.stdout.strip()} {task.stderr.strip()}")

# Close your session
session.close(cleanup=True)

localhost json file:

{
    "localhost": {
        "description"                 : "Your local machine.",
        "default_queue"               : "H",
        "notes"                       : "To use the ssh schema, make sure that ssh access to localhost is enabled.",
        "default_schema"              : "local",
        "schemas"                     : {
            "ssh"                     : {
                "job_manager_endpoint": "ssh://localhost",
                "filesystem_endpoint" : "sftp://localhost"
            },
            "local"                   : {
                "job_manager_endpoint": "fork://localhost",
                "filesystem_endpoint" : "file://localhost"
            }
        },
        "default_remote_workdir"      : "$HOME",
        "resource_manager"            : "FORK",
        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "launch_methods"              : {
                                         "order" : ["MPIRUN"],
                                         "MPIRUN" : {
                                             "pre_exec_cached": [
                                                 "module load 2024",
                                                 "module -s load cp2k/2024.2",
                                                 "module load cuda/12.2u2"
                                             ]
                                         }
                                        },
        "pre_bootstrap_0"             : ["module -s load cp2k/2024.2",
                                         "module load 2024",
                                         "module load cuda/12.2u2"],
        "rp_version"                  : "installed",
        "virtenv_mode"                : "local",
        "python_dist"                 : "default",
        "cores_per_node"              : 64,
        "gpus_per_node"               : 1,
        "lfs_path_per_node"           : "/tmp",
        "lfs_size_per_node"           : 1024,
        "mem_per_node"                : 504000,
        "fake_resources"              : true,
        "raptor"                      : {
                                         "hb_delay"     : 100,
                                         "hb_timeout"   : 400,
                                         "hb_frequency" : 100}
    }
}

This problem may be "trivial" with SLURM+srun, but I am running on a PBS cluster which only offers mpirun. Is what I am asking for possible with radical.pilot?

edit: The evidence for why I believe all tasks end up running on one node is that, grepping various rp files, every HOSTNAME refers to one specific node, e.g.

pilot.0001/env/lm_mpirun.sh:export HOSTNAME='node1'

and nothing comes up when searching for 'node2'. I imagine the problem could be fixed by modifying the -host input, which currently is

pilot.0001/task.000003/task.000003.launch.sh:  mpirun  -np 32   -host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
andre-merzky commented 21 hours ago

Hi @dz24,

Yes, this is the right place to ask - thanks for opening the ticket!

You are overwriting the configuration for local.localhost. While this is not the source of your problem, I would still suggest creating a new json file (resource_dz24.json or so) instead, or at least adding a new entry (mycluster or so) to resource_local.json.
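For illustration only, a minimal sketch of what such a separate file could start out as - the file name, entry name, and location are placeholders, and the remaining keys from your current entry (resource_manager, launch_methods, rp_version, etc.) would carry over unchanged:

{
    "mycluster": {
        "description"           : "dz24's PBS cluster, 2 x 64-core nodes per job",
        "default_schema"        : "local",
        "schemas"               : {
            "local"             : {
                "job_manager_endpoint": "fork://localhost",
                "filesystem_endpoint" : "file://localhost"
            }
        },
        "default_remote_workdir": "$HOME",
        "cores_per_node"        : 64
    }
}

If I remember the naming convention correctly, a file named resource_dz24.json placed where RP picks up user configs (e.g. $HOME/.radical/pilot/configs/) would make this entry addressable as 'dz24.mycluster' in the PilotDescription's resource field.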

To the problem: RP is able to place your MPI tasks on different nodes as you expect. To do so, however, it needs information about which nodes are available in your specific job allocation. That information is usually provided by the cluster's resource manager (i.e., the batch system), which in your case is PBS. The resource config you pasted above, however, sets the resource_manager entry to FORK - and that resource manager interface only sees the node the RP agent is running on (node1), not the other nodes in your allocation.

So please change the resource_manager entry to PBSPRO, and let us know how that goes.
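For clarity, the relevant line in that config entry would then read (other keys unchanged):

        "resource_manager"            : "PBSPRO",

With PBSPRO, the agent should pick up the node list that PBS provides for the job allocation (typically via $PBS_NODEFILE, if I am not mistaken), so both of your nodes become visible to the agent scheduler and the four 32-rank tasks can be spread across them.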

Best, Andre.