dz24 opened 1 day ago
Hi @dz24,
Yes, this is the right place to ask - thanks for opening the ticket!
You are overwriting the configuration for `local.localhost`. While this is not the source of your problem, I would still suggest creating a new json file (`resource_dz24.json` or so), or at least adding a new entry (`mycluster` or so) to `resource_local.json`.
To the problem: RP is able to place your MPI tasks on different nodes as you expect. However, to do so it needs to obtain information about which nodes are available to your specific job allocation. That information is usually provided by the cluster's resource manager (i.e., batch system), which in your case is PBS. However, the resource config you pasted above shows the `resource_manager` entry as `fork` - and that resource manager interface will only see the node the RP agent is running on (`node1`), not the other nodes in your allocation.
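To illustrate the difference (a hypothetical sketch, not RP's actual implementation): PBS writes a node file, pointed to by the `$PBS_NODEFILE` environment variable, listing the nodes allocated to the job, and a PBS-aware resource manager can parse it to discover all allocated nodes. A `fork`-style interface never consults that file and only ever sees the local host:

```python
# Hypothetical sketch of node discovery via a PBS node file.  PBS writes
# one line per allocated slot into the file named by $PBS_NODEFILE; the
# node names below are placeholders matching the example in this thread.
example_nodefile = """node1
node1
node2
node2
"""

# Deduplicate the slot list to get the set of allocated nodes.
nodes = sorted(set(example_nodefile.split()))
print(nodes)  # a PBS-aware interface sees both nodes: ['node1', 'node2']
```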
So please change the `resource_manager` entry to `PBSPRO`, and let us know how that goes.
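For reference, a minimal sketch of what such a resource config entry might look like (the entry name, queue-related settings, and core count are placeholders to adapt to your cluster; the exact set of keys depends on your RP version):

```json
{
    "mycluster": {
        "description"     : "local PBS Pro cluster (placeholder values)",
        "resource_manager": "PBSPRO",
        "cores_per_node"  : 64
    }
}
```

The key point is only the `resource_manager` value: with `PBSPRO`, the agent queries the PBS allocation for the full node list instead of seeing just its own host.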
Best, Andre.
Hi, I am interested in using radical.pilot and it seems to work well for my purposes when it comes to running multiple independent Molecular Dynamics (MD) simulations on one node, within one HPC job. I have a question and hope this place is fine to ask it: I cannot seem to make the following situation work:
I want to run 4 tasks, e.g. `mpirun -np 32 cp2k`, on two nodes with 64 cores each, within one HPC job. The most efficient way to do this would be to run all 4 tasks in parallel, with each node working on two tasks at the same time. However, with different rp configurations, such as using one or two pilots, I seem to get the same result: all four tasks appear to run on one single node (despite two nodes being available), so the performance is much worse than if I ran a job with only two tasks on one node. I cannot find anything in the documentation on how to make the individual tasks be distributed exclusively to individual nodes.
PBS script:
radical pilot script:
localhost json file:
This problem may be "trivial" with SLURM+srun, but I am now running on a PBS cluster which only offers mpirun. Is what I am asking for possible with radical.pilot?
edit: My evidence that all tasks end up being run on one node is that, when grepping various rp files, every HOSTNAME refers to one specific node, e.g.
and nothing comes up when searching for `node2`. I imagine the problem could be fixed by modifying the `-host` input, which currently is