Open abilijang opened 5 years ago
On a cluster, THUNDER should be run with exactly one process per node. The CPU cores within each node are then used through threading.
Suppose that there are 20 CPU cores on each node and the job uses 10 nodes. I believe the configuration should be
#PBS -l nodes=10:ppn=1
and
mpirun --bynode -np 10 thunder_cpu demo.json
Moreover, please make sure that the value of the parameter "Number of Threads Per Process" is 20.
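A minimal sketch of the one-process-per-node setup described above, assuming the job runs under PBS with `ppn=1` so that `$PBS_NODEFILE` lists each allocated node. The helper names `count_nodes` and `build_cmd` are hypothetical and only illustrate the idea. The command is printed rather than executed, and the `--bynode` flag is copied from the reply; whether the installed `mpirun` accepts it is not checked here.

```shell
#!/bin/bash
# Count the distinct hosts in a PBS node file.
# With ppn=1 each node appears once; sort -u also covers ppn>1,
# where every node is listed once per requested core.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Build the launch command with one MPI rank per node.
# $1 = node file, $2 = THUNDER JSON parameter file
build_cmd() {
    local n
    n=$(count_nodes "$1")
    echo "mpirun --bynode -np ${n} thunder_cpu $2"
}

# Inside a PBS job, print the command that would be run.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "${PBS_NODEFILE}" ]; then
    build_cmd "${PBS_NODEFILE}" demo.json
fi
```

With 10 nodes allocated this prints the same `mpirun` line as in the reply; the 20 cores per node are then covered by the "Number of Threads Per Process" setting rather than by extra MPI ranks.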
Hi, specialists. I am trying to run THUNDER on our CPU cluster, which has 15 nodes with 20 cores each. We use PBS as our job scheduling system. Submitted jobs work fine with other programs such as RELION, but we have run into problems with THUNDER. Below are our original RELION job script and our THUNDER script.
RELION:
#!/bin/bash
# Inherit all current environment variables
#PBS -V
# Job name
#PBS -N Class2D/run1
# Keep Output and Error
#PBS -k eo
# Queue name
#PBS -q quick
# Specify the number of nodes and thread (ppn) for your job.
#PBS -l nodes=15:ppn=20
#################################
# Switch to the working directory;
cd $PBS_O_WORKDIR
# Environment
source ~/.bashrc
NP=`wc -l < $PBS_NODEFILE`
# Run:
echo "starting RELION..."
mpirun --bynode -np 300 `which relion_refine_mpi` --o Class2D/job001/run --i particles.star --dont_combine_weights_via_disc --pool 7 --ctf --iter 30 --tau2_fudge 2 --particle_diameter 420 --K 150 --flatten_solvent --zero_mask --oversampling 1 --psi_step 12 --offset_range 15 --offset_step 2 --norm --scale --j 1
echo "done"
THUNDER:
#!/bin/bash
# Inherit all current environment variables
#PBS -V
# Job name
#PBS -N Class2D/run1
# Keep Output and Error
#PBS -k eo
# Queue name
#PBS -q quick
# Specify the number of nodes and thread (ppn) for your job.
#PBS -l nodes=10:ppn=20
#################################
# Switch to the working directory;
cd $PBS_O_WORKDIR
# Environment
source ~/.bashrc
NP=`wc -l < $PBS_NODEFILE`
# Run:
echo "starting THUNDER"
mpirun --bynode -np 200 thunder_cpu demo.json
echo "done"
The THUNDER job runs only on the master node, and the error message says that the --bynode argument is not recognized, whereas the RELION job runs fine without this message. Does anyone have any ideas?
Thanks for your help, Shuangbo