I have been learning to use SGE recently. I tried to submit a job that runs in parallel across different nodes, since it needs a lot of memory. I followed the guide, but the job stays in the 'qw' state. After two days of searching on Google I still don't know how to solve it; any suggestions would be appreciated!
Here is what I have done, along with some information that I think may be useful:
compile MPI:
cd $HOME
cp /opt/mpi-tests/src/*.c .
cp /opt/mpi-tests/src/Makefile .
make
edit my script "soap.qsub":
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
/opt/openmpi/bin/mpirun myCommand
submit the job:
qsub -pe orte 30 script.qsub
monitor the job:
lcy01@console /data7/lcy/zhongxm/NGS $ qstat
job-ID  prior    name       user   state  submit/start at      queue  slots  ja-task-ID
3842    0.60500  soap.qsub  lcy01  qw     05/10/2016 15:00:11         16
then the information about the job:
lcy01@console /data7/lcy/zhongxm/NGS $ qstat -j 3842
job_number: 3842
exec_file: job_scripts/3842
submission_time: Tue May 10 15:00:11 2016
owner: lcy01
uid: 514
group: mobile
gid: 505
sge_o_home: /data8/lcy01
sge_o_log_name: lcy01
sge_o_path: /data8/lcy01/bin:/data8/lcy01/script:/data8/lcy01/tools/root/bin:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/gridengine/bin/lx26-amd64:/data8/lcy01/bin
sge_o_shell: /bin/bash
sge_o_workdir: /data7/lcy/zhongxm/NGS/assembly/soap/demo
sge_o_host: console
account: sge
cwd: /data7/lcy/zhongxm/NGS/assembly/soap/demo
merge: y
mail_list: lcy01@console.local
notify: FALSE
job_name: soap.qsub
jobshare: 0
shell_list: NONE:/bin/bash
env_list:
script_file: soap.qsub
parallel environment: orte range: 16
scheduling info: queue instance "all.q@c0103.local" dropped because it is temporarily not available
queue instance "all.q@c0213.local" dropped because it is temporarily not available
queue instance "all.q@c0216.local" dropped because it is temporarily not available
queue instance "all.q@c0203.local" dropped because it is temporarily not available
queue instance "all.q@c0206.local" dropped because it is temporarily not available
queue instance "all.q@c0205.local" dropped because it is temporarily not available
queue instance "all.q@c0210.local" dropped because it is temporarily not available
queue instance "all.q@smp02.local" dropped because it is temporarily not available
queue instance "all.q@smp01.local" dropped because it is temporarily not available
queue instance "all.q@smp03.local" dropped because it is temporarily not available
queue instance "all.q@c0218.local" dropped because it is temporarily not available
queue instance "all.q@c0204.local" dropped because it is temporarily not available
cannot run in queue "all.q" because PE "orte" is not in pe list
cannot run in PE "orte" because it only offers 0 slots
related information:
lcy01@console /data7/lcy/zhongxm/NGS $ qconf -sql
all.q
lcy01@console /data7/lcy/zhongxm/NGS $ qconf -spl
mpi
mpich
orte
lcy01@console /data7/lcy/zhongxm/NGS $ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
lcy01@console /data7/lcy/zhongxm/NGS $ qhost
HOSTNAME  ARCH        NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO   SWAPUS
global    -           -     -      -       -       -        -
c0102     lx26-amd64  16    0.02   47.2G   700.3M  1000.0M  4.5M
c0103     lx26-amd64  16    0.01   47.2G   694.0M  1000.0M  0.0
c0104     lx26-amd64  16    0.03   47.2G   733.8M  1000.0M  2.7M
c0105     lx26-amd64  16    0.00   47.2G   762.6M  1000.0M  3.4M
c0201     lx26-amd64  16    0.11   47.2G   703.1M  1000.0M  7.9M
c0202     lx26-amd64  16    18.52  47.2G   14.1G   1000.0M  5.7M
c0203     lx26-amd64  16    0.07   47.2G   704.4M  1000.0M  0.0
c0204     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0205     lx26-amd64  16    0.04   47.2G   711.0M  1000.0M  0.0
c0206     lx26-amd64  16    0.03   47.2G   703.3M  1000.0M  0.0
c0207     lx26-amd64  16    17.99  47.2G   792.0M  1000.0M  4.4M
c0208     lx26-amd64  16    0.02   47.2G   734.5M  1000.0M  4.3M
c0209     lx26-amd64  16    0.03   47.2G   783.3M  1000.0M  0.0
c0210     lx26-amd64  16    0.03   47.2G   698.8M  1000.0M  0.0
c0211     lx26-amd64  16    0.02   47.2G   741.7M  1000.0M  3.5M
c0212     lx26-amd64  16    0.00   47.2G   765.1M  1000.0M  2.0M
c0213     lx26-amd64  16    0.02   47.2G   700.8M  1000.0M  0.0
c0214     lx26-amd64  16    0.02   47.2G   747.5M  1000.0M  0.0
c0215     lx26-amd64  16    0.03   47.2G   712.9M  1000.0M  0.0
c0216     lx26-amd64  16    0.02   47.2G   698.0M  1000.0M  0.0
c0217     lx26-amd64  16    0.00   47.2G   778.4M  1000.0M  0.0
c0218     lx26-amd64  16    0.02   47.2G   699.7M  1000.0M  0.0
c0219     lx26-amd64  16    0.01   47.2G   749.8M  1000.0M  4.4M
c0220     lx26-amd64  32    0.05   47.2G   765.0M  1000.0M  0.0
c0301     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0302     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0303     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0304     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0305     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0306     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0307     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0308     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0309     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0310     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0311     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0312     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0313     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0314     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0315     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0316     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0317     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0318     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0319     lx26-amd64  16    -      47.2G   -       1000.0M  -
c0320     lx26-amd64  16    -      47.2G   -       1000.0M  -
g0101     -           -     -      -       -       -        -
g0102     -           -     -      -       -       -        -
g0103     -           -     -      -       -       -        -
g0104     -           -     -      -       -       -        -
g0105     -           -     -      -       -       -        -
g0201     -           -     -      -       -       -        -
g0202     -           -     -      -       -       -        -
smp01     lx26-amd64  64    33.86  504.8G  284.2G  1000.0M  0.0
smp02     lx26-amd64  32    1.47   504.8G  312.1G  1000.0M  96.7M
smp03     lx26-amd64  64    0.20   473.2G  70.2G   1000.0M  999.5M