jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

Failed to run with mpi, thanks! #583

Open lamz138138 opened 8 years ago

lamz138138 commented 8 years ago

Hi!

I have been learning to use SGE these days. I tried to submit a job that runs in parallel across different nodes, since it needs a lot of memory. Following the guide, the job stays in the 'qw' state. After two days of googling I still don't know how to solve it; any suggestion would be appreciated!

Here is what I have done, along with some information I think may be useful:

  1. compile the MPI test programs:

     cd $HOME
     cp /opt/mpi-tests/src/*.c .
     cp /opt/mpi-tests/src/Makefile .
     make
  2. edit my script "soap.qsub":

    #!/bin/bash
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash

    /opt/openmpi/bin/mpirun myCommand

  3. submit the job:

     qsub -pe orte 30 soap.qsub
  4. monitor the job:

     lcy01@console /data7/lcy/zhongxm/NGS $ qstat
     job-ID  prior    name       user   state  submit/start at       queue  slots  ja-task-ID
     3842    0.60500  soap.qsub  lcy01  qw     05/10/2016 15:00:11          16
  5. the information about the job:

     lcy01@console /data7/lcy/zhongxm/NGS $ qstat -j 3842
     job_number:           3842
     exec_file:            job_scripts/3842
     submission_time:      Tue May 10 15:00:11 2016
     owner:                lcy01
     uid:                  514
     group:                mobile
     gid:                  505
     sge_o_home:           /data8/lcy01
     sge_o_log_name:       lcy01
     sge_o_path:           /data8/lcy01/bin:/data8/lcy01/script:/data8/lcy01/tools/root/bin:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/gridengine/bin/lx26-amd64:/data8/lcy01/bin
     sge_o_shell:          /bin/bash
     sge_o_workdir:        /data7/lcy/zhongxm/NGS/assembly/soap/demo
     sge_o_host:           console
     account:              sge
     cwd:                  /data7/lcy/zhongxm/NGS/assembly/soap/demo
     merge:                y
     mail_list:            lcy01@console.local
     notify:               FALSE
     job_name:             soap.qsub
     jobshare:             0
     shell_list:           NONE:/bin/bash
     env_list:
     script_file:          soap.qsub
     parallel environment: orte range: 16
     scheduling info:      queue instance "all.q@c0103.local" dropped because it is temporarily not available
                           queue instance "all.q@c0213.local" dropped because it is temporarily not available
                           queue instance "all.q@c0216.local" dropped because it is temporarily not available
                           queue instance "all.q@c0203.local" dropped because it is temporarily not available
                           queue instance "all.q@c0206.local" dropped because it is temporarily not available
                           queue instance "all.q@c0205.local" dropped because it is temporarily not available
                           queue instance "all.q@c0210.local" dropped because it is temporarily not available
                           queue instance "all.q@smp02.local" dropped because it is temporarily not available
                           queue instance "all.q@smp01.local" dropped because it is temporarily not available
                           queue instance "all.q@smp03.local" dropped because it is temporarily not available
                           queue instance "all.q@c0218.local" dropped because it is temporarily not available
                           queue instance "all.q@c0204.local" dropped because it is temporarily not available
                           cannot run in queue "all.q" because PE "orte" is not in pe list
                           cannot run in PE "orte" because it only offers 0 slots
  6. related information:

     lcy01@console /data7/lcy/zhongxm/NGS $ qconf -sql
     all.q

     lcy01@console /data7/lcy/zhongxm/NGS $ qconf -spl
     mpi
     mpich
     orte

     lcy01@console /data7/lcy/zhongxm/NGS $ qconf -sp orte
     pe_name            orte
     slots              9999
     user_lists         NONE
     xuser_lists        NONE
     start_proc_args    /bin/true
     stop_proc_args     /bin/true
     allocation_rule    $fill_up
     control_slaves     TRUE
     job_is_first_task  FALSE
     urgency_slots      min
     accounting_summary TRUE

     lcy01@console /data7/lcy/zhongxm/NGS $ qhost
     HOSTNAME  ARCH        NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO   SWAPUS
     global    -           -     -      -       -       -        -
     c0102     lx26-amd64  16    0.02   47.2G   700.3M  1000.0M  4.5M
     c0103     lx26-amd64  16    0.01   47.2G   694.0M  1000.0M  0.0
     c0104     lx26-amd64  16    0.03   47.2G   733.8M  1000.0M  2.7M
     c0105     lx26-amd64  16    0.00   47.2G   762.6M  1000.0M  3.4M
     c0201     lx26-amd64  16    0.11   47.2G   703.1M  1000.0M  7.9M
     c0202     lx26-amd64  16    18.52  47.2G   14.1G   1000.0M  5.7M
     c0203     lx26-amd64  16    0.07   47.2G   704.4M  1000.0M  0.0
     c0204     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0205     lx26-amd64  16    0.04   47.2G   711.0M  1000.0M  0.0
     c0206     lx26-amd64  16    0.03   47.2G   703.3M  1000.0M  0.0
     c0207     lx26-amd64  16    17.99  47.2G   792.0M  1000.0M  4.4M
     c0208     lx26-amd64  16    0.02   47.2G   734.5M  1000.0M  4.3M
     c0209     lx26-amd64  16    0.03   47.2G   783.3M  1000.0M  0.0
     c0210     lx26-amd64  16    0.03   47.2G   698.8M  1000.0M  0.0
     c0211     lx26-amd64  16    0.02   47.2G   741.7M  1000.0M  3.5M
     c0212     lx26-amd64  16    0.00   47.2G   765.1M  1000.0M  2.0M
     c0213     lx26-amd64  16    0.02   47.2G   700.8M  1000.0M  0.0
     c0214     lx26-amd64  16    0.02   47.2G   747.5M  1000.0M  0.0
     c0215     lx26-amd64  16    0.03   47.2G   712.9M  1000.0M  0.0
     c0216     lx26-amd64  16    0.02   47.2G   698.0M  1000.0M  0.0
     c0217     lx26-amd64  16    0.00   47.2G   778.4M  1000.0M  0.0
     c0218     lx26-amd64  16    0.02   47.2G   699.7M  1000.0M  0.0
     c0219     lx26-amd64  16    0.01   47.2G   749.8M  1000.0M  4.4M
     c0220     lx26-amd64  32    0.05   47.2G   765.0M  1000.0M  0.0
     c0301     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0302     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0303     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0304     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0305     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0306     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0307     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0308     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0309     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0310     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0311     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0312     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0313     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0314     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0315     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0316     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0317     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0318     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0319     lx26-amd64  16    -      47.2G   -       1000.0M  -
     c0320     lx26-amd64  16    -      47.2G   -       1000.0M  -
     g0101     -           -     -      -       -       -        -
     g0102     -           -     -      -       -       -        -
     g0103     -           -     -      -       -       -        -
     g0104     -           -     -      -       -       -        -
     g0105     -           -     -      -       -       -        -
     g0201     -           -     -      -       -       -        -
     g0202     -           -     -      -       -       -        -
     smp01     lx26-amd64  64    33.86  504.8G  284.2G  1000.0M  0.0
     smp02     lx26-amd64  32    1.47   504.8G  312.1G  1000.0M  96.7M
     smp03     lx26-amd64  64    0.20   473.2G  70.2G   1000.0M  999.5M
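For anyone hitting the same wall: the last two lines of the scheduling info point at the actual blocker, not the "temporarily not available" queue instances. `all.q` does not have `orte` in its `pe_list`, so the PE offers 0 slots and the job can never leave `qw`. A minimal sketch of how a cluster admin could attach the PE to the queue (assumes SGE manager privileges; the queue name `all.q` and PE name `orte` are taken from the output above):

```shell
# Show which PEs all.q currently accepts (look at the pe_list line)
qconf -sq all.q | grep pe_list

# Append "orte" to all.q's pe_list (requires SGE manager privileges)
qconf -aattr queue pe_list orte all.q
```

Without admin rights, an alternative is to resubmit against a PE that `all.q` already lists in `pe_list`, e.g. `qsub -pe mpi 16 soap.qsub` if `mpi` is attached to the queue.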

machbio commented 8 years ago

@lamz138138 please paste the output logs for your job.