sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Do a test run for HiSeq-2500-NA12878-demo-2x150 on colosse #195

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

/rap/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150 (145 GB, fastq.gz)

sebhtml commented 10 years ago

/rap/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150

$ ls -lh *.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 22 2012 sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L002_R2_002.fastq.gz

sebhtml commented 10 years ago
$ cat HiSeq-2500-NA12878-demo-2x150.sh 
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-1
#PBS -o HiSeq-2500-NA12878-demo-2x150-1.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-1.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8

cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/Ray/2.3.0-devel-b3e6b07764f71318408de5fbe632a41ae29c2105-1

mpiexec -n 512 \
Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-1 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \

sebhtml commented 10 years ago

$ msub HiSeq-2500-NA12878-demo-2x150.sh

10446216
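
The job script requests nodes=64:ppn=8 and launches mpiexec -n 512, so the two numbers have to agree. A minimal sketch of that sanity check, assuming the resource string has exactly the nodes=N:ppn=P shape; the function name ranks_from_request is made up for illustration and is not part of Ray or Moab:

```shell
# Hypothetical helper: compute the MPI rank count implied by a
# resource request of the form nodes=N:ppn=P.
ranks_from_request() {
    # Splitting on "=" and ":" turns nodes=64:ppn=8 into the fields
    # nodes, 64, ppn, 8; the product of fields 2 and 4 is the rank count.
    printf '%s\n' "$1" | awk -F'[=:]' '{ print $2 * $4 }'
}

# Request used in the job script above.
ranks_from_request "nodes=64:ppn=8"
```

The printed value can be compared with the -n argument given to mpiexec before submitting the job.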

sebhtml commented 10 years ago

The backtracking code fails on loops:

$ grep -i Warning HiSeq-2500-NA12878-demo-2x150-1.stdout |head
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 1695000475 pathName 1695000475
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 4829000080 pathName 4829000080
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 3302000175 pathName 3302000175
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 5970000152 pathName 5970000152
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 5144000241 pathName 5144000241
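
To see how widespread the failures are, the distinct seeds named in the "backtrackPath failed" lines can be counted. A rough sketch, assuming the log keeps the "m_seedName <id>" layout shown in the grep output above; count_failed_seeds is a hypothetical helper name:

```shell
# Hypothetical helper: read Ray stdout on standard input and print the
# number of distinct seeds for which backtrackPath failed.
count_failed_seeds() {
    awk '/backtrackPath failed/ {
             # The seed identifier is the field right after "m_seedName".
             for (i = 1; i < NF; i++)
                 if ($i == "m_seedName") seen[$(i + 1)] = 1
         }
         END { n = 0; for (s in seen) n++; print n }'
}

# Usage with lines copied from the grep output above.
count_failed_seeds <<'EOF'
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 1695000475 pathName 1695000475
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 4829000080 pathName 4829000080
EOF
```

On the real run the function would read the whole stdout file, for example count_failed_seeds < HiSeq-2500-NA12878-demo-2x150-1.stdout.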

sebhtml commented 10 years ago

no errors !!!

$ grep -i Error HiSeq-2500-NA12878-demo-2x150-1.stdout |wc -l
0

sebhtml commented 10 years ago

Check the logs for HiSeq-2500-NA12878-demo-2x150-1: r101-n57 failed, but the MPI job continued.

See charts here: http://dskernel.blogspot.ca/2013/09/debugging-mpi-application-is-sometimes.html

sebhtml commented 10 years ago

Located in

colosse:/rap/nne-790-ab/projects/seb/tests-Titan-datasets

sebhtml commented 10 years ago

I need to avoid these repeats:

$ grep m_visitedVertices HiSeq-2500-NA12878-demo-2x150-2.1.043|awk '{print $5}'|sort -r -n > HiSeq-2500-NA12878-demo-2x150-2.1.043.vertices
$ head HiSeq-2500-NA12878-demo-2x150-2.1.043.vertices
1393
1393
1392
1392
1392
1391
1391
1391
1391
1391
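
A quick way to quantify the repeats is to count how often each m_visitedVertices value occurs. This is only a sketch over the sorted values shown above; count_repeats is a hypothetical helper name:

```shell
# Hypothetical helper: read one vertex count per line and print
# "value occurrences" pairs, largest value first.
count_repeats() {
    # uniq -c needs sorted input; the final awk swaps the columns and
    # drops the padding that uniq -c puts in front of the count.
    sort -n -r | uniq -c | awk '{ print $2, $1 }'
}

# The ten values from the head output above.
printf '%s\n' 1393 1393 1392 1392 1392 1391 1391 1391 1391 1391 | count_repeats
```

Run against the full .vertices file, the same function shows which path lengths are visited over and over.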

sebhtml commented 10 years ago

with -run-profiler -with-profiler-details

sebhtml commented 10 years ago

HiSeq-2500-NA12878-demo-2x150-4 https://portail.calculquebec.ca/common/report/myjobs/colosse/10455352/0/#

sebhtml commented 10 years ago

$ msub HiSeq-2500-NA12878-demo-2x150.sh

10460821

sebhtml commented 10 years ago

Job *-5 crashed without reporting any error.

I think something may be wrong with colosse. To check if it is the case, I started a job on mp2. #198

sebhtml commented 10 years ago

Job *-6:

$ cat HiSeq-2500-NA12878-demo-2x150-6.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-6
#PBS -o HiSeq-2500-NA12878-demo-2x150-6.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-6.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
###########PBS -l walltime=00:03:00:00
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea

cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0

mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-6 \
apps/ray/885e3010ccdb587e84b3d43f7a5e598b8f187c6f/Ray \
$Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-6 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \

#-run-profiler -with-profiler-details \

$ msub HiSeq-2500-NA12878-demo-2x150-6.sh

10461958

sebhtml commented 10 years ago

Job *-6 stalls here:

$ tail  HiSeq-2500-NA12878-demo-2x150-6.1.018
Rank 18 processWorkerResult 4000/9231
Rank 18 processWorkerResult 4100/9231
Rank 18 processWorkerResult 4200/9231
Rank 18 processWorkerResult 4300/9231
Rank 18 processWorkerResult 4400/9231
Rank 18 processWorkerResult 4500/9231
Rank 18 processWorkerResult 4600/9231
Rank 18 processWorkerResult 4700/9231
Rank 18 processWorkerResult 4800/9231
Rank 18 processWorkerResult 4900/9231
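
To spot a stalled rank, the last processWorkerResult line of each rank file can be turned into a percentage and compared between two points in time. A minimal sketch, assuming the "Rank N processWorkerResult done/total" layout shown above; last_progress is a hypothetical helper name:

```shell
# Hypothetical helper: read a rank log on standard input and print the
# progress from its last "processWorkerResult done/total" line.
last_progress() {
    awk '$3 == "processWorkerResult" {
             split($4, parts, "/")     # the fourth field looks like 4900/9231
             cur = parts[1]; total = parts[2]
         }
         END {
             if (total + 0 > 0)
                 printf "%d/%d (%.1f%%)\n", cur, total, 100 * cur / total
         }'
}

# Last two lines of the tail output above.
printf '%s\n' \
    "Rank 18 processWorkerResult 4800/9231" \
    "Rank 18 processWorkerResult 4900/9231" | last_progress
```

If the value printed for a rank file stays the same over several minutes, that rank is probably stuck.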
sebhtml commented 10 years ago

$ ssh r101-n60

top:

top -n1 -b

top - 09:25:53 up 15 days,  5:08,  1 user,  load average: 8.09, 8.03, 8.00
Tasks: 221 total,   9 running, 212 sleeping,   0 stopped,   0 zombie
Cpu(s): 86.1%us,  1.2%sy,  0.0%ni, 12.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24735700k total, 24647752k used,    87948k free,        0k buffers
Swap:        0k total,        0k used,        0k free,  3635880k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
10904 sboisver  20   0 2612m 2.4g  10m R 101.0 10.0 429:11.06 Ray               
10905 sboisver  20   0 2611m 2.4g  10m R 101.0 10.0 429:33.57 Ray               
10906 sboisver  20   0 2670m 2.4g  10m R 101.0 10.2 426:43.75 Ray               
10910 sboisver  20   0 2655m 2.4g  10m R 101.0 10.2 429:26.96 Ray               
10911 sboisver  20   0 2621m 2.4g  10m R 101.0 10.0 429:12.52 Ray               
10907 sboisver  20   0 2617m 2.4g 9920 R 99.1 10.0 429:21.40 Ray                
10908 sboisver  20   0 2622m 2.4g   9m R 99.1 10.0 426:34.27 Ray                
10909 sboisver  20   0 2647m 2.4g  10m R 99.1 10.1 428:34.10 Ray   
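
The top summary shows the node almost out of memory, with no swap configured. A rough sketch that subtracts buffers and page cache from the "used" figure, assuming the older procps header layout printed above; used_without_cache is a hypothetical helper name, and the result is only an estimate of what the processes themselves hold:

```shell
# Hypothetical helper: read `top -n1 -b` output on standard input and print
# used memory minus buffers and page cache, in KiB.
used_without_cache() {
    # "24647752k" + 0 makes awk keep the leading number and drop the "k".
    awk '$1 == "Mem:"  { used = $4 + 0; buffers = $8 + 0 }
         $1 == "Swap:" { cached = $8 + 0 }
         END { print used - buffers - cached }'
}

# The two summary lines from the top output above.
used_without_cache <<'EOF'
Mem:  24735700k total, 24647752k used,    87948k free,        0k buffers
Swap:        0k total,        0k used,        0k free,  3635880k cached
EOF
```

Eight Ray processes at about 2.4g resident each account for most of that figure.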
sebhtml commented 10 years ago

Add memory usage to the -run-profiler output in RayPlatform.

sebhtml commented 10 years ago

I will profile memory usage with RayPlatform:

$ cat HiSeq-2500-NA12878-demo-2x150-7.sh                                                                                                               
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-7
#PBS -o HiSeq-2500-NA12878-demo-2x150-7.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-7.stderr
#PBS -A nne-790-ac
#PBS -l walltime=00:03:00:00
#PBS -l nodes=64:ppn=8

cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0

mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-7 \
apps/ray/f620d24a1a99de081e27102d6a1680ceaae94a8b-1/Ray \
$Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-7 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-debug \

$ msub HiSeq-2500-NA12878-demo-2x150-7.sh

10462315

sebhtml commented 10 years ago

This sounds like a memory leak.

from HiSeq-2500-NA12878-demo-2x150-7.1.178

Marker 1:

Rank 178 processWorkerResult 0/8987
[/dev/actor/rank/178] [RayPlatform] epoch ends at 227786 ms ! (tick # 142726239), length is 100 ms, VmData is 1081780 KiB
Rank 178: RAY_SLAVE_MODE_MERGE_SEEDS Time= 227.79 s Speed= 43592 Sent= 1165 (processMessages: 848, processData: 317) Received= 1165 Balance= 0

Marker 2:

Rank 178 processWorkerResult 100/8987
[/dev/actor/rank/178] [RayPlatform] epoch ends at 290586 ms ! (tick # 176431420), length is 100 ms, VmData is 1235564 KiB
Rank 178: RAY_SLAVE_MODE_MERGE_SEEDS Time= 290.59 s Speed= 36811 Sent= 1006 (processMessages: 759, processData: 247) Received= 1005 Balance= 1

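Between the two markers (about 63 seconds apart), VmData grew by 153784 KiB, roughly 150 MiB. Either that memory was allocated, or it was never freed if this is a memory leak.

Another strange thing: the profiler is supposed to emit its report every 100 ms, I think.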
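
The growth between two epoch reports can be computed straight from the log instead of by hand. A minimal sketch, assuming the "epoch ends at <ms> ms ... VmData is <KiB> KiB" wording of the RayPlatform lines quoted in the markers above; vmdata_growth is a hypothetical helper name:

```shell
# Hypothetical helper: read RayPlatform epoch reports on standard input and
# print the VmData growth between the first and the last report.
vmdata_growth() {
    awk '/epoch ends at/ {
             for (i = 1; i < NF; i++) {
                 if ($i == "at" && $(i + 2) == "ms") ms = $(i + 1)
                 if ($i == "VmData" && $(i + 1) == "is") kib = $(i + 2)
             }
             if (first_ms == "") { first_ms = ms; first_kib = kib }
             last_ms = ms; last_kib = kib
         }
         END {
             d_kib = last_kib - first_kib
             d_ms = last_ms - first_ms
             printf "%d KiB (%.1f MiB) in %.1f s\n", d_kib, d_kib / 1024, d_ms / 1000
         }'
}

# The two epoch reports from Marker 1 and Marker 2.
vmdata_growth <<'EOF'
[/dev/actor/rank/178] [RayPlatform] epoch ends at 227786 ms ! (tick # 142726239), length is 100 ms, VmData is 1081780 KiB
[/dev/actor/rank/178] [RayPlatform] epoch ends at 290586 ms ! (tick # 176431420), length is 100 ms, VmData is 1235564 KiB
EOF
```

Feeding a whole rank file to the function gives the growth over the entire run, which helps tell a steady leak from a one-off allocation.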

sebhtml commented 10 years ago

Debug run showing the number of vertices visited:

$ msub HiSeq-2500-NA12878-demo-2x150-8.sh

10466848

sebhtml commented 10 years ago

$ msub HiSeq-2500-NA12878-demo-2x150-9.sh

10466885

sebhtml commented 10 years ago

PASS

http://dskernel.blogspot.ca/2013/10/the-polytope-router-and-human-genomes.html