sebhtml closed this issue 10 years ago
/rap/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150
$ ls -lh *.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 22 2012 sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 18G Nov 21 2012 sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr--+ 1 sboisver12 nne-790-01 19G Nov 21 2012 sorted_S1_L002_R2_002.fastq.gz
$ cat HiSeq-2500-NA12878-demo-2x150.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-1
#PBS -o HiSeq-2500-NA12878-demo-2x150-1.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-1.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8
cd $PBS_O_WORKDIR
module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/Ray/2.3.0-devel-b3e6b07764f71318408de5fbe632a41ae29c2105-1
mpiexec -n 512 \
Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-1 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
$ msub HiSeq-2500-NA12878-demo-2x150.sh
10446216
The backtracking code fails on loops:
$ grep -i Warning HiSeq-2500-NA12878-demo-2x150-1.stdout |head
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 1695000475 pathName 1695000475
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 4829000080 pathName 4829000080
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 3302000175 pathName 3302000175
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 5970000152 pathName 5970000152
DEBUG Warning, backtrackPath yielded (expected >= 3)1 1
DEBUG Warning backtrackPath failed m_seedName 5144000241 pathName 5144000241
No errors are reported, though:
$ grep -i Error HiSeq-2500-NA12878-demo-2x150-1.stdout |wc -l
0
To do: check the logs for HiSeq-2500-NA12878-demo-2x150-1, because r101-n57 failed but the MPI job continued.
See charts here: http://dskernel.blogspot.ca/2013/09/debugging-mpi-application-is-sometimes.html
Located in
colosse:/rap/nne-790-ab/projects/seb/tests-Titan-datasets
I need to avoid these repeats:
$ grep m_visitedVertices HiSeq-2500-NA12878-demo-2x150-2.1.043|awk '{print $5}'|sort -r -n > HiSeq-2500-NA12878-demo-2x150-2.1.043.vertices
$ head HiSeq-2500-NA12878-demo-2x150-2.1.043.vertices
1393
1393
1392
1392
1392
1391
1391
1391
1391
1391
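Tallying the values shows how often each visited-vertex count recurs, which is easier to read than the sorted list. This is a minimal sketch, assuming the .vertices file holds one integer per line as in the head output; the function name count_repeats is hypothetical.

```shell
#!/bin/bash
# count_repeats (hypothetical helper): read one integer per line on stdin and
# print "<occurrences> <value>" for each distinct value, largest value first.
count_repeats() {
    sort -n -r | uniq -c | awk '{print $1, $2}'
}

# Example with the ten values shown by head.
printf '%s\n' 1393 1393 1392 1392 1392 1391 1391 1391 1391 1391 | count_repeats
# prints:
# 2 1393
# 3 1392
# 5 1391
```

On the real data this would be run as: count_repeats < HiSeq-2500-NA12878-demo-2x150-2.1.043.vertices | head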
Job HiSeq-2500-NA12878-demo-2x150-4, run with -run-profiler -with-profiler-details:
https://portail.calculquebec.ca/common/report/myjobs/colosse/10455352/0/#
$ msub HiSeq-2500-NA12878-demo-2x150.sh
10460821
Job *-5 crashed without reporting any error.
I think something may be wrong with colosse. To check whether that is the case, I started a job on mp2. #198
Job *-6:
$ cat HiSeq-2500-NA12878-demo-2x150-6.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-6
#PBS -o HiSeq-2500-NA12878-demo-2x150-6.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-6.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
###########PBS -l walltime=00:03:00:00
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR
module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0
mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-6 \
apps/ray/885e3010ccdb587e84b3d43f7a5e598b8f187c6f/Ray \
$Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-6 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
#-run-profiler -with-profiler-details \
$ msub HiSeq-2500-NA12878-demo-2x150-6.sh
10461958
Job *-6 stalls here:
$ tail HiSeq-2500-NA12878-demo-2x150-6.1.018
Rank 18 processWorkerResult 4000/9231
Rank 18 processWorkerResult 4100/9231
Rank 18 processWorkerResult 4200/9231
Rank 18 processWorkerResult 4300/9231
Rank 18 processWorkerResult 4400/9231
Rank 18 processWorkerResult 4500/9231
Rank 18 processWorkerResult 4600/9231
Rank 18 processWorkerResult 4700/9231
Rank 18 processWorkerResult 4800/9231
Rank 18 processWorkerResult 4900/9231
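With 512 ranks, checking each per-rank log by hand is slow. The sketch below prints the last processWorkerResult line of every log file passed to it, so stalled ranks can be compared side by side. It assumes the per-rank files follow the naming seen here (for example HiSeq-2500-NA12878-demo-2x150-6.1.018); the function name last_progress is hypothetical.

```shell
#!/bin/bash
# last_progress (hypothetical helper): for each per-rank log file given as an
# argument, print the file name, a tab, and the last line of that file that
# contains "processWorkerResult" (empty if the file has no such line).
last_progress() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep processWorkerResult "$f" | tail -n 1)"
    done
}

# Example (file pattern assumed from the log name above):
# last_progress HiSeq-2500-NA12878-demo-2x150-6.1.*
```

Running it twice a few minutes apart and comparing the two outputs with diff would show which ranks are no longer advancing.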
$ ssh r101-n60
$ top -n1 -b
top - 09:25:53 up 15 days, 5:08, 1 user, load average: 8.09, 8.03, 8.00
Tasks: 221 total, 9 running, 212 sleeping, 0 stopped, 0 zombie
Cpu(s): 86.1%us, 1.2%sy, 0.0%ni, 12.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24735700k total, 24647752k used, 87948k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 3635880k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10904 sboisver 20 0 2612m 2.4g 10m R 101.0 10.0 429:11.06 Ray
10905 sboisver 20 0 2611m 2.4g 10m R 101.0 10.0 429:33.57 Ray
10906 sboisver 20 0 2670m 2.4g 10m R 101.0 10.2 426:43.75 Ray
10910 sboisver 20 0 2655m 2.4g 10m R 101.0 10.2 429:26.96 Ray
10911 sboisver 20 0 2621m 2.4g 10m R 101.0 10.0 429:12.52 Ray
10907 sboisver 20 0 2617m 2.4g 9920 R 99.1 10.0 429:21.40 Ray
10908 sboisver 20 0 2622m 2.4g 9m R 99.1 10.0 426:34.27 Ray
10909 sboisver 20 0 2647m 2.4g 10m R 99.1 10.1 428:34.10 Ray
To do: add memory usage reporting to -run-profiler in RayPlatform.
I will profile memory usage with RayPlatform:
$ cat HiSeq-2500-NA12878-demo-2x150-7.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-7
#PBS -o HiSeq-2500-NA12878-demo-2x150-7.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-7.stderr
#PBS -A nne-790-ac
#PBS -l walltime=00:03:00:00
#PBS -l nodes=64:ppn=8
cd $PBS_O_WORKDIR
module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0
mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-7 \
apps/ray/f620d24a1a99de081e27102d6a1680ceaae94a8b-1/Ray \
$Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-7 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-debug \
$ msub HiSeq-2500-NA12878-demo-2x150-7.sh
10462315
This sounds like a memory leak.
From HiSeq-2500-NA12878-demo-2x150-7.1.178:
Marker 1:
Rank 178 processWorkerResult 0/8987
[/dev/actor/rank/178] [RayPlatform] epoch ends at 227786 ms ! (tick # 142726239), length is 100 ms, VmData is 1081780 KiB
Rank 178: RAY_SLAVE_MODE_MERGE_SEEDS Time= 227.79 s Speed= 43592 Sent= 1165 (processMessages: 848, processData: 317) Received= 1165 Balance= 0
Marker 2:
Rank 178 processWorkerResult 100/8987
[/dev/actor/rank/178] [RayPlatform] epoch ends at 290586 ms ! (tick # 176431420), length is 100 ms, VmData is 1235564 KiB
Rank 178: RAY_SLAVE_MODE_MERGE_SEEDS Time= 290.59 s Speed= 36811 Sent= 1006 (processMessages: 759, processData: 247) Received= 1005 Balance= 1
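In about 1 minute (62.8 s between the two markers), VmData grew from 1081780 KiB to 1235564 KiB, so roughly 150 MiB got allocated (or were not freed, if it is a memory leak).
Another strange thing: the profiler is supposed to print its report every 100 ms, I think.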
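The growth between two markers can be computed directly from the "epoch ends at" and "VmData is" fields of the profiler lines. This is a minimal sketch, assuming the line format shown in the two markers; the function name vmdata_growth is hypothetical.

```shell
#!/bin/bash
# vmdata_growth (hypothetical helper): read RayPlatform profiler lines on stdin
# and print the elapsed time and the VmData change between the first and the
# last line that contains both "epoch ends at <N> ms" and "VmData is <N> KiB".
vmdata_growth() {
    awk '
    /epoch ends at/ && /VmData is/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "at") t = $(i + 1)       # epoch end time, in ms
            if ($i == "VmData") v = $(i + 2)   # VmData, in KiB
        }
        if (!seen) { t0 = t; v0 = v; seen = 1 }
    }
    END {
        if (seen)
            printf "%.1f s, %d KiB (%.1f MiB)\n", (t - t0) / 1000, v - v0, (v - v0) / 1024
    }'
}

# Example with the two profiler lines from rank 178.
vmdata_growth <<'EOF'
[/dev/actor/rank/178] [RayPlatform] epoch ends at 227786 ms ! (tick # 142726239), length is 100 ms, VmData is 1081780 KiB
[/dev/actor/rank/178] [RayPlatform] epoch ends at 290586 ms ! (tick # 176431420), length is 100 ms, VmData is 1235564 KiB
EOF
# prints: 62.8 s, 153784 KiB (150.2 MiB)
```

Piping a whole per-rank log through the function (for example, vmdata_growth < HiSeq-2500-NA12878-demo-2x150-7.1.178) would give the total growth over the run rather than between two hand-picked markers.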
Debug run showing the number of vertices visited:
$ msub HiSeq-2500-NA12878-demo-2x150-8.sh
10466848
$ msub HiSeq-2500-NA12878-demo-2x150-9.sh
10466885
/rap/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150 (145 GB, fastq.gz)