sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

mini-ranks fail on large jobs, easy to fix #111

Closed sebhtml closed 11 years ago

sebhtml commented 11 years ago

    Ray: code/plugin_Mock/Parameters.cpp:1811: Rank Parameters::getRankFromGlobalId(ReadHandle): Assertion `rank>=0' failed.
    [r101-n88:27376] *** Process received signal ***
    [r101-n88:27376] Signal: Aborted (6)
    [r101-n88:27376] Signal code: (-6)
    [r101-n88:27376] [ 0] /lib64/libpthread.so.0 [0x7fd995b6dbe0]
    [r101-n88:27376] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x7fd995838285]
    [r101-n88:27376] [ 2] /lib64/libc.so.6(abort+0x110) [0x7fd995839d30]
    [r101-n88:27376] [ 3] /lib64/libc.so.6(__assert_fail+0xf6) [0x7fd995831706]
    [r101-n88:27376] [ 4] Ray [0x4c5894]
    [r101-n88:27376] [ 5] Ray(_ZN15SequencesLoader16registerSequenceEv+0x421) [0x54e851]
    [r101-n88:27376] [ 6] Ray(_ZN15SequencesLoader34call_RAY_SLAVE_MODE_LOAD_SEQUENCESEv+0x6c6) [0x54f806]
    [r101-n88:27376] [ 7] Ray(_ZN11ComputeCore10runVanillaEv+0x133) [0x574263]

Error: rank is < 0, rank: -1444972293 ReadHandle: 18446744073705997812 elementsPerRank: 2614636
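
The negative rank in this message is what one would get by dividing the ReadHandle by elementsPerRank and truncating the quotient to a signed 32-bit integer. The sketch below checks that arithmetic in Python. The formula and the helper name `to_int32` are assumptions inferred from the logged values, not taken from the Ray source.

```python
# Values copied from the error message.
READ_HANDLE = 18446744073705997812
ELEMENTS_PER_RANK = 2614636
LOGGED_RANK = -1444972293


def to_int32(value):
    """Keep the low 32 bits of value and reinterpret them as a signed int."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# Assumption: rank = handle / elementsPerRank, stored in a 32-bit int.
quotient = READ_HANDLE // ELEMENTS_PER_RANK
rank = to_int32(quotient)

print(quotient)             # far larger than any valid rank
print(rank == LOGGED_RANK)  # does the truncated value match the log?
```

If that assumption holds, the negative rank is only a symptom: the underlying problem is the out-of-range ReadHandle.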

    $ cat Human-HiSeq-2500-2012-12-06-2/FilePartition.txt
    File  Name                                              FirstSequence  LastSequence  NumberOfSequences
    0     HiSeq2500-NA12878/sorted_S1_L001_R1_001.fastq.gz  0              143818692     143818693
    1     HiSeq2500-NA12878/sorted_S1_L001_R1_002.fastq.gz  143818693      293792112     149973420
    2     HiSeq2500-NA12878/sorted_S1_L001_R2_001.fastq.gz  293792113      437610805     143818693
    3     HiSeq2500-NA12878/sorted_S1_L001_R2_002.fastq.gz  437610806      587584225     149973420
    4     HiSeq2500-NA12878/sorted_S1_L002_R1_001.fastq.gz  587584226      731879531     144295306
    5     HiSeq2500-NA12878/sorted_S1_L002_R1_002.fastq.gz  731879532      879470762     147591231
    6     HiSeq2500-NA12878/sorted_S1_L002_R2_001.fastq.gz  879470763      1023766068    144295306
    7     HiSeq2500-NA12878/sorted_S1_L002_R2_002.fastq.gz  1023766069     1171357299    147591231
    $

sebhtml commented 11 years ago

    $ cat Run.sh
    #!/bin/bash
    #PBS -S /bin/bash
    #PBS -N Human-HiSeq-2500-2012-12-06-3
    #PBS -o Human-HiSeq-2500-2012-12-06-3.stdout
    #PBS -e Human-HiSeq-2500-2012-12-06-3.stderr
    #PBS -A nne-790-ab
    #PBS -l walltime=48:00:00
    #PBS -l qos=SPJ1024
    #PBS -l nodes=64:ppn=8

    cd $PBS_O_WORKDIR

    source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh

    mpiexec -n 64 -bynode \
    Ray -mini-ranks-per-rank 7 \
    -o \
    Human-HiSeq-2500-2012-12-06-3 \
    -k \
    31 \
    -p HiSeq2500-NA12878/sorted_S1_L001_R1_001.fastq.gz \
    HiSeq2500-NA12878/sorted_S1_L001_R1_002.fastq.gz \
    -p HiSeq2500-NA12878/sorted_S1_L001_R2_001.fastq.gz \
    HiSeq2500-NA12878/sorted_S1_L001_R2_002.fastq.gz \
    -p HiSeq2500-NA12878/sorted_S1_L002_R1_001.fastq.gz \
    HiSeq2500-NA12878/sorted_S1_L002_R1_002.fastq.gz \
    -p HiSeq2500-NA12878/sorted_S1_L002_R2_001.fastq.gz \
    HiSeq2500-NA12878/sorted_S1_L002_R2_002.fastq.gz \

sebhtml commented 11 years ago

Which rank is it?

sebhtml commented 11 years ago

ReadHandle: 18446744073705997812

2^64-1 is 18446744073709551615

18446744073709551615-18446744073705997812 => 3553803

but elementsPerRank is 2614636
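
Put differently, the handle equals 2^64 - 3553804, which is the bit pattern of -3553804 stored in an unsigned 64-bit integer (the subtraction above starts from 2^64-1, hence the difference of one). A minimal sketch follows. It emulates `uint64_t` with a modulo, since Python integers do not wrap, and the helper names are hypothetical.

```python
UINT64_MOD = 1 << 64

READ_HANDLE = 18446744073705997812  # from the error message


def as_uint64(value):
    """Wrap an integer into the uint64_t range, as unsigned arithmetic would."""
    return value % UINT64_MOD


def as_int64(value):
    """Reinterpret a uint64_t bit pattern as a signed 64-bit integer."""
    value = as_uint64(value)
    return value - UINT64_MOD if value >= (1 << 63) else value


signed_handle = as_int64(READ_HANDLE)

print(signed_handle)                           # the handle read as a signed value
print(as_uint64(signed_handle) == READ_HANDLE) # wrapping it back gives the handle
```

A negative value whose magnitude exceeds elementsPerRank (2614636) looks more like an underflowing subtraction than an off-by-one between neighbouring ranks, although the thread does not confirm this at this point.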

sebhtml commented 11 years ago

    $ grep Assertion Human-HiSeq-2500-2012-12-06-4.stderr
    Ray: code/plugin_SequencesLoader/SequencesLoader.cpp:114: void SequencesLoader::registerSequence(): Assertion `leftSequenceGlobalId<m_totalNumberOfSequences' failed.

sebhtml commented 11 years ago

Human-HiSeq-2500-2012-12-06-8.stderr

Ray: code/plugin_SequencesLoader/SequencesLoader.cpp:125: void SequencesLoader::registerSequence(): Assertion `leftSequenceGlobalId<m_totalNumberOfSequences' failed.

Error: invalid ReadHandle object, leftSequenceGlobalId: 18446744073705997812 m_totalNumberOfSequences: 1171357300 rightSequenceGlobalId: 146419616 m_distribution_currentSequenceId 146419616 m_loader.size: 149973420 rightSequenceIdOnRank: 0 m_myReads->size: 1
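
The values in this message fit together: subtracting m_loader.size from rightSequenceGlobalId gives a negative number, and that number wraps to the reported leftSequenceGlobalId in unsigned 64-bit arithmetic. The sketch below checks this. That Ray derives the left identifier in exactly this way is an inference from the numbers, not something stated in the thread.

```python
UINT64_MOD = 1 << 64

# Values copied from the error message.
LEFT_ID = 18446744073705997812
RIGHT_ID = 146419616
LOADER_SIZE = 149973420
TOTAL_SEQUENCES = 1171357300

# Assumption: left id = right id - size of the file being loaded, in uint64_t.
difference = RIGHT_ID - LOADER_SIZE
wrapped = difference % UINT64_MOD  # emulate unsigned wraparound

print(difference)                 # negative before wrapping
print(wrapped == LEFT_ID)         # matches the logged leftSequenceGlobalId
print(wrapped < TOTAL_SEQUENCES)  # the condition checked by the failed assertion
```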

    $ less Human-HiSeq-2500-2012-12-06-8/FilePartition.txt
    File  Name                                              FirstSequence  LastSequence  NumberOfSequences
    0     HiSeq2500-NA12878/sorted_S1_L001_R1_001.fastq.gz  0              143818692     143818693
    1     HiSeq2500-NA12878/sorted_S1_L001_R1_002.fastq.gz  143818693      293792112     149973420
    2     HiSeq2500-NA12878/sorted_S1_L001_R2_001.fastq.gz  293792113      437610805     143818693
    3     HiSeq2500-NA12878/sorted_S1_L001_R2_002.fastq.gz  437610806      587584225     149973420
    4     HiSeq2500-NA12878/sorted_S1_L002_R1_001.fastq.gz  587584226      731879531     144295306
    5     HiSeq2500-NA12878/sorted_S1_L002_R1_002.fastq.gz  731879532      879470762     147591231
    6     HiSeq2500-NA12878/sorted_S1_L002_R2_001.fastq.gz  879470763      1023766068    144295306
    7     HiSeq2500-NA12878/sorted_S1_L002_R2_002.fastq.gz  1023766069     1171357299    147591231

    $ cat HiSeq2500-NA12878/Counts
    sorted_S1_L001_R1_001.fastq.gz  143818693
    sorted_S1_L001_R1_002.fastq.gz  149973420
    sorted_S1_L001_R2_001.fastq.gz  143818693
    sorted_S1_L001_R2_002.fastq.gz  149973420
    sorted_S1_L002_R1_001.fastq.gz  144295306
    sorted_S1_L002_R1_002.fastq.gz  147591231
    sorted_S1_L002_R2_001.fastq.gz  144295306
    sorted_S1_L002_R2_002.fastq.gz  147591231

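This is not a bug: the files are just not paired properly. Each -p option in Run.sh pairs two R1 files (or two R2 files) that contain different numbers of reads, for example sorted_S1_L001_R1_001.fastq.gz (143818693 reads) with sorted_S1_L001_R1_002.fastq.gz (149973420 reads).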

Need to add a better error message
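
One way to produce a clearer message would be to compare the read counts of the two files in each -p pair before loading anything. The sketch below does this for the counts listed above and the pairs from Run.sh. The function `check_pairs`, the message wording, and the R1/R2 pairing labelled as intended are all hypothetical and not part of Ray.

```python
# Read counts from HiSeq2500-NA12878/Counts (file names without the directory).
COUNTS = {
    "sorted_S1_L001_R1_001.fastq.gz": 143818693,
    "sorted_S1_L001_R1_002.fastq.gz": 149973420,
    "sorted_S1_L001_R2_001.fastq.gz": 143818693,
    "sorted_S1_L001_R2_002.fastq.gz": 149973420,
    "sorted_S1_L002_R1_001.fastq.gz": 144295306,
    "sorted_S1_L002_R1_002.fastq.gz": 147591231,
    "sorted_S1_L002_R2_001.fastq.gz": 144295306,
    "sorted_S1_L002_R2_002.fastq.gz": 147591231,
}

# Pairs in the order they were passed with -p in Run.sh.
PAIRS_AS_RUN = [
    ("sorted_S1_L001_R1_001.fastq.gz", "sorted_S1_L001_R1_002.fastq.gz"),
    ("sorted_S1_L001_R2_001.fastq.gz", "sorted_S1_L001_R2_002.fastq.gz"),
    ("sorted_S1_L002_R1_001.fastq.gz", "sorted_S1_L002_R1_002.fastq.gz"),
    ("sorted_S1_L002_R2_001.fastq.gz", "sorted_S1_L002_R2_002.fastq.gz"),
]

# Assumed intended pairing: each R1 file with the R2 file of the same chunk.
PAIRS_R1_R2 = [
    ("sorted_S1_L001_R1_001.fastq.gz", "sorted_S1_L001_R2_001.fastq.gz"),
    ("sorted_S1_L001_R1_002.fastq.gz", "sorted_S1_L001_R2_002.fastq.gz"),
    ("sorted_S1_L002_R1_001.fastq.gz", "sorted_S1_L002_R2_001.fastq.gz"),
    ("sorted_S1_L002_R1_002.fastq.gz", "sorted_S1_L002_R2_002.fastq.gz"),
]


def check_pairs(pairs, counts):
    """Return one error message per pair whose files differ in read count."""
    errors = []
    for left, right in pairs:
        if counts[left] != counts[right]:
            errors.append(
                "Error: paired files have different numbers of sequences: "
                f"{left} has {counts[left]}, {right} has {counts[right]}"
            )
    return errors


errors_as_run = check_pairs(PAIRS_AS_RUN, COUNTS)
errors_r1_r2 = check_pairs(PAIRS_R1_R2, COUNTS)

for message in errors_as_run:
    print(message)
```

With the pairs from Run.sh every pair is flagged, while the R1/R2 pairing passes the check. Equal counts are only a necessary condition for correct pairing, so a check like this catches this particular mistake but cannot prove that two files really are mates.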