sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Segmentation fault on 2048 MPI ranks #134

Closed sebhtml closed 11 years ago

sebhtml commented 11 years ago

This is from REDACTED, from last night's run with 2048 MPI tasks (from stderr):

Stack walkback for Rank 1992 starting:

_start@start.S:113
__libc_start_main@0x2aaab0d74c35
main@ray_main.cpp:32
RankProcess::run()@RankProcess.h:214
RankProcess::startMiniRank()@RankProcess.h:294
Machine::run()@stl_construct.h:83
Machine::start()@stl_construct.h:83
ComputeCore::run()@stl_iterator.h:858
ComputeCore::runVanilla()@stl_iterator.h:858
ComputeCore::processData()@stl_iterator.h:858
SlaveModeExecutor::callHandler(int)@stl_iterator.h:858
Adapter_RAY_SLAVE_MODE_ADD_EDGES::call()@stl_algobase.h:217
VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217
Read::getSeq(char*, bool, bool) const@stl_algobase.h:217

Stack walkback for Rank 1992 done

Process died with signal 11: 'Segmentation fault'

Forcing core dumps of ranks 1992, 61, 218, 267, 271, 361, 510, 86, 1258, 8, 126, 293, 106, 373, 688, 59, 129, 145, 208, 17

The line numbers are not in Ray, but in the code that prints the stack.

sebhtml commented 11 years ago

VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217 => line 118
Read::getSeq(char*, bool, bool) const@stl_algobase.h:217 => unknown line, obviously (lines 161 to 174)

Read::getSeq():

void Read::getSeq(char*workingBuffer,bool color,bool doubleEncoding) const{
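        for(int position=0;position<m_length;position++){
                // [extraction gap: the lines that compute `code` from the packed sequence are missing; the last one ended with ">>6);"]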
                if(!doubleEncoding)
                        color=false;
                char nucleotide=codeToChar(code,color);
                workingBuffer[position]=nucleotide;
        }
        workingBuffer[m_length]='\0';
}

workingBuffer is 65536 bytes long, but reads in this sample are only 150 nt.

I don't see where the bug is, or why it works on the same machine with the same code but a different number of MPI ranks.
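Since the buffer is far larger than any read here, one way to narrow this down would be a checked variant of the decoding loop that refuses to run when the stored length is inconsistent with the buffer or with the packed data. The sketch below is only an illustration: `PackedRead`, `getSeqChecked` and `packRead` are hypothetical names, and the 2-bit packing order (lowest bits first) is an assumption, not Ray's actual `Read` layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 2-bit packed read (4 nucleotides per byte).
// The bit order used here is an assumption made for this sketch.
struct PackedRead {
    std::vector<std::uint8_t> packed;
    int length; // number of nucleotides

    // Decodes into workingBuffer; returns false instead of reading or
    // writing out of bounds when the stored length looks corrupt.
    bool getSeqChecked(char* workingBuffer, std::size_t bufferSize) const {
        if (workingBuffer == nullptr || length < 0) return false;
        std::size_t n = static_cast<std::size_t>(length);
        if (n + 1 > bufferSize) return false;     // no room for the '\0'
        if (packed.size() * 4 < n) return false;  // length exceeds stored data
        static const char symbols[] = {'A', 'C', 'G', 'T'};
        for (std::size_t position = 0; position < n; position++) {
            std::uint8_t word = packed[position / 4];
            int shift = static_cast<int>(position % 4) * 2;
            std::uint8_t code = (word >> shift) & 0x3;
            workingBuffer[position] = symbols[code];
        }
        workingBuffer[n] = '\0';
        return true;
    }
};

// Packs an A/C/G/T string; any other character is stored as 'A'.
PackedRead packRead(const char* sequence) {
    PackedRead read;
    read.length = static_cast<int>(std::strlen(sequence));
    read.packed.assign((read.length + 3) / 4, 0);
    for (int i = 0; i < read.length; i++) {
        std::uint8_t code = 0;
        switch (sequence[i]) {
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default:  code = 0; break;
        }
        read.packed[i / 4] |= static_cast<std::uint8_t>(code << ((i % 4) * 2));
    }
    return read;
}
```

With checks like these, a read whose length field was overwritten would be reported by the rank that owns it, rather than surfacing as signal 11 inside the loop.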

sebhtml commented 11 years ago

VFS path: /rap/nne-790-ac/Cray/2048-2013-01-10-1-Seg-Fault

Stack:

$ head ray_n2048.e470670 -n 20
Application 8277109 is crashing. ATP analysis proceeding...
Stack walkback for Rank 1992 starting:
  _start@start.S:113
  __libc_start_main@0x2aaab0d74c35
  main@ray_main.cpp:32
  RankProcess::run()@RankProcess.h:214
  RankProcess::startMiniRank()@RankProcess.h:294
  Machine::run()@stl_construct.h:83
  Machine::start()@stl_construct.h:83
  ComputeCore::run()@stl_iterator.h:858
  ComputeCore::runVanilla()@stl_iterator.h:858
  ComputeCore::processData()@stl_iterator.h:858
  SlaveModeExecutor::callHandler(int)@stl_iterator.h:858
  Adapter_RAY_SLAVE_MODE_ADD_EDGES::call()@stl_algobase.h:217
  VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217
  Read::getSeq(char*, bool, bool) const@stl_algobase.h:217
Stack walkback for Rank 1992 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 1992, 61, 218, 267, 271, 361, 510, 86, 1258, 8, 126, 293, 106, 373, 688, 59, 129, 145, 208, 17

stdout for Rank 1992 for the previous step:

stdout for Rank 1992 for the faulty step:

$ cat ray_n2048.o470670|grep "Rank 1992"|tail -n 20
Rank 1992 is counting k-mers in sequence reads [470001/571951]
Rank 1992 is counting k-mers in sequence reads [480001/571951]
Rank 1992 is counting k-mers in sequence reads [490001/571951]
Rank 1992 has 2500000 vertices
Rank 1992: assembler memory usage: 3406348 KiB
Rank 1992 is counting k-mers in sequence reads [500001/571951]
Rank 1992 is counting k-mers in sequence reads [510001/571951]
Rank 1992 is counting k-mers in sequence reads [520001/571951]
Rank 1992 is counting k-mers in sequence reads [530001/571951]
Rank 1992 is counting k-mers in sequence reads [540001/571951]
Rank 1992 is counting k-mers in sequence reads [550001/571951]
Rank 1992 is counting k-mers in sequence reads [560001/571951]
Rank 1992 is counting k-mers in sequence reads [570001/571951]
Rank 1992 is counting k-mers in sequence reads [571951/571951] (completed)
Rank 1992 : VirtualCommunicator (service provided by BufferedData): 74343426 virtual messages generated 149705 real messages (0.20137%)
Rank 1992 number of set bits in the Bloom filter: [ 23772829 / 268435456 ] (8.85607%)
Rank 1992 destroyed its Bloom filter
Rank 1992 has 2556244 k-mers (completed)
[BloomFilter] Rank 1992: k-mers sampled -> 6187068, k-mers dropped -> 3630824 (58.6841%), k-mers accepted -> 2556244 (41.3159%)
Rank 1992: assembler memory usage: 3407388 KiB

Faulty slave mode: RAY_SLAVE_MODE_ADD_EDGES

$ grep -n getSeq code//plugin_VerticesExtractor/VerticesExtractor.cpp
118:            (*m_myReads)[(m_mode_send_vertices_sequence_id)]->getSeq(m_readSequence,m_parameters->getColorSpaceMode(),false);

Previous (successful) slave mode: RAY_SLAVE_MODE_ADD_VERTICES

$ grep -n getSeq code//plugin_KmerAcademyBuilder/KmerAcademyBuilder.cpp
119:            (*m_myReads)[(m_mode_send_vertices_sequence_id)]->getSeq(m_readSequence,m_parameters->getColorSpaceMode(),false);

Presumably this is related to buggy variable scope in Ray that involves a race condition, because this code works with 512 ranks on the Cray XE6, and a Cray person ran 4096-MPI-rank jobs successfully too.
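Both call sites above index the read container with the same member variable before calling getSeq, so a stale index or an empty slot would crash in exactly this frame. A minimal sketch of a guarded lookup follows. It assumes a plain `std::vector` of pointers, which is not necessarily how Ray stores `m_myReads`, and `Read` and `readAtOrNull` are hypothetical placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical placeholder type; not Ray's Read class.
struct Read {
    int length;
};

// Returns the read at `index`, or nullptr when the index is negative,
// past the end of the container, or the slot holds a null pointer.
const Read* readAtOrNull(const std::vector<Read*>& reads, long index) {
    if (index < 0) return nullptr;
    std::size_t i = static_cast<std::size_t>(index);
    if (i >= reads.size()) return nullptr;
    return reads[i];
}
```

A caller would test the result and log the rank number and index before aborting, which would show whether the index or the read itself is bad on rank 1992.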

sebhtml commented 11 years ago

Regarding the segmentation fault, I don't understand why the memory usage would be 3315548 KiB before loading any sequence.

In /rap/nne-790-ac/Cray/2048-2013-01-10-1-Seg-Fault/ray_n2048.o470670 (on Colosse):

Rank 1992 has 0 sequence reads
Rank 1992: assembler memory usage: 3315548 KiB

From a 512-rank job on the same dataset on Colosse (log HiSeq-2500-NA12878-demo-2x150-2012-12-18-1.stdout):

Rank 258 has 0 sequence reads
Rank 258: assembler memory usage: 210692 KiB
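To put the two figures side by side: 3315548 KiB is about 3.16 GiB, roughly 15.7 times the 210692 KiB seen in the 512-rank job. A small helper for pulling these values out of the logs could look like the sketch below. `parseMemoryUsageKiB` and `kibToGiB` are hypothetical names, and the only assumption about the log format is the "assembler memory usage: N KiB" wording visible above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Extracts N from a line containing "assembler memory usage: N KiB".
// Returns -1 when the line does not match that pattern.
long parseMemoryUsageKiB(const std::string& line) {
    const std::string marker = "assembler memory usage: ";
    std::size_t start = line.find(marker);
    if (start == std::string::npos) return -1;
    start += marker.size();
    std::size_t end = line.find(" KiB", start);
    if (end == std::string::npos || end == start) return -1;
    long value = 0;
    for (std::size_t i = start; i < end; i++) {
        if (line[i] < '0' || line[i] > '9') return -1;
        value = value * 10 + (line[i] - '0');
    }
    return value;
}

// Converts KiB to GiB (1 GiB = 1024 * 1024 KiB).
double kibToGiB(long kib) {
    return static_cast<double>(kib) / (1024.0 * 1024.0);
}
```

Running this over the "has 0 sequence reads" stage for every rank in both logs would show whether the high baseline is specific to some ranks or common to the whole 2048-rank job.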

sebhtml commented 11 years ago

Probably fixed by this one:

https://github.com/sebhtml/RayPlatform/commit/d78e7ec5037c9c9e8a08160cb83864bbe67f658c