Closed sebhtml closed 11 years ago
VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217 => line 118
Read::getSeq(char*, bool, bool) const@stl_algobase.h:217 => unknown line, obviously (lines 161 to 174)
Read::getSeq():
void Read::getSeq(char*workingBuffer,bool color,bool doubleEncoding) const{
	for(int position=0;position< [... lost in extraction ...] >>6);
		if(!doubleEncoding)
			color=false;
		char nucleotide=codeToChar(code,color);
		workingBuffer[position]=nucleotide;
	}
	workingBuffer[m_length]='\0';
}
workingBuffer is 65536 long, but reads in this sample are 150 nt.
I don't see where the bug is, or why the same code on the same machine works with a different number of MPI ranks.
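One way to narrow this down is to make the decode refuse to write past the buffer it is given, so that a corrupted length or a dangling Read pointer shows up as a reported failure rather than a segmentation fault. The following is only a minimal sketch of that idea, not Ray's actual code: the 4-bases-per-byte layout, the A/C/G/T code order, and every name ending in `Sketch` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for codeToChar(); the A/C/G/T order is an assumption.
static char codeToCharSketch(std::uint8_t code) {
    static const char kAlphabet[4] = {'A', 'C', 'G', 'T'};
    return kAlphabet[code & 0x3];
}

// Hypothetical inverse of codeToCharSketch(), used only to build test input.
static std::uint8_t charToCodeSketch(char nucleotide) {
    switch (nucleotide) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A' and anything unexpected
    }
}

// Pack a nucleotide string at 2 bits per base, 4 bases per byte,
// lowest bits first (assumed layout, not necessarily Ray's).
static std::vector<std::uint8_t> packSketch(const char* seq, std::size_t length) {
    std::vector<std::uint8_t> packed((length + 3) / 4, 0);
    for (std::size_t i = 0; i < length; ++i) {
        const int shift = static_cast<int>(i % 4) * 2;
        packed[i / 4] |= static_cast<std::uint8_t>(charToCodeSketch(seq[i]) << shift);
    }
    return packed;
}

// Decode `length` bases into `out`. Returns false, without writing anything,
// if a pointer is null or if `length` bases plus the terminator would not
// fit in `outCapacity` bytes.
static bool unpackCheckedSketch(const std::uint8_t* packed, std::size_t length,
                                char* out, std::size_t outCapacity) {
    if (packed == nullptr || out == nullptr || length + 1 > outCapacity) {
        return false;
    }
    for (std::size_t i = 0; i < length; ++i) {
        const int shift = static_cast<int>(i % 4) * 2;
        const std::uint8_t code = (packed[i / 4] >> shift) & 0x3;
        out[i] = codeToCharSketch(code);
    }
    out[length] = '\0';
    return true;
}
```

With a 65536-byte buffer and 150 nt reads, the capacity check should never fail in a healthy run; if it does fail on some rank, the length stored in the read (or the read object itself) is the thing to inspect.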
VFS path: /rap/nne-790-ac/Cray/2048-2013-01-10-1-Seg-Fault
Stack:
$ head ray_n2048.e470670 -n 20
Application 8277109 is crashing. ATP analysis proceeding...
Stack walkback for Rank 1992 starting:
_start@start.S:113
__libc_start_main@0x2aaab0d74c35
main@ray_main.cpp:32
RankProcess::run()@RankProcess.h:214
RankProcess::startMiniRank()@RankProcess.h:294
Machine::run()@stl_construct.h:83
Machine::start()@stl_construct.h:83
ComputeCore::run()@stl_iterator.h:858
ComputeCore::runVanilla()@stl_iterator.h:858
ComputeCore::processData()@stl_iterator.h:858
SlaveModeExecutor::callHandler(int)@stl_iterator.h:858
Adapter_RAY_SLAVE_MODE_ADD_EDGES::call()@stl_algobase.h:217
VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217
Read::getSeq(char*, bool, bool) const@stl_algobase.h:217
Stack walkback for Rank 1992 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 1992, 61, 218, 267, 271, 361, 510, 86, 1258, 8, 126, 293, 106, 373, 688, 59, 129, 145, 208, 17
stdout for Rank 1992 for the previous step:
stdout for Rank 1992 for the faulty step:
$ cat ray_n2048.o470670|grep "Rank 1992"|tail -n 20
Rank 1992 is counting k-mers in sequence reads [470001/571951]
Rank 1992 is counting k-mers in sequence reads [480001/571951]
Rank 1992 is counting k-mers in sequence reads [490001/571951]
Rank 1992 has 2500000 vertices
Rank 1992: assembler memory usage: 3406348 KiB
Rank 1992 is counting k-mers in sequence reads [500001/571951]
Rank 1992 is counting k-mers in sequence reads [510001/571951]
Rank 1992 is counting k-mers in sequence reads [520001/571951]
Rank 1992 is counting k-mers in sequence reads [530001/571951]
Rank 1992 is counting k-mers in sequence reads [540001/571951]
Rank 1992 is counting k-mers in sequence reads [550001/571951]
Rank 1992 is counting k-mers in sequence reads [560001/571951]
Rank 1992 is counting k-mers in sequence reads [570001/571951]
Rank 1992 is counting k-mers in sequence reads [571951/571951] (completed)
Rank 1992 : VirtualCommunicator (service provided by BufferedData): 74343426 virtual messages generated 149705 real messages (0.20137%)
Rank 1992 number of set bits in the Bloom filter: [ 23772829 / 268435456 ] (8.85607%)
Rank 1992 destroyed its Bloom filter
Rank 1992 has 2556244 k-mers (completed)
[BloomFilter] Rank 1992: k-mers sampled -> 6187068, k-mers dropped -> 3630824 (58.6841%), k-mers accepted -> 2556244 (41.3159%)
Rank 1992: assembler memory usage: 3407388 KiB
Faulty slave mode: RAY_SLAVE_MODE_ADD_EDGES
$ grep -n getSeq code//plugin_VerticesExtractor/VerticesExtractor.cpp
118: (*m_myReads)[(m_mode_send_vertices_sequence_id)]->getSeq(m_readSequence,m_parameters->getColorSpaceMode(),false);
Previous (successful) slave mode: RAY_SLAVE_MODE_ADD_VERTICES
$ grep -n getSeq code//plugin_KmerAcademyBuilder/KmerAcademyBuilder.cpp
119: (*m_myReads)[(m_mode_send_vertices_sequence_id)]->getSeq(m_readSequence,m_parameters->getColorSpaceMode(),false);
It may be related to buggy variable scoping in Ray that involves a race condition, because this code works with 512 MPI ranks on a Cray XE6, and someone at Cray ran 4096-MPI-rank jobs successfully too.
Regarding the segmentation fault, I don't understand why the memory usage would be 3315548 KiB before loading any sequence.
In /rap/nne-790-ac/Cray/2048-2013-01-10-1-Seg-Fault/ray_n2048.o470670 (on Colosse):
Rank 1992 has 0 sequence reads
Rank 1992: assembler memory usage: 3315548 KiB
From a 512-rank job on the same dataset on Colosse: ( log HiSeq-2500-NA12878-demo-2x150-2012-12-18-1.stdout ):
Rank 258 has 0 sequence reads
Rank 258: assembler memory usage: 210692 KiB
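To put the two baselines side by side, here is a small arithmetic sketch using only the two figures quoted from the logs above. The constant and function names are made up for illustration; nothing here comes from Ray itself.

```cpp
#include <cstdint>
#include <cstdio>

// Memory usage reported before any sequence is loaded, copied from the logs.
constexpr std::uint64_t kUsage2048RanksKiB = 3315548;  // Rank 1992, 2048-rank job
constexpr std::uint64_t kUsage512RanksKiB  = 210692;   // Rank 258, 512-rank job

// Convert KiB to GiB (1 GiB = 1024 * 1024 KiB).
constexpr double kibToGib(std::uint64_t kib) {
    return static_cast<double>(kib) / (1024.0 * 1024.0);
}

// How many times larger the 2048-rank baseline is than the 512-rank one.
constexpr double baselineRatio() {
    return static_cast<double>(kUsage2048RanksKiB) /
           static_cast<double>(kUsage512RanksKiB);
}

// Extra memory per rank in the 2048-rank job, in KiB.
constexpr std::uint64_t extraKiB() {
    return kUsage2048RanksKiB - kUsage512RanksKiB;
}

// Print the comparison in a form that is easy to paste into the issue.
inline void printBaselineComparison() {
    std::printf("2048 ranks: %.2f GiB\n", kibToGib(kUsage2048RanksKiB));
    std::printf(" 512 ranks: %.2f GiB\n", kibToGib(kUsage512RanksKiB));
    std::printf("ratio: %.1fx, extra: %.2f GiB per rank\n",
                baselineRatio(), kibToGib(extraKiB()));
}
```

The 2048-rank job starts at roughly 3.16 GiB per rank against about 0.20 GiB for the 512-rank job, so almost 3 GiB per rank is allocated before any read is loaded.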
Probably fixed by this one:
https://github.com/sebhtml/RayPlatform/commit/d78e7ec5037c9c9e8a08160cb83864bbe67f658c
This is from REDACTED, from last night's run with 2048 MPI tasks (from stderr):
Stack walkback for Rank 1992 starting:
_start@start.S:113
__libc_start_main@0x2aaab0d74c35
main@ray_main.cpp:32
RankProcess::run()@RankProcess.h:214
RankProcess::startMiniRank()@RankProcess.h:294
Machine::run()@stl_construct.h:83
Machine::start()@stl_construct.h:83
ComputeCore::run()@stl_iterator.h:858
ComputeCore::runVanilla()@stl_iterator.h:858
ComputeCore::processData()@stl_iterator.h:858
SlaveModeExecutor::callHandler(int)@stl_iterator.h:858
Adapter_RAY_SLAVE_MODE_ADD_EDGES::call()@stl_algobase.h:217
VerticesExtractor::call_RAY_SLAVE_MODE_ADD_EDGES()@stl_algobase.h:217
Read::getSeq(char*, bool, bool) const@stl_algobase.h:217
Stack walkback for Rank 1992 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 1992, 61, 218, 267, 271, 361, 510, 86, 1258, 8, 126, 293, 106, 373, 688, 59, 129, 145, 208, 17
The line numbers are not in Ray, but in the code that prints the stack.