sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Segfault when using mini-ranks-per-rank #220

Closed bachev closed 10 years ago

bachev commented 10 years ago

When using mpiexec -n 1 /opt/biosw/ray/Ray -mini-ranks-per-rank 3 -o test -p f1.fastq f2.fastq -k 31

I get segfaults (see below) when running Ray 2.3.0 and Ray 2.2.0. Reproduced on Kubuntu 9.10 and 12.04 (two different machines).

Best, B.

Rank 0 wrote test/RayCommand.txt

k-mer length: 31
Rank 1: assembler memory usage: 257772 KiB
Rank 2: assembler memory usage: 323312 KiB
Rank 0: assembler memory usage: 405240 KiB
Rank 1: assembler memory usage: 470780 KiB
Rank 1: Rank= 1 Size= 3 ProcessIdentifier= 10908
Rank 2: assembler memory usage: 470780 KiB
Rank 2: Rank= 2 Size= 3 ProcessIdentifier= 10908
Rank 0: assembler memory usage: 470780 KiB
Rank 0: Rank= 0 Size= 3 ProcessIdentifier= 10908
Rank 0: testing the network, please wait...

[arcadia:10908] *** Process received signal ***
[arcadia:10908] Signal: Segmentation fault (11)
[arcadia:10908] Signal code: Address not mapped (1)
[arcadia:10908] Failing at address: 0x20abc24e0
[arcadia:10908] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5e90f23cb0]
[arcadia:10908] [ 1] /opt/biosw/ray-2.3.0/Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5976d0]
[arcadia:10908] [ 2] /opt/biosw/ray-2.3.0/Ray(_ZN13RingAllocator14registerBufferEPv+0x31) [0x596761]
[arcadia:10908] [ 3] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4d) [0x5ae73d]
[arcadia:10908] [ 4] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5aebac]
[arcadia:10908] [ 5] /opt/biosw/ray-2.3.0/Ray(main+0x1ed) [0x47182d]
[arcadia:10908] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f5e90b7576d]
[arcadia:10908] [ 7] /opt/biosw/ray-2.3.0/Ray() [0x4732d1]
[arcadia:10908] *** End of error message ***

mpiexec noticed that process rank 0 with PID 10908 on node arcadia exited on signal 11 (Segmentation fault).

sebhtml commented 10 years ago

Hi,

With 2.2.0, it works for me:

Command:

#!/bin/bash

rm -rf popo8

mpiexec -n 1 ./Ray -mini-ranks-per-rank 8 -o popo8 \
 -p data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
 data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq \

(this is from https://github.com/sebhtml/Ray-TestSuite/blob/master/robustness-tests/test-mini-ranks.sh )

I can see the 900% CPU utilization here:

Tasks: 885 total, 2 running, 881 sleeping, 0 stopped, 2 zombie
Cpu(s): 32.0%us, 4.8%sy, 0.0%ni, 63.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32874744k total, 8439424k used, 24435320k free, 13720k buffers
Swap: 50331640k total, 0k used, 50331640k free, 4763576k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12944 boisver1 20 0 2019m 1.7g 4564 R 897.0 5.5 18:39.62 Ray <=======================

sebhtml commented 10 years ago

With 2.3.0, I confirm that the bug is reproducible. I added it to the milestone 2.3.1.

I get:

[cp1833:14296] [ 0] /lib64/libpthread.so.0() [0x365260f500]
[cp1833:14296] [ 1] ./Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5a0880]
[cp1833:14296] [ 2] ./Ray(_ZN13RingAllocator14registerBufferEPv+0x40) [0x59f930]
[cp1833:14296] [ 3] ./Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4b) [0x5b82ab]
[cp1833:14296] [ 4] ./Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5b873c]
[cp1833:14296] [ 5] ./Ray(main+0x1d5) [0x484405]
[cp1833:14296] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x365221ecdd]
[cp1833:14296] [ 7] ./Ray() [0x481631]
[cp1833:14296] *** End of error message ***
Rank 1: assembler memory usage: 734136 KiB
Rank 1: Rank= 1 Size= 8 ProcessIdentifier= 14296

sebhtml commented 10 years ago

Trace:

(gdb) bt

#0  DirtyBuffer::getBuffer (this=0x2b8e28671e80) at RayPlatform/memory/DirtyBuffer.cpp:70
#1  0x00000000005c5242 in RingAllocator::registerBuffer (this=0x2b94a18f5220, buffer=0x2b94d14f5ff8) at RayPlatform/memory/RingAllocator.cpp:491
#2  0x00000000005e34ad in registerMessageBuffer (this=0x7fff4b0ee4f8, outbox=0x2b94a18f5208, outboxBufferAllocator=0x2b94a18f5220, miniRanksPerRank=8)
    at RayPlatform/communication/MessagesHandler.cpp:975
#3  MessagesHandler::sendMessagesForMiniRank (this=0x7fff4b0ee4f8, outbox=0x2b94a18f5208, outboxBufferAllocator=0x2b94a18f5220, miniRanksPerRank=8)
    at RayPlatform/communication/MessagesHandler.cpp:230
#4  0x00000000005e38f4 in MessagesHandler::sendAndReceiveMessagesForRankProcess (this=0x7fff4b0ee4f8, cores=0x7fff4b0ef900, miniRanksPerRank=8, communicate=0x7fff4b0ee4f0)
    at RayPlatform/communication/MessagesHandler.cpp:77
#5  0x0000000000487c14 in startMiniRanks (this=0x7fff4b0ee4e0) at RayPlatform/RayPlatform/core/RankProcess.h:288
#6  RankProcess::run (this=0x7fff4b0ee4e0) at RayPlatform/RayPlatform/core/RankProcess.h:232
#7  0x0000000000487f67 in main (argc=8, argv=0x7fff4b0efc08) at code/application_core/ray_main.cpp:32

sebhtml commented 10 years ago

There is a buffer problem:

Ray: RayPlatform/memory/RingAllocator.cpp:265: int RingAllocator::getBufferHandle(void*): Assertion `bufferValue >= originValue' failed.
Error: buffer is too low: 0x2b5538d60ff8 but base is 0x2b5538e5c040 (difference: 0xFB048)
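The assertion compares the address of the buffer being registered with the base address of the allocator. Below is a minimal sketch of that kind of check, using the two addresses from the error message. The function names `isAtOrAboveBase` and `distanceBelowBase` are hypothetical and are not RayPlatform code.

```cpp
#include <cstdint>

// Hypothetical stand-alone helpers (not taken from RingAllocator.cpp).

// True when the buffer address is not below the allocator's base address,
// i.e. the condition `bufferValue >= originValue' holds.
bool isAtOrAboveBase(std::uint64_t bufferValue, std::uint64_t originValue) {
    return bufferValue >= originValue;
}

// Number of bytes by which the buffer lies below the base (0 if it does not).
std::uint64_t distanceBelowBase(std::uint64_t bufferValue, std::uint64_t originValue) {
    return bufferValue < originValue ? originValue - bufferValue : 0;
}
```

With `bufferValue = 0x2b5538d60ff8` and `originValue = 0x2b5538e5c040`, the check fails and the distance is `0xFB048`, which matches the difference reported above. In other words, the pointer handed to this allocator does not lie inside the region that starts at its base.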

Full stack trace:

(gdb) bt

#0  0x00000036522328a5 in raise () from /lib64/libc.so.6
#1  0x000000365223400d in abort () from /lib64/libc.so.6
#2  0x000000365222ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x000000365222bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000005c595f in getBufferHandle (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:265
#5  getBufferHandle (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:487
#6  markBufferAsDirty (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:234
#7  RingAllocator::registerBuffer (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:526
#8  0x00000000005e3d8f in registerMessageBuffer (this=0x7fff2faa8d38, outbox=0x2b962586d208, outboxBufferAllocator=0x2b962586d220, miniRanksPerRank=3)
    at RayPlatform/communication/MessagesHandler.cpp:976
#9  MessagesHandler::sendMessagesForMiniRank (this=0x7fff2faa8d38, outbox=0x2b962586d208, outboxBufferAllocator=0x2b962586d220, miniRanksPerRank=3)
    at RayPlatform/communication/MessagesHandler.cpp:228
#10 0x00000000005e41d4 in MessagesHandler::sendAndReceiveMessagesForRankProcess (this=0x7fff2faa8d38, cores=0x7fff2faaa140, miniRanksPerRank=3, communicate=0x7fff2faa8d30)
    at RayPlatform/communication/MessagesHandler.cpp:77
#11 0x0000000000487db4 in startMiniRanks (this=0x7fff2faa8d20) at RayPlatform/RayPlatform/core/RankProcess.h:288
#12 RankProcess::run (this=0x7fff2faa8d20) at RayPlatform/RayPlatform/core/RankProcess.h:232
#13 0x0000000000488107 in main (argc=8, argv=0x7fff2faaa448) at code/application_core/ray_main.cpp:32

sebhtml commented 10 years ago

Hi,

I (probably) found the issue.

In the message handler code, this was used to register the dirty buffer:

request = this->registerMessageBuffer(buffer, m_rank, destination, tag, outboxBufferAllocator);

However, buffer is not thread-safe.

Working on a patch now.

sebhtml commented 10 years ago

3 Mixed messages with tag 17:

Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
[cp2035:15087] *** Process received signal ***
[cp2035:15087] Signal: Aborted (6)
[cp2035:15087] Signal code: (-6)
Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a88d96160 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00Source: 0 Destination: 1 RealTag: 17Source: 0x000 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0x00 0x00 0x00 Count: 0x00 0x005 0x00 Overlay: 0x00 0x00021 0x00 Bytes: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Buffer: 0x1213ca0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Original messages:

[Communication] 27 microseconds, SEND Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a98f67000 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

[Communication] 27 microseconds, SEND Source: 0 Destination: 1 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 1 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 1 Buffer: 0x2b5a98f67fc0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00

[Communication] 27 microseconds, SEND Source: 0 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 2 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 2 Buffer: 0x2b5a98f68f80 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Original message buffer (analysis) from 0 to 2:

Message has 44 bytes
header is always 28 bytes
data: 16 bytes (first)
AMD Opteron is Little Endian too!

0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // kmer length is 21
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // application data (unknown)
0x00 0x00 0x00 0x00 // source actor
0x02 0x00 0x00 0x00 // destination actor
0xff 0xff 0xff 0xff // routing source (-1)
0xff 0xff 0xff 0xff // routing destination (-1)
0x00 0x00 0x00 0x00 // minirank source (0)
0x02 0x00 0x00 0x00 // minirank destination (2)
0x00 0x00 0x00 0x00 // CRC32 (not active)
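The byte-level analysis above can be reproduced with a small decoder. This is only a sketch based on the layout in the annotated dump (16 bytes of data first, then a 28-byte header of seven little-endian 32-bit fields). The names `DecodedHeader`, `readInt32LE` and `decodeHeader` are hypothetical and do not come from RayPlatform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of the 28-byte header described in the dump above.
struct DecodedHeader {
    std::int32_t sourceActor;
    std::int32_t destinationActor;
    std::int32_t routingSource;
    std::int32_t routingDestination;
    std::int32_t miniRankSource;
    std::int32_t miniRankDestination;
    std::uint32_t checksum;  // CRC32 slot (not active in this run)
};

const std::size_t HEADER_BYTES = 28;

// Assemble a signed 32-bit value from 4 little-endian bytes.
std::int32_t readInt32LE(const unsigned char *bytes) {
    std::uint32_t raw = static_cast<std::uint32_t>(bytes[0])
        | (static_cast<std::uint32_t>(bytes[1]) << 8)
        | (static_cast<std::uint32_t>(bytes[2]) << 16)
        | (static_cast<std::uint32_t>(bytes[3]) << 24);
    std::int32_t value;
    std::memcpy(&value, &raw, sizeof(value));  // reinterpret bits, so 0xffffffff becomes -1
    return value;
}

// Decode the header stored in the last 28 bytes of the message.
// The caller must pass totalBytes >= HEADER_BYTES.
DecodedHeader decodeHeader(const unsigned char *message, std::size_t totalBytes) {
    const unsigned char *h = message + (totalBytes - HEADER_BYTES);
    DecodedHeader d;
    d.sourceActor = readInt32LE(h + 0);
    d.destinationActor = readInt32LE(h + 4);
    d.routingSource = readInt32LE(h + 8);
    d.routingDestination = readInt32LE(h + 12);
    d.miniRankSource = readInt32LE(h + 16);
    d.miniRankDestination = readInt32LE(h + 20);
    d.checksum = static_cast<std::uint32_t>(readInt32LE(h + 24));
    return d;
}
```

Decoding the original 0-to-2 message this way yields routing source -1 and mini-rank destination 2. Decoding a buffer whose last 28 bytes are all zero yields routing source 0, which is the value that trips the `m_routingSource == -1` assertion.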

Received message:

44 bytes: (no header) 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

sebhtml commented 10 years ago

I think this is the only related issue:

Rank 1 registered its seeds
VirtualProcessor: completed jobs: 0 Rank 1 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (VirtualProcessor: completed jobs: 0VirtualProcessor: completed jobs: Rank 2 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (0 Rank 0 : VirtualCommunicator (service provided by VirtualCommunicator): 00 virtual messages generated 0 real messages (%) 0%) 0%)
Rank 0 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 1 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 2 freed 20971520 bytes from the path memory pool (chunks: 5)
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.

Related?: https://github.com/sebhtml/ray/issues/222

sebhtml commented 10 years ago

Tag is 224.

(gdb) bt

#0  0x0000003fa48328a5 in raise () from /lib64/libc.so.6
#1  0x0000003fa4834085 in abort () from /lib64/libc.so.6
#2  0x0000003fa482ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003fa482bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000005e3c1f in Message::runAssertions (this=0x2b1ceae91860, size=3, routing=Unhandled dwarf expression opcode 0xf3
) at RayPlatform/communication/Message.cpp:499
#5  0x00000000005e9334 in testMessage (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:2389
#6  ComputeCore::sendMessages (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:807
#7  0x00000000005ed6d9 in ComputeCore::runWithProfiler (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:555
#8  0x00000000005eea48 in ComputeCore::run (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:198
#9  0x000000000048a4bc in Machine::start (this=0x2b1ce7260040) at code/application_core/Machine.cpp:560
#10 0x00000000004841e9 in Rank_startMiniRank (object=0x2b1ce7260040) at RayPlatform/RayPlatform/core/RankProcess.h:208
#11 0x0000003fa4c07851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003fa48e890d in clone () from /lib64/libc.so.6

(gdb) f 4

#4  0x00000000005e3c1f in Message::runAssertions (this=0x2b1ceae91860, size=3, routing=Unhandled dwarf expression opcode 0xf3
) at RayPlatform/communication/Message.cpp:499
499 assert(m_destination < size);
(gdb) info locals
__PRETTY_FUNCTION__ = "void Message::runAssertions(int, bool, bool)"
(gdb) p this->m_buffer
$1 = (void *) 0x2b1cfafbbac0
(gdb) p this->m_bytes
$2 = 28
(gdb) p this->m_destination
$3 = 3
(gdb) p this->m_source
$4 = 0
(gdb) p this->m_tag
$5 = 224
(gdb) p this->m_miniRankSource
$6 = 0
(gdb) p this->m_miniRankDestination
$7 = 3
(gdb) quit
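The gdb session shows a destination of 3 with a size of 3, so the destination is one past the last valid rank (0, 1, 2). A minimal stand-alone sketch of the range check follows. The function name `isValidDestination` is hypothetical, and the lower-bound test is an addition for illustration; the assertion at Message.cpp:499 only checks `m_destination < size`.

```cpp
// Hypothetical helper, not RayPlatform code: a destination is valid when it
// names one of the ranks 0 .. size-1.
bool isValidDestination(int destination, int size) {
    return destination >= 0 && destination < size;
}
```

For the values printed above, `isValidDestination(3, 3)` is false, while destinations 0 through 2 pass.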

sebhtml commented 10 years ago

fixed by:

https://github.com/sebhtml/RayPlatform/commit/ce606c493dc110d8a4877b24f41152a6b0d60e76

https://github.com/sebhtml/RayPlatform/commit/15efbd70cda13f46f593b5a75be9b3f3ee683b81