Closed by bachev 10 years ago
Hi,
With 2.2.0, it works for me:
Command:
rm -rf popo8
mpiexec -n 1 ./Ray -mini-ranks-per-rank 8 -o popo8 \
 -p data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
 data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq \
(this is from https://github.com/sebhtml/Ray-TestSuite/blob/master/robustness-tests/test-mini-ranks.sh )
I can see the 900% CPU utilization here:
Tasks: 885 total, 2 running, 881 sleeping, 0 stopped, 2 zombie
Cpu(s): 32.0%us, 4.8%sy, 0.0%ni, 63.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32874744k total, 8439424k used, 24435320k free, 13720k buffers
Swap: 50331640k total, 0k used, 50331640k free, 4763576k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12944 boisver1 20 0 2019m 1.7g 4564 R 897.0 5.5 18:39.62 Ray <=======================
With 2.3.0, I confirm that the bug is reproducible. I added it to the milestone 2.3.1.
I get:
[cp1833:14296] [ 0] /lib64/libpthread.so.0() [0x365260f500]
[cp1833:14296] [ 1] ./Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5a0880]
[cp1833:14296] [ 2] ./Ray(_ZN13RingAllocator14registerBufferEPv+0x40) [0x59f930]
[cp1833:14296] [ 3] ./Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4b) [0x5b82ab]
[cp1833:14296] [ 4] ./Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5b873c]
[cp1833:14296] [ 5] ./Ray(main+0x1d5) [0x484405]
[cp1833:14296] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x365221ecdd]
[cp1833:14296] [ 7] ./Ray() [0x481631]
[cp1833:14296] *** End of error message ***
Rank 1: assembler memory usage: 734136 KiB
Rank 1: Rank= 1 Size= 8 ProcessIdentifier= 14296
Trace:
(gdb) bt
at RayPlatform/communication/MessagesHandler.cpp:975
at RayPlatform/communication/MessagesHandler.cpp:230
at RayPlatform/communication/MessagesHandler.cpp:77
There is a buffer problem:
Ray: RayPlatform/memory/RingAllocator.cpp:265: int RingAllocator::getBufferHandle(void*): Assertion `bufferValue >= originValue' failed.
Error: buffer is too low: 0x2b5538d60ff8 but base is 0x2b5538e5c040
(difference: 0xFB048)
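For anyone following along, here is a minimal sketch of the kind of check behind that assertion. It is not the RayPlatform source: `slotIndexFor`, the slot size and the slot count are made-up names and values for illustration only. The idea is that a handle is derived from the pointer's offset from the allocator's base, so a pointer that sits below the base (0xFB048 bytes below, in the error above) cannot have come from this allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, NOT Ray's actual code: maps an address to a slot index
// inside a contiguous pool of fixed-size slots, or returns -1 when the
// address does not belong to the pool.
int slotIndexFor(std::uintptr_t buffer, std::uintptr_t base,
                 std::size_t slotSize, std::size_t slotCount) {
    assert(slotSize > 0);  // precondition of this sketch

    // Same condition as the failed assertion: the address is below the base.
    if (buffer < base) {
        return -1;
    }

    std::uintptr_t offset = buffer - base;

    // The address must fall exactly on a slot boundary.
    if (offset % slotSize != 0) {
        return -1;
    }

    std::size_t index = offset / slotSize;

    // The address must not lie past the end of the pool.
    if (index >= slotCount) {
        return -1;
    }

    return static_cast<int>(index);
}
```

With the two addresses from the error message, `slotIndexFor(0x2b5538d60ff8, 0x2b5538e5c040, ...)` returns -1 for any slot size, which is the situation the assertion reports.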
Full stack trace:
(gdb) bt
at RayPlatform/communication/MessagesHandler.cpp:976
at RayPlatform/communication/MessagesHandler.cpp:228
at RayPlatform/communication/MessagesHandler.cpp:77
Hi,
I (probably) found the issue.
In the message handler code, this was used to register the dirty buffer:
request = this->registerMessageBuffer(buffer, m_rank, destination, tag, outboxBufferAllocator);
However, buffer is not thread-safe.
Working on a patch now.
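As a rough illustration of what "thread-safe" would mean here, and not the patch itself: with mini-ranks, several threads in one process can reach the same registration path, so any shared bookkeeping needs a lock (or each mini-rank needs its own allocator). `GuardedRegistry` below is a hypothetical stand-in, not a RayPlatform class.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a dirty-buffer registry, NOT RayPlatform's class.
// Every access to the shared vector goes through the same mutex, so
// concurrent callers cannot corrupt it.
class GuardedRegistry {
public:
    // Records a buffer and returns its handle (its position in the registry).
    int registerBuffer(void* buffer) {
        assert(buffer != nullptr);  // precondition of this sketch
        std::lock_guard<std::mutex> lock(m_mutex);
        m_buffers.push_back(buffer);
        return static_cast<int>(m_buffers.size() - 1);
    }

    // Number of buffers registered so far.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_buffers.size();
    }

private:
    mutable std::mutex m_mutex;    // guards m_buffers
    std::vector<void*> m_buffers;  // registered buffers, in registration order
};
```

The trade-off is contention on the lock; giving each mini-rank its own registry avoids the lock entirely but requires that a buffer is always registered with the allocator it came from.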
3 Mixed messages with tag 17:
Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
[cp2035:15087] *** Process received signal ***
[cp2035:15087] Signal: Aborted (6)
[cp2035:15087] Signal code: (-6)
Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a88d96160 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00Source: 0 Destination: 1 RealTag: 17Source: 0x000 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0x00 0x00 0x00 Count: 0x00 0x005 0x00 Overlay: 0x00 0x00021 0x00 Bytes: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Buffer: 0x1213ca0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Original messages:
[Communication] 27 microseconds, SEND Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a98f67000 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
[Communication] 27 microseconds, SEND Source: 0 Destination: 1 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 1 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 1 Buffer: 0x2b5a98f67fc0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
[Communication] 27 microseconds, SEND Source: 0 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 2 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 2 Buffer: 0x2b5a98f68f80 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Original message buffer (analysis) from 0 to 2:
Message has 44 bytes
header is always 28 bytes
data: 16 bytes (first)
AMD Opteron is Little Endian too!
0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // kmer length is 21
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // application data (unknown)
0x00 0x00 0x00 0x00 // source actor
0x02 0x00 0x00 0x00 // destination actor
0xff 0xff 0xff 0xff // routing source (-1)
0xff 0xff 0xff 0xff // routing destination (-1)
0x00 0x00 0x00 0x00 // minirank source (0)
0x02 0x00 0x00 0x00 // minirank destination (2)
0x00 0x00 0x00 0x00 // CRC32 (not active)
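To make the byte-level analysis easier to check, here is a small sketch that decodes the 28 metadata bytes that follow the 16 data bytes in the annotated dump. The field order and widths come from the annotations above; the `MessageMetadata` struct and the function names are my own and are not taken from RayPlatform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative layout only: seven 32-bit little-endian fields, in the order
// given by the annotated dump. Not the actual RayPlatform definition.
struct MessageMetadata {
    std::int32_t sourceActor;
    std::int32_t destinationActor;
    std::int32_t routingSource;
    std::int32_t routingDestination;
    std::int32_t miniRankSource;
    std::int32_t miniRankDestination;
    std::uint32_t checksum;  // CRC32 slot, all zero when not active
};

const std::size_t kMetadataBytes = 28;

// Reads one 32-bit little-endian value regardless of host byte order.
std::uint32_t readLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decodes the last 28 bytes of a message buffer into `out`.
// Returns false when the buffer is too short to contain the metadata.
bool decodeMetadata(const unsigned char* buffer, std::size_t bytes,
                    MessageMetadata& out) {
    assert(buffer != nullptr);
    if (bytes < kMetadataBytes) {
        return false;
    }
    const unsigned char* p = buffer + (bytes - kMetadataBytes);
    out.sourceActor         = static_cast<std::int32_t>(readLE32(p + 0));
    out.destinationActor    = static_cast<std::int32_t>(readLE32(p + 4));
    out.routingSource       = static_cast<std::int32_t>(readLE32(p + 8));
    out.routingDestination  = static_cast<std::int32_t>(readLE32(p + 12));
    out.miniRankSource      = static_cast<std::int32_t>(readLE32(p + 16));
    out.miniRankDestination = static_cast<std::int32_t>(readLE32(p + 20));
    out.checksum            = readLE32(p + 24);
    return true;
}
```

Fed the 44 bytes of the original message from 0 to 2, this gives destination actor 2, routing source and destination -1, and mini-rank destination 2. A buffer whose last 28 bytes are all zero decodes to routing source 0 instead of -1, which is the value the `m_routingSource == -1` assertion rejects.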
Received message:
44 bytes: (no header) 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
I think this is the only related issue:
Rank 1 registered its seeds
VirtualProcessor: completed jobs: 0
Rank 1 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (VirtualProcessor: completed jobs: 0VirtualProcessor: completed jobs:
Rank 2 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (0
Rank 0 : VirtualCommunicator (service provided by VirtualCommunicator): 00 virtual messages generated 0 real messages (%)
0%)
0%)
Rank 0 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 1 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 2 freed 20971520 bytes from the path memory pool (chunks: 5)
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Tag is 224.
(gdb) bt
) at RayPlatform/communication/Message.cpp:499
(gdb) f 4
) at RayPlatform/communication/Message.cpp:499
499 assert(m_destination < size);
(gdb) info locals
__PRETTY_FUNCTION__ = "void Message::runAssertions(int, bool, bool)"
(gdb) p this->m_buffer
$1 = (void *) 0x2b1cfafbbac0
(gdb) p this->m_bytes
$2 = 28
(gdb) p this->m_destination
$3 = 3
(gdb) p this->m_source
$4 = 0
(gdb) p this->m_tag
$5 = 224
(gdb) p this->m_miniRankSource
$6 = 0
(gdb) p this->m_miniRankDestination
$7 = 3
(gdb) quit
When using
mpiexec -n 1 /opt/biosw/ray/Ray -mini-ranks-per-rank 3 -o test -p f1.fastq f2.fastq -k 31
I get segfaults (see below) with both Ray 2.3.0 and Ray 2.2.0. Reproduced on Kubuntu 9.10 and 12.04 (two different machines).
Best, B.
Rank 0 wrote test/RayCommand.txt
k-mer length: 31
Rank 1: assembler memory usage: 257772 KiB
Rank 2: assembler memory usage: 323312 KiB
Rank 0: assembler memory usage: 405240 KiB
Rank 1: assembler memory usage: 470780 KiB
Rank 1: Rank= 1 Size= 3 ProcessIdentifier= 10908
Rank 2: assembler memory usage: 470780 KiB
Rank 2: Rank= 2 Size= 3 ProcessIdentifier= 10908
Rank 0: assembler memory usage: 470780 KiB
Rank 0: Rank= 0 Size= 3 ProcessIdentifier= 10908
Rank 0: testing the network, please wait...
[arcadia:10908] *** Process received signal ***
[arcadia:10908] Signal: Segmentation fault (11)
[arcadia:10908] Signal code: Address not mapped (1)
[arcadia:10908] Failing at address: 0x20abc24e0
[arcadia:10908] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5e90f23cb0]
[arcadia:10908] [ 1] /opt/biosw/ray-2.3.0/Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5976d0]
[arcadia:10908] [ 2] /opt/biosw/ray-2.3.0/Ray(_ZN13RingAllocator14registerBufferEPv+0x31) [0x596761]
[arcadia:10908] [ 3] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4d) [0x5ae73d]
[arcadia:10908] [ 4] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5aebac]
[arcadia:10908] [ 5] /opt/biosw/ray-2.3.0/Ray(main+0x1ed) [0x47182d]
[arcadia:10908] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f5e90b7576d]
[arcadia:10908] [ 7] /opt/biosw/ray-2.3.0/Ray() [0x4732d1]
[arcadia:10908] *** End of error message ***
mpiexec noticed that process rank 0 with PID 10908 on node arcadia exited on signal 11 (Segmentation fault).