rodarima / cpic

Particle in Cell simulation of plasma in C
GNU General Public License v3.0

Deadlock with 4 processes #12

Closed by rodarima 4 years ago

rodarima commented 4 years ago

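With 4 processes the run deadlocks. Almost all worker threads are stuck in recv_plist_y (src/comm_plasma.c:957), blocked in MPI_Recv through TAMPI: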

(gdb) ea
Thr  #   Function               Source
1    2   term_handler()         src/cpic.c:47
3    8   recv_plist_y()         src/comm_plasma.c:957
6    8   recv_plist_y()         src/comm_plasma.c:957
7    8   recv_plist_y()         src/comm_plasma.c:957
9    8   recv_plist_y()         src/comm_plasma.c:957
12   8   recv_plist_y()         src/comm_plasma.c:957
13   8   recv_plist_y()         src/comm_plasma.c:957
14   6   sim_pre_step()         src/sim.c:201
19   8   recv_plist_y()         src/comm_plasma.c:957
22   8   recv_plist_y()         src/comm_plasma.c:957
23   8   recv_plist_y()         src/comm_plasma.c:957
24   8   recv_plist_y()         src/comm_plasma.c:957
28   8   recv_plist_y()         src/comm_plasma.c:957
31   8   recv_plist_y()         src/comm_plasma.c:957
34   8   recv_plist_y()         src/comm_plasma.c:957
36   8   recv_plist_y()         src/comm_plasma.c:957
38   8   recv_plist_y()         src/comm_plasma.c:957
39   8   recv_plist_y()         src/comm_plasma.c:957

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f9865ffa740 (LWP 1875018))]
#0  0x00007f9876144cf5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0

(gdb) bt
#0  0x00007f9876144cf5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00007f9875e16ea1 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>)
    at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00007f9875a3befc in TaskBlocking::taskBlocks(WorkerThread*, Task*, ThreadManagerPolicy::thread_run_inline_policy_t) ()
   from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#4  0x00007f9875a31774 in nanos6_block_current_task () from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#5  0x00007f9876161b89 in nanos6_block_current_task (blocking_context=0x7f97f4206790) at loader/indirect-symbols/blocking.c:34
#6  0x00007f9876a344bb in ?? () from /usr/lib/libtampi-c.so.0
#7  0x00007f9876a39a53 in MPI_Recv () from /usr/lib/libtampi-c.so.0
#8  0x0000000000407ff2 in recv_plist_y (sim=0x7f97f416a8c0, l=0x7f97f4172dd0, src=3, ic=11) at src/comm_plasma.c:957
#9  0x000000000040792e in recv_pchunk_y (sim=0x7f97f416a8c0, c=0x7f97f416e080) at src/comm_plasma.c:1021
#10 0x0000000000408768 in nanos6_unpacked_task_region_exchange_plasma_y1 ()
#11 0x000000000040879c in nanos6_ol_task_region_exchange_plasma_y1 ()
#12 0x00007f9875a3e11e in ExecutionWorkflow::executeTask(Task*, ComputePlace*, MemoryPlace*) ()
   from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#13 0x00007f9875a1b4b8 in WorkerThread::handleTask(CPU*) () from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#14 0x00007f9875a1bc1b in WorkerThread::body() () from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#15 0x00007f9875a11c91 in kernel_level_thread_body_wrapper(void*) () from /usr/lib/libnanos6-optimized-linear-regions-fragmented.so
#16 0x00007f987613e46f in start_thread () from /usr/lib/libpthread.so.0
#17 0x00007f987606e3d3 in clone () from /usr/lib/libc.so.6
(gdb)

rodarima commented 4 years ago

Reproduced with 1 chunk and 4 processes. Process 1 finishes its reception phase but gets stuck in sending.

xeon07% mpirun -n 4 ./cpic conf/simd.conf
ENTRANDO EN MAIN
ENTRANDO EN MAIN
ENTRANDO EN MAIN
ENTRANDO EN MAIN
P0 src/cpic.c:107 : Using TAMPI with 4 processors
Initializing simulation
P0 src/sim.c:124 : Sampling enabled with relative error limit 1.000000e-03
P0 src/sim.c:178 : Global number of points (256 256 1)
P0 src/output.c:68  : No output path specified, output will not be saved
P1 src/sim.c:124 : Sampling enabled with relative error limit 1.000000e-03
P1 src/sim.c:178 : Global number of points (256 256 1)
P1 src/output.c:68  : No output path specified, output will not be saved
Initializing simulation
P1 src/sim.c:193 : begin sim_pre_step
P0 src/sim.c:193 : begin sim_pre_step
Initializing simulation
P2 src/sim.c:124 : Sampling enabled with relative error limit 1.000000e-03
P2 src/sim.c:178 : Global number of points (256 256 1)
P2 src/output.c:68  : No output path specified, output will not be saved
Initializing simulation
P3 src/sim.c:124 : Sampling enabled with relative error limit 1.000000e-03
P3 src/sim.c:178 : Global number of points (256 256 1)
P3 src/output.c:68  : No output path specified, output will not be saved
P1 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block0 proc1 -> proc0, chunk ic=0
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block0 proc0 -> proc1 into ic=0
P3 src/sim.c:193 : begin sim_pre_step
P2 src/sim.c:193 : begin sim_pre_step
P0 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block0 proc0 -> proc3, chunk ic=0
P0 src/comm_plasma.c:960 : [33488896] Receiving r0.block0 proc3 -> proc0 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block0 proc0 -> proc3, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:920 : No more blocks to send to dst=3 chunk ic=0
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block0 proc0 -> proc1, chunk ic=0
P3 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block0 proc3 -> proc2, chunk ic=0
P3 src/comm_plasma.c:960 : [33488896] Receiving r0.block0 proc2 -> proc3 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block0 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block1 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block0 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block1 proc0 -> proc1 into ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block1 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block2 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block1 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block2 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block2 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block3 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block2 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block3 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block3 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block4 proc0 -> proc1 into ic=0
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block0 proc2 -> proc1, chunk ic=0
P2 src/comm_plasma.c:960 : [33488896] Receiving r0.block0 proc1 -> proc2 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block3 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block4 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block4 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block5 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block4 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block5 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block5 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block6 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block5 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block6 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block6 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block7 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block6 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block7 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block7 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block8 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block7 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block8 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block8 proc0 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r0.block9 proc0 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block8 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block9 proc0 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r0.block9 proc0 -> proc1, ic=0 COMPLETED b->n=119
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block0 proc2 -> proc1 into ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block0 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block1 proc2 -> proc1 into ic=0
P0 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block9 proc0 -> proc1, chunk ic=0 COMPLETED!
P0 src/comm_plasma.c:920 : No more blocks to send to dst=1 chunk ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block0 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block1 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block1 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block2 proc2 -> proc1 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block1 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block2 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block2 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block3 proc2 -> proc1 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block2 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block3 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block3 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block3 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block4 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block4 proc2 -> proc1 into ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block4 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block5 proc2 -> proc1 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block4 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block5 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block5 proc2 -> proc1, ic=0 COMPLETED b->n=1024
P1 src/comm_plasma.c:960 : [33488896] Receiving r1.block6 proc2 -> proc1 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block5 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q0[Y].block6 proc2 -> proc1, chunk ic=0
P1 src/comm_plasma.c:963 : [33488896] Received r1.block6 proc2 -> proc1, ic=0 COMPLETED b->n=182
P2 src/comm_plasma.c:918 : [33488896] Sending q0[Y].block6 proc2 -> proc1, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:920 : No more blocks to send to dst=1 chunk ic=0
P2 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block0 proc2 -> proc3, chunk ic=0
P3 src/comm_plasma.c:963 : [33488896] Received r0.block0 proc2 -> proc3, ic=0 COMPLETED b->n=1024
P2 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block0 proc2 -> proc3, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block1 proc2 -> proc3, chunk ic=0
P3 src/comm_plasma.c:960 : [33488896] Receiving r0.block1 proc2 -> proc3 into ic=0
P3 src/comm_plasma.c:963 : [33488896] Received r0.block1 proc2 -> proc3, ic=0 COMPLETED b->n=1024
P3 src/comm_plasma.c:960 : [33488896] Receiving r0.block2 proc2 -> proc3 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block1 proc2 -> proc3, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block2 proc2 -> proc3, chunk ic=0
P3 src/comm_plasma.c:963 : [33488896] Received r0.block2 proc2 -> proc3, ic=0 COMPLETED b->n=1024
P3 src/comm_plasma.c:960 : [33488896] Receiving r0.block3 proc2 -> proc3 into ic=0
P2 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block2 proc2 -> proc3, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:915 : [33488896] Sending q1[Y].block3 proc2 -> proc3, chunk ic=0
P3 src/comm_plasma.c:963 : [33488896] Received r0.block3 proc2 -> proc3, ic=0 COMPLETED b->n=47
P3 src/comm_plasma.c:960 : [33488896] Receiving r1.block0 proc0 -> proc3 into ic=0
P3 src/comm_plasma.c:963 : [33488896] Received r1.block0 proc0 -> proc3, ic=0 COMPLETED b->n=0
P2 src/comm_plasma.c:918 : [33488896] Sending q1[Y].block3 proc2 -> proc3, chunk ic=0 COMPLETED!
P2 src/comm_plasma.c:920 : No more blocks to send to dst=3 chunk ic=0
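
One possible reading of the log above: when the output stops, four operations have started but never report COMPLETED, and together they look like a circular wait. This is only a sketch of that reading. It assumes a send cannot complete until the matching receive is posted, and that each process issues its sends and its receives in a fixed order; neither is confirmed in this thread. The names in_wait_cycle and log_waits_on are hypothetical and do not come from cpic.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Wait-for edges as read from the log (an interpretation, not the author's
 * diagnosis). log_waits_on[i] is the rank that process i appears to be
 * blocked on when the output stops:
 *   P0: receive r0 from proc3 never completes
 *   P1: send q0 to proc0 never completes
 *   P2: receive r0 from proc1 never completes
 *   P3: send q0 to proc2 never completes
 */
const int log_waits_on[4] = { 3, 0, 1, 2 };

/*
 * Follow the wait-for chain from `start`. Returns true if the chain comes
 * back to `start` within n hops, i.e. `start` is part of a cycle.
 * An entry of -1 (or any out-of-range value) means "not blocked".
 */
bool in_wait_cycle(const int waits_on[], size_t n, int start)
{
    if (start < 0 || (size_t)start >= n)
        return false;

    int cur = start;
    for (size_t hop = 0; hop < n; hop++) {
        cur = waits_on[cur];
        if (cur < 0 || (size_t)cur >= n)
            return false;   /* chain ends at a process that can progress */
        if (cur == start)
            return true;    /* came back around: circular wait */
    }
    return false;           /* start feeds into a cycle it is not part of */
}
```

With the table above, in_wait_cycle(log_waits_on, 4, rank) returns true for every rank, following the chain 0 -> 3 -> 2 -> 1 -> 0. That would be consistent with process 1 having finished receiving while its first send is still pending.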