openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

[mtt] hang in several tests on jazz (roce) #3708

Closed: amaslenn closed this issue 5 years ago

amaslenn commented 5 years ago

Configuration

OMPI: 4.0.2a1
MOFED: MLNX_OFED_LINUX-4.5-1.0.1.0
Module: hpcx-gcc (2019-06-15)
Test module: mtt-tests/hpcx-gcc
Nodes: jazz x16 (ppn=28(x16), nodelist=jazz[05-06,09-14,17-24])

MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20190615_192018_4712_23348_jazz05/html/test_stdout_oOv85e.txt

This doesn't always reproduce; the test has to be run in a loop. Could not reproduce with the debug env module.

Cmd: mpirun -np 60 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_3:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,self,sm -x UCX_DC_MLX5_TX_POLICY=rand -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /hpc/local/benchmarks/hpcx_install_2019-06-15/mtt-tests-gcc/installs/AXqd/tests/intel/intel_tests/src/MPI_Alltoallv_f
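
Since the hang only shows up intermittently, a wrapper loop can be used to catch it (a minimal sketch, assuming bash; the 300s timeout and the iteration count are arbitrary and should be sized to the test's normal runtime):

for i in $(seq 1 100); do
    # <Cmd from above> stands for the full mpirun command line
    timeout 300 <Cmd from above> || { echo "iteration $i: hang (timeout) or failure"; break; }
done

Here timeout turns a hang into a non-zero exit status, so the loop stops on the first bad run.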

Output:

...
node=jazz05, pid=61967:
Thread 4 (Thread 0x7f5d82bf5700 (LWP 61971)):
#0  0x00007f5d85c32923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f5d855f0f53 in epoll_dispatch (base=0x134c5e0, tv=<optimized out>) at epoll.c:407
#2  0x00007f5d855f49a0 in opal_libevent2022_event_base_loop (base=0x134c5e0, flags=flags@entry=1) at event.c:1630
#3  0x00007f5d855afffe in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f5d85f04e25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f5d85c3234d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7f5d7fc0e700 (LWP 61974)):
#0  0x00007f5d85c32923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f5d855f0f53 in epoll_dispatch (base=0x13a67a0, tv=<optimized out>) at epoll.c:407
#2  0x00007f5d855f49a0 in opal_libevent2022_event_base_loop (base=0x13a67a0, flags=flags@entry=1) at event.c:1630
#3  0x00007f5d81b3a8ee in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007f5d85f04e25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f5d85c3234d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7f5d754c2700 (LWP 61996)):
#0  0x00007f5d85c32923 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f5d7c84cf2c in ucs_async_thread_func (arg=0x1448960) at async/thread.c:93
#2  0x00007f5d85f04e25 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007f5d85c3234d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7f5d874b1740 (LWP 61967)):
#0  0x00007f5d7d1d3f10 in ucp_worker_progress@plt () from /hpc/local/benchmarks/hpcx_install_2019-06-15/hpcx-gcc-redhat7.4/ompi/lib/openmpi/mca_pml_ucx.so
#1  0x00007f5d7d1d4717 in mca_pml_ucx_progress () at pml_ucx.c:510
#2  0x00007f5d855aa17c in opal_progress () at runtime/opal_progress.c:231
#3  0x00007f5d865b6945 in ompi_request_wait_completion (req=0x14e0de8) at ../ompi/request/request.h:415
#4  ompi_request_default_wait (req_ptr=0x7fff795b63b0, status=0x7fff795b63c0) at request/req_wait.c:42
#5  0x00007f5d865f11e9 in ompi_coll_base_sendrecv_actual (sendbuf=0x61de01 <send_buffer.2004+33>, scount=0, sdatatype=sdatatype@entry=0x7f5d8683f8e0 <ompi_mpi_integer1>, dest=22, stag=stag@entry=-14, recvbuf=<optimized out>, rcount=0, rdatatype=rdatatype@entry=0x7f5d8683f8e0 <ompi_mpi_integer1>, source=source@entry=42, rtag=rtag@entry=-14, comm=comm@entry=0x7f5d8684bcd0 <ompi_mpi_comm_world>, status=status@entry=0x0) at base/coll_base_util.c:59
#6  0x00007f5d865f6913 in ompi_coll_base_sendrecv (stag=-14, rtag=-14, status=0x0, myid=32, comm=0x7f5d8684bcd0 <ompi_mpi_comm_world>, source=42, rdatatype=0x7f5d8683f8e0 <ompi_mpi_integer1>, rcount=<optimized out>, recvbuf=<optimized out>, dest=22, sdatatype=0x7f5d8683f8e0 <ompi_mpi_integer1>, scount=<optimized out>, sendbuf=<optimized out>) at base/coll_base_util.h:67
#7  ompi_coll_base_alltoallv_intra_pairwise (sbuf=0x61dde0 <send_buffer.2004>, scounts=0x7fff795b7230, sdisps=0x7fff795b7030, sdtype=0x7f5d8683f8e0 <ompi_mpi_integer1>, rbuf=0x63ede0 <recv_buffer.1997>, rcounts=0x7fff795b7630, rdisps=0x7fff795b7430, rdtype=0x7f5d8683f8e0 <ompi_mpi_integer1>, comm=0x7f5d8684bcd0 <ompi_mpi_comm_world>, module=0x14cecf0) at base/coll_base_alltoallv.c:162
#8  0x00007f5d865c92c1 in PMPI_Alltoallv (sendbuf=sendbuf@entry=0x61dde0 <send_buffer.2004>, sendcounts=sendcounts@entry=0x7fff795b7230, sdispls=sdispls@entry=0x7fff795b7030, sendtype=sendtype@entry=0x7f5d8683f8e0 <ompi_mpi_integer1>, recvbuf=<optimized out>, recvcounts=recvcounts@entry=0x7fff795b7630, rdispls=rdispls@entry=0x7fff795b7430, recvtype=recvtype@entry=0x7f5d8683f8e0 <ompi_mpi_integer1>, comm=comm@entry=0x7f5d8684bcd0 <ompi_mpi_comm_world>) at palltoallv.c:129
#9  0x00007f5d868a10c8 in ompi_alltoallv_f (sendbuf=0x61dde0 <send_buffer.2004> " !  #  %  '  )  +  -  /  1  3  5  7  9  ;  =  ?  A  C  E  G  I  K  M  O  Q  S  U  W  Y  [", ' ' <repeats 111 times>..., sendcounts=0x7fff795b7230, sdispls=0x7fff795b7030, sendtype=<optimized out>, recvbuf=<optimized out>, recvcounts=0x7fff795b7630, rdispls=0x7fff795b7430, recvtype=0x61d3c0 <mpit_+576>, comm=0x7fff795b7a88, ierr=0x7fff795b7a60) at palltoallv_f.c:97
#10 0x0000000000402006 in MAIN__ () at MPI_Alltoallv_f.F:261
...
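
In the trace above, the auxiliary threads (the opal/pmix progress engines and the UCS async thread) are all idle in epoll_wait, while the main thread spins in ucp_worker_progress under ompi_request_wait_completion: rank 32 never sees its pairwise alltoallv sendrecv with ranks 22/42 complete. For reference, stacks like these can be dumped from a hung rank with gdb (a minimal sketch; assumes gdb is available on the node and the pid, e.g. 61967 above, is taken from ps or the launcher):

gdb -batch -p 61967 -ex 'thread apply all bt'

gdb attaches to the process, prints the backtrace of every thread, and detaches on exit.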

I saw similar hangs (couldn't reproduce) in at least one other test, atest: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20190617_051704_48747_23377_jazz09/html/test_stdout_i0t9L3.txt

Cmd: mpirun -np 56 --display-map -mca btl self --tag-output --timestamp-output -mca pml ucx -mca coll '^hcoll' --bind-to core -x UCX_NET_DEVICES=mlx5_3:1 -mca osc ucx -x UCX_IB_REG_METHODS=rcache,direct -x UCX_TLS=dc_x,self,sm -mca pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca pmix_base_collect_data 1 --map-by node --mca pmix_server_max_wait 8 /hpc/local/benchmarks/hpcx_install_2019-06-16/mtt-tests-gcc/installs/O7XR/tests/atest/mtt-tests.git/atest/src/atest -c 1,3 --test-cross 0

brminich commented 5 years ago

Duplicate of #2934: PI and CI (producer/consumer indexes) are different on some QPs.