ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/tcp: hanging issues when running Intel-MPI U6 with tcp provider #5923

Closed zhngaj closed 3 years ago

zhngaj commented 4 years ago

Hi,

We saw a few hangs when running IMB with Intel MPI U6 and its internal libfabric tcp provider (1152 processes = 32 nodes * 36 processes per node).

[ec2-user@ip-172-31-11-68 fsx]$ mpirun -n 1152 -f /fsx/hosts.file  /fsx/intel-mpi-benchmarks/IMB-MPI1 PingPong -npmin 1152
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 5, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu May  7 00:40:59 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.165-103.209.amzn1.x86_64
# Version               : #1 SMP Sun Feb 9 00:23:26 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /fsx/intel-mpi-benchmarks/IMB-MPI1 PingPong -npmin 1152

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
(... hanging)
[ec2-user@ip-172-31-11-68 intel-mpi-benchmarks]$ mpirun -n 1152 -f /fsx/hosts.file  /fsx/intel-mpi-benchmarks/IMB-NBC Ibcast
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 5, MPI-NBC part
#------------------------------------------------------------
# Date                  : Thu May  7 00:48:38 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.165-103.209.amzn1.x86_64
# Version               : #1 SMP Sun Feb 9 00:23:26 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /fsx/intel-mpi-benchmarks/IMB-NBC Ibcast

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Ibcast
(.. hanging)
zhngaj commented 4 years ago

Adding the stack trace of the IMB-MPI1 PingPong hang for reference.

(gdb) bt
#0  0x00007ff4cdec8a5d in recv () from /lib64/libpthread.so.0
#1  0x00007ff3f9fc8292 in tcpx_read_to_buffer ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#2  0x00007ff3f9fc7592 in tcpx_ep_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#3  0x00007ff3f9fc7309 in tcpx_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#4  0x00007ff3f9fd989d in ofi_cq_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#5  0x00007ff3f9fda54b in ofi_cq_readfrom ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#6  0x00007ff3f9b37f0e in rxm_ep_do_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#7  0x00007ff3f9b39e19 in rxm_ep_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#8  0x00007ff3f9b4fe6d in ofi_cq_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#9  0x00007ff3f9b50b1b in ofi_cq_readfrom ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#10 0x00007ff4cea67c7d in MPIDI_NM_progress_impl (vci=<optimized out>, blocking=<optimized out>)
    at ../../src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:39
#11 MPIDI_NM_progress (vci=428, blocking=36615584) at ../../src/mpid/ch4/netmod/ofi/util.c:26
#12 0x00007ff4ce95db8e in PMPI_Recv (buf=0x1ac, count=36615584, datatype=512, source=-840136099, tag=0, comm=0,
    status=0x7ffd10c2ed80) at ../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:266
#13 0x000000000045159a in IMB_init_communicator ()
#14 0x000000000042fca5 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)0>, &IMB_pingpong>::run(scope_item const&) ()
#15 0x0000000000405473 in main ()
bsbernd commented 4 years ago

Is this the only thread? The socket is set to non-blocking - are you sure it hangs there and not in a different thread?
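
For context: if O_NONBLOCK is set on the socket, recv() returns -1 with errno set to EAGAIN/EWOULDBLOCK instead of blocking, so a backtrace that repeatedly shows recv() usually means the progress loop is spinning rather than the call being stuck. A minimal standalone sketch (not libfabric code) of checking the flag with fcntl():

```c
/* Minimal sketch, not libfabric code: inspect whether a socket fd has
 * O_NONBLOCK set. On a non-blocking socket, recv() returns -1 with
 * errno == EAGAIN/EWOULDBLOCK rather than blocking. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int is_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);	/* read the file status flags */
	return flags < 0 ? -1 : !!(flags & O_NONBLOCK);
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return 1;

	printf("before: nonblocking=%d\n", is_nonblocking(fd));
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
	printf("after:  nonblocking=%d\n", is_nonblocking(fd));
	close(fd);
	return 0;
}
```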

zhngaj commented 4 years ago

Here is a more verbose stack trace of the IMB-MPI1 PingPong hang, produced with Intel MPI's debug mode.

(gdb) bt
#0  0x00007fd80b22ea5d in recv () from /lib64/libpthread.so.0
#1  0x00007fd739f22292 in tcpx_read_to_buffer ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#2  0x00007fd739f21592 in tcpx_ep_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#3  0x00007fd739f21309 in tcpx_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#4  0x00007fd739f3389d in ofi_cq_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#5  0x00007fd739f3454b in ofi_cq_readfrom ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/libtcp-fi.so
#6  0x00007fd73a164f0e in rxm_ep_do_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#7  0x00007fd73a166e19 in rxm_ep_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#8  0x00007fd73a17ce6d in ofi_cq_progress ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#9  0x00007fd73a17db1b in ofi_cq_readfrom ()
   from /opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/lib/prov/librxm-fi.so
#10 0x00007fd80c009b5b in fi_cq_read (cq=0xd48cb0, buf=0x7ffcdfa79cd8, count=1) at /usr/include/rdma/fi_eq.h:385
#11 0x00007fd80c015f85 in MPIDI_NM_progress_impl (vci=0, blocking=1)
    at ../../src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:39
#12 0x00007fd80c016334 in MPIDI_NM_progress (vci=0, blocking=1) at ../../src/mpid/ch4/netmod/ofi/util.c:26
#13 0x00007fd80bec11c3 in MPIDI_NM_mpi_recv (buf=0x7f20003a1308, count=286, datatype=1275069445, rank=7, tag=1000,
    comm=0x7fd80d3938e0 <MPIR_Comm_builtin>, context_offset=0, addr=0x7f2000380278, status=0x7ffcdfa7a700,
    request=0x7ffcdfa7a530) at ../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:266
#14 0x00007fd80bec36c8 in MPIDI_recv_unsafe (buf=0x7f20003a1308, count=286, datatype=1275069445, rank=7, tag=1000,
    comm=0x7fd80d3938e0 <MPIR_Comm_builtin>, context_offset=0, av=0x7f2000380278, status=0x7ffcdfa7a700,
    request=0x7ffcdfa7a530) at ../../src/mpid/ch4/src/ch4_recv.h:175
#15 0x00007fd80bec3ace in MPIDI_recv_safe (buf=0x7f20003a1308, count=286, datatype=1275069445, rank=7, tag=1000,
    comm=0x7fd80d3938e0 <MPIR_Comm_builtin>, context_offset=0, av=0x7f2000380278, status=0x7ffcdfa7a700,
    req=0x7ffcdfa7a530) at ../../src/mpid/ch4/src/ch4_recv.h:405
#16 0x00007fd80bec3d23 in MPID_Recv (buf=0x7f20003a1308, count=286, datatype=1275069445, rank=7, tag=1000,
    comm=0x7fd80d3938e0 <MPIR_Comm_builtin>, context_offset=0, status=0x7ffcdfa7a700, request=0x7ffcdfa7a530)
    at ../../src/mpid/ch4/src/ch4_recv.h:561
#17 0x00007fd80bec50ae in PMPI_Recv (buf=0x7f20003a1308, count=286, datatype=1275069445, source=7, tag=1000,
    comm=1140850688, status=0x7ffcdfa7a700) at ../../src/mpi/pt2pt/recv.c:135
#18 0x000000000045159a in IMB_init_communicator ()
#19 0x000000000042fca5 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)0>, &IMB_pingpong>::run(scope_item const&) ()
#20 0x0000000000405473 in main ()

Yes, this is the only thread we saw. We also ran with other providers and did not see this issue, so we believe it is a genuine hang.

zhngaj commented 4 years ago

Hello,

We attached gdb to the hanging process and ran `thread apply all bt`. We still see only one thread.

ddurnov commented 4 years ago

That is an Intel MPI-specific build. Do you see the issue with the public version of libfabric?

zhngaj commented 4 years ago

I switched to Intel MPI U7 + IMB 2019 U6 and tried both the internal libfabric and the public libfabric v1.10.1. The PingPong test still hangs for me.

Internal libfabric 1.10.0a1-impi

[ec2-user@ip-172-31-4-63 fsx]$ mpirun -n 1152 -f /fsx/hosts -env I_MPI_DEBUG=1 -env I_MPI_OFI_LIBRARY_INTERNAL=1 -env I_MPI_OFI_PROVIDER=tcp  /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu Jun 18 14:47:51 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.177-139.254.amzn2.x86_64
# Version               : #1 SMP Thu May 7 18:48:23 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/IMB-MPI1 PingPong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
<... hanging>

Public libfabric v1.10.1

[ec2-user@ip-172-31-4-63 fsx]$ mpirun -n 1152 -f /fsx/hosts -env I_MPI_DEBUG=1 -env I_MPI_OFI_LIBRARY_INTERNAL=0 -env I_MPI_OFI_PROVIDER=tcp  /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): libfabric version: 1.10.1
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu Jun 18 14:41:30 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.177-139.254.amzn2.x86_64
# Version               : #1 SMP Thu May 7 18:48:23 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/IMB-MPI1 PingPong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
<... hanging>
barisguler commented 4 years ago

Any fix for this?

shefty commented 4 years ago

In runs that reproduced the hang, the core problem was related to system configuration settings.

FI_PROVIDER=tcp mpirun -n 1792 -ppn 128 -f ~/hosts IMB-MPI1 pingpong

These were the sysctl settings I needed to adjust:

net.ipv4.tcp_syncookies = 0
net.core.somaxconn = 8192
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_syn_backlog = 8192

I also adjusted limits in /etc/security/limits.conf.

To run at 2688 ranks (192 per node), I used a custom version of libfabric:

> diff --git a/prov/tcp/src/tcpx_ep.c b/prov/tcp/src/tcpx_ep.c
> index 4e8d8da..40ff0f9 100644
> --- a/prov/tcp/src/tcpx_ep.c
> +++ b/prov/tcp/src/tcpx_ep.c
> @@ -666,7 +666,7 @@ static int tcpx_pep_listen(struct fid_pep *pep)
>
>         tcpx_pep = container_of(pep, struct tcpx_pep, util_pep.pep_fid);
>
> -       if (listen(tcpx_pep->sock, SOMAXCONN)) {
> +       if (listen(tcpx_pep->sock, 4096)) {
>                 FI_WARN(&tcpx_prov, FI_LOG_EP_CTRL,
>                         "socket listen failed\n");
>                 return -ofi_sockerr();
The patch above has been merged into master.
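
For illustration, the backlog argument to listen() bounds how many pending connections the kernel will queue for accept(), and Linux silently caps it at net.core.somaxconn, which is presumably why both the sysctl changes above and the larger backlog in the patch were needed at high rank counts. A generic standalone sketch (not the provider's code; the 4096 value simply mirrors the patch above):

```c
/* Hypothetical standalone sketch, not libfabric code: a passive TCP socket
 * with an explicit listen() backlog. Linux silently caps the backlog at
 * net.core.somaxconn, so the sysctl and the code value must both be large
 * enough to absorb a burst of simultaneous connects. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define LISTEN_BACKLOG 4096	/* mirrors the patch above */

int main(void)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return 1;

	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(0);	/* ephemeral port, just for the sketch */

	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) ||
	    listen(fd, LISTEN_BACKLOG)) {	/* was SOMAXCONN in the provider */
		perror("bind/listen");
		close(fd);
		return 1;
	}
	printf("listening with backlog %d\n", LISTEN_BACKLOG);
	close(fd);
	return 0;
}
```

The idea is that the accept queue has to absorb the burst that occurs when a large number of peers connect to the same listening endpoint at roughly the same time.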

shefty commented 4 years ago

Two different deadlock issues were found in the tcp provider; they are now fixed in upstream master. Without those fixes, tcp can hang, particularly when a connection is being shut down by a peer. For now, you would need a custom build of libfabric to see whether those fixes address the issue here.

Although the deadlock is reproducible with other apps, I'm not aware of it occurring with MPI. It is possible, however.

github-actions[bot] commented 3 years ago

There has been no activity on this issue for more than 360 days. Marking it stale.