Open jiandewang opened 3 months ago
I am going to open a HERA ticket to see whether the system administrators (SA) can give us a clue.
From the HERA SA: based on the error output, it looks like your job is trying to use TCP rather than IB/RDMA. I think this could be part of the problem.
Looking at your job stack, it looks like you are using a custom install of openmpi. I saw from "ompi_info" that it looks like it was built with "--without-verbs". This might be normal, but typically for IB to work you want verbs.
We believe the first step is to look at your openmpi stack, as it does not appear to be set up to use IB.
@jkbk2004 @RatkoVasic-NOAA FYI
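The SA's observation about "--without-verbs" can be checked with a few lines of Python. This is only a minimal sketch: the helper name `verbs_configure_flags` is hypothetical, and it assumes the text passed in is whatever "ompi_info" printed for the Open MPI install in question (the sample string in the usage note is made up for illustration, not copied from HERA).

```python
import re

# Hypothetical helper: scan captured ompi_info output for verbs-related
# configure flags such as --with-verbs, --with-verbs=/path or --without-verbs.
VERBS_FLAG = re.compile(r"--with(?:out)?-verbs(?:=[^\s'\"]+)?")


def verbs_configure_flags(ompi_info_text):
    """Return the sorted, de-duplicated verbs flags found in the text."""
    return sorted(set(VERBS_FLAG.findall(ompi_info_text)))


def built_without_verbs(ompi_info_text):
    """True if the text explicitly mentions --without-verbs."""
    return "--without-verbs" in verbs_configure_flags(ompi_info_text)
```

For example, `built_without_verbs("Configure command line: '--prefix=/opt/ompi' '--without-verbs'")` is true for this illustrative string. On a real system the text could be captured with `subprocess.run(["ompi_info"], capture_output=True, text=True).stdout`, assuming "ompi_info" from the same install is on the PATH. An empty result only means no verbs flag was given explicitly; it does not by itself show that IB is in use.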
I think it might be an openmpi issue with gnu. The s2sw_pdlib_debug_gnu/cpld_debug_pdlib_p8 job is hanging.
25: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
25: should not happen.
25:
25: Your Open MPI job may now hang or fail.
25:
25: Local host: h1c10
25: PID: 145750
25: Message: connect() to 10.184.4.41:1061 failed
25: Error: Resource temporarily unavailable (11)
25: --------------------------------------------------------------------------
18: --------------------------------------------------------------------------
18: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
18: should not happen.
18:
18: Your Open MPI job may now hang or fail.
18:
18: Local host: h1c10
18: PID: 145743
18: Message: connect() to 10.184.4.41:1031 failed
18: Error: Resource temporarily unavailable (11)
18: --------------------------------------------------------------------------
32: [h1c10:145757] 10 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
32: [h1c10:145757] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
0: slurmstepd: error: *** STEP 64288223.0 ON h1c10 CANCELLED AT 2024-08-01T04:56:25 DUE TO TIME LIMIT ***
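To see whether the failures cluster on particular peers, the err file can be summarized with a short script. This is a rough sketch, not part of the regression-test tooling: the function name `summarize_tcp_failures` is hypothetical, and it assumes every line carries a leading "N:" task label as in the output above and that the label identifies the reporting MPI task.

```python
import re
from collections import Counter

# Assumed line layout: "<task label>: <message text>", as in the log above.
LABELLED_LINE = re.compile(r"^\s*(\d+):\s?(.*)$")
LOCAL_HOST = re.compile(r"Local host:\s*(\S+)")
CONNECT_FAIL = re.compile(r"connect\(\) to (\d+\.\d+\.\d+\.\d+):(\d+) failed")


def summarize_tcp_failures(log_text):
    """Return one dict per 'connect() ... failed' message found in the log."""
    host_of_task = {}  # last "Local host" seen for each task label
    failures = []
    for line in log_text.splitlines():
        labelled = LABELLED_LINE.match(line)
        if not labelled:
            continue
        task, body = int(labelled.group(1)), labelled.group(2)
        host = LOCAL_HOST.search(body)
        if host:
            host_of_task[task] = host.group(1)
            continue
        failed = CONNECT_FAIL.search(body)
        if failed:
            failures.append({
                "task": task,
                "local_host": host_of_task.get(task),
                "peer_ip": failed.group(1),
                "peer_port": int(failed.group(2)),
            })
    return failures


def failures_per_peer(failures):
    """Count failed connections per peer IP address."""
    return Counter(f["peer_ip"] for f in failures)
```

Run on the excerpt above, this reports two failures, from tasks 25 and 18 on h1c10, both trying to reach 10.184.4.41. Because the log aggregates help messages ("10 more processes have sent help message"), the count is a lower bound unless "orte_base_help_aggregate" is set to 0 as the log suggests.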
@RatkoVasic-NOAA @ulmononian we need a plan to resolve this issue. The problem on Hercules is a different one, but openmpi used to cause issues on Hercules as well.
It looks like a similar issue has been reported: https://github.com/open-mpi/ompi/issues/11508
More information on a TCP/OMPI issue: https://github.com/open-mpi/ompi/issues/10734
Description
During my testing of new MOM6 code in UWM I found the s2sw_pdlib_debug gnu job hanging on HERA, but the hang is not reproducible. So I went back to the current UWM version, cloned this morning:
commit b5a1976012b66352f403588f95e639c6827b97d4 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dusan Jovic 48258889+DusanJovic-NOAA@users.noreply.github.com
Date: Tue Jul 30 07:17:15 2024 -0400
but I found the same situation. I repeated the job 10 times: 5 runs succeeded while the other 5 hung and timed out. I also repeated it 10 times on Hercules and all of those runs were fine.
I have a strong feeling that this issue is related to the machine rather than to the code or resource settings, but I don't know how to prove it. I tested this job many times before trying today's UWM develop branch. What I found is that the result depends heavily on when I submit the job. There were times when all 20 of my tries were OK, and times when more than half of my jobs hung.
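One rough way to put a number on the "machine rather than code" impression is a one-sided Fisher exact test on the counts quoted above (5 hangs in 10 runs on HERA, 0 in 10 on Hercules). This is only a sketch: the function name is hypothetical, and it assumes the runs are independent, which may not hold if hangs depend on when the job is submitted.

```python
from math import comb


def fisher_one_sided(fail_a, runs_a, fail_b, runs_b):
    """P(machine A sees at least fail_a of the failures) under the null
    hypothesis that every run is equally likely to fail on either machine."""
    total_fail = fail_a + fail_b
    total_runs = runs_a + runs_b
    denom = comb(total_runs, total_fail)
    max_a = min(runs_a, total_fail)
    # Hypergeometric tail: sum over all splits at least as extreme as observed.
    tail = sum(
        comb(runs_a, k) * comb(runs_b, total_fail - k)
        for k in range(fail_a, max_a + 1)
    )
    return tail / denom


# Counts from the description: HERA 5 hangs / 10 runs, Hercules 0 / 10.
p_value = fisher_one_sided(5, 10, 0, 10)
print(f"one-sided p-value: {p_value:.4f}")
```

With these counts the p-value is 252/15504, about 0.016, so a split this lopsided would be unlikely if both machines had the same hang rate. It says nothing about the cause, only that the difference between machines is probably not chance.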
To Reproduce:
Clone the latest UWM, run the s2sw_pdlib_debug gnu job, and repeat it several times. Some runs will finish fine but some will time out.
Additional context
error information:
180: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
180: should not happen.
180:
180: Your Open MPI job may now hang or fail.
180:
180: Local host: h4c43
180: PID: 3159416
180: Message: connect() to 10.184.4.51:1034 failed
180: Error: Resource temporarily unavailable (11)
Output
see HERA /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_324380-HEAD-pdlib/cpld_debug_pdlib_p8_gnu/err