ohpc-openeuler / ohpc

OpenHPC Integration, Packaging, and Test Repo
http://openhpc.community
Apache License 2.0

mumps rm_execution failed #83

Open · Yikun opened this issue 1 year ago

Yikun commented 1 year ago
1..5
not ok 1 [libs/Mumps] C (double precision) runs under resource manager (slurm/gnu12/openmpi4)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 26)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job.ohpc.4393
# Batch job 87 submitted
#
# Job 87 failed...
# Reason=NonZeroExitCode
#
# [prun] Master compute host = af75dbf2b42d
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpirun ./C_double null (family=openmpi4)
# [af75dbf2b42d:187729:0:187729] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa2b860)
# [af75dbf2b42d:187728:0:187728] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa2b860)
# ==== backtrace (tid: 187729) ====
#  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7ff5087ffa64]
#  1  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(+0x28c6f) [0x7ff5087ffc6f]
#  2  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(+0x28f56) [0x7ff5087fff56]
#  3  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x154) [0x7ff509126bc4]
#  4  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(+0x4e4c) [0x7ff509126e4c]
#  5  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7ff51aeb138c]
#  6  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x95) [0x7ff51b7ac715]
#  7  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_dup_with_info+0xdd) [0x7ff51b7a536d]
# ==== backtrace (tid: 187728) ====
#  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f9975a26a64]
#  1  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(+0x28c6f) [0x7f9975a26c6f]
#  2  /opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/libucs.so.0(+0x28f56) [0x7f9975a26f56]
#  3  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x154) [0x7f997634dbc4]
#  4  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(+0x4e4c) [0x7f997634de4c]
#  5  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7f99880d838c]
#  8  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Comm_dup+0x60) [0x7ff51b7e1f70]
#  9  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi_mpifh.so.40(PMPI_Comm_dup_f+0x2e) [0x7ff51b11eb1e]
# 10  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_+0x95) [0x7ff51babb9a5]
# 11  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_f77_+0x38d4) [0x7ff51bac4474]
# 12  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_c+0x935) [0x7ff51bab9f05]
# 13  ./C_double() [0x4010fd]
# 14  /usr/lib64/libc.so.6(+0x2d210) [0x7ff51b597210]
# 15  /usr/lib64/libc.so.6(__libc_start_main+0x7c) [0x7ff51b5972bc]
# 16  ./C_double() [0x401245]
# =================================
# [af75dbf2b42d:187729] *** Process received signal ***
#  6  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x95) [0x7f99889d3715]
#  7  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_dup_with_info+0xdd) [0x7f99889cc36d]
#  8  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Comm_dup+0x60) [0x7f9988a08f70]
#  9  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi_mpifh.so.40(PMPI_Comm_dup_f+0x2e) [0x7f9988345b1e]
# 10  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_+0x95) [0x7f9988ce29a5]
# 11  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_f77_+0x38d4) [0x7f9988ceb474]
# 12  /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_c+0x935) [0x7f9988ce0f05]
# 13  ./C_double() [0x4010fd]
# 14  /usr/lib64/libc.so.6(+0x2d210) [0x7f99887be210]
# 15  /usr/lib64/libc.so.6(__libc_start_main+0x7c) [0x7f99887be2bc]
# 16  ./C_double() [0x401245]
# =================================
# [af75dbf2b42d:187728] *** Process received signal ***
# [af75dbf2b42d:187729] Signal: Segmentation fault (11)
# [af75dbf2b42d:187729] Signal code:  (-6)
# [af75dbf2b42d:187729] Failing at address: 0x3e80002dd51
# [af75dbf2b42d:187728] Signal: Segmentation fault (11)
# [af75dbf2b42d:187728] Signal code:  (-6)
# [af75dbf2b42d:187728] Failing at address: 0x3e80002dd50
# [af75dbf2b42d:187728] [af75dbf2b42d:187729] [ 0] [ 0] /usr/lib64/libc.so.6(+0x41070)[0x7ff51b5ab070]
# [af75dbf2b42d:187729] [ 1] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x154)[0x7ff509126bc4]
# [af75dbf2b42d:187729] [ 2] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(+0x4e4c)[0x7ff509126e4c]
# [af75dbf2b42d:187729] [ 3] /usr/lib64/libc.so.6(+0x41070)[0x7f99887d2070]
# [af75dbf2b42d:187728] [ 1] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x154)[0x7f997634dbc4]
# [af75dbf2b42d:187728] [ 2] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/openmpi/mca_btl_vader.so(+0x4e4c)[0x7f997634de4c]
# [af75dbf2b42d:187728] [ 3] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7ff51aeb138c]
# [af75dbf2b42d:187729] [ 4] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f99880d838c]
# [af75dbf2b42d:187728] [ 4] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x95)[0x7ff51b7ac715]
# [af75dbf2b42d:187729] [ 5] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_nextcid+0x95)[0x7f99889d3715]
# [af75dbf2b42d:187728] [ 5] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_dup_with_info+0xdd)[0x7ff51b7a536d]
# [af75dbf2b42d:187729] [ 6] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(ompi_comm_dup_with_info+0xdd)[0x7f99889cc36d]
# [af75dbf2b42d:187728] [ 6] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Comm_dup+0x60)[0x7ff51b7e1f70]
# [af75dbf2b42d:187729] [ 7] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi.so.40(MPI_Comm_dup+0x60)[0x7f9988a08f70]
# [af75dbf2b42d:187728] [ 7] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi_mpifh.so.40(PMPI_Comm_dup_f+0x2e)[0x7ff51b11eb1e]
# [af75dbf2b42d:187729] [ 8] /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.4/lib/libmpi_mpifh.so.40(PMPI_Comm_dup_f+0x2e)[0x7f9988345b1e]
# [af75dbf2b42d:187728] [ 8] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_+0x95)[0x7ff51babb9a5]
# [af75dbf2b42d:187729] [ 9] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_f77_+0x38d4)[0x7ff51bac4474]
# [af75dbf2b42d:187729] [10] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_c+0x935)[0x7ff51bab9f05]
# [af75dbf2b42d:187729] [11] ./C_double[0x4010fd]
# [af75dbf2b42d:187729] [12] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_+0x95)[0x7f9988ce29a5]
# [af75dbf2b42d:187728] [ 9] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_f77_+0x38d4)[0x7f9988ceb474]
# [af75dbf2b42d:187728] [10] /usr/lib64/libc.so.6(+0x2d210)[0x7ff51b597210]
# [af75dbf2b42d:187729] [13] /opt/ohpc/pub/libs/gnu12/openmpi4/mumps/5.2.1/lib/libdmumps.so.5.0.0(dmumps_c+0x935)[0x7f9988ce0f05]
# [af75dbf2b42d:187728] [11] ./C_double[0x4010fd]
# [af75dbf2b42d:187728] [12] /usr/lib64/libc.so.6(__libc_start_main+0x7c)[0x7ff51b5972bc]
# [af75dbf2b42d:187729] /usr/lib64/libc.so.6(+0x2d210)[0x7f99887be210]
# [af75dbf2b42d:187728] [13] [14] ./C_double[0x401245]
# [af75dbf2b42d:187729] *** End of error message ***
# /usr/lib64/libc.so.6(__libc_start_main+0x7c)[0x7f99887be2bc]
# [af75dbf2b42d:187728] [14] ./C_double[0x401245]
# [af75dbf2b42d:187728] *** End of error message ***
# --------------------------------------------------------------------------
# Primary job  terminated normally, but 1 process returned
# a non-zero exit code. Per user-direction, the job has been aborted.
# --------------------------------------------------------------------------
# --------------------------------------------------------------------------
# mpirun noticed that process rank 1 with PID 187729 on node c0 exited on signal 11 (Segmentation fault).
# --------------------------------------------------------------------------
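The backtraces above show both ranks faulting inside the vader (shared-memory) BTL while MUMPS duplicates its communicator (dmumps_ -> PMPI_Comm_dup_f -> MPI_Comm_dup). One known trigger for this pattern when Open MPI 4.x runs inside an unprivileged container is the CMA single-copy mechanism, which needs ptrace permission between ranks. The snippet below is only a sketch of a workaround to try; the MCA parameter is a standard Open MPI 4.x knob, but whether the CI container (host af75dbf2b42d) actually hits the CMA restriction is an assumption, not something the log proves.

```sh
# Sketch: re-run the failing MUMPS test with vader's CMA single-copy path disabled.
# Assumption: the container lacks the ptrace capability CMA needs, which would
# explain the SIGSEGV in mca_btl_vader_poll_handle_frag during MPI_Comm_dup.
module load gnu12 openmpi4 mumps
export OMPI_MCA_btl_vader_single_copy_mechanism=none
prun ./C_double        # same launcher and binary as in the log above
```

If the job then completes, the parameter could be set in the test harness or the Slurm prolog instead of per shell.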
martin-g commented 1 year ago

./tests/libs/mumps/tests/family-gnu12-mvapich2/rm_execution.log on openEuler 22.03 aarch64:

1..5
not ok 1 [libs/Mumps] C (double precision) runs under resource manager (slurm/gnu12/mvapich2)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 26)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job..19126
# Batch job 400 submitted
#  
# Job 400 failed...
# Reason=NonZeroExitCode
#  
# [prun] Master compute host = ohpc-docker
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./C_double null (family=mvapich2)
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] **********************WARNING***********************
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] Failed to automatically detect the CPU architecture.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] This may lead to subpar communication performance.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] ****************************************************
# [34793baeb383:mpi_rank_2][rdma_open_hca] No HCAs found on the system.
# [34793baeb383:mpi_rank_1][rdma_open_hca] No HCAs found on the system.
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
# [cli_1]: aborting job:
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
not ok 2 [libs/Mumps] Fortran (single precision) runs under resource manager (slurm/gnu12/mvapich2)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 36)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job..25045
# Batch job 401 submitted
#  
# Job 401 failed...
# Reason=NonZeroExitCode
#  
# [prun] Master compute host = ohpc-docker
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./F_single null (family=mvapich2)
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] **********************WARNING***********************
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] Failed to automatically detect the CPU architecture.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] This may lead to subpar communication performance.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] ****************************************************
# [34793baeb383:mpi_rank_2][rdma_open_hca] No HCAs found on the system.
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
# [cli_2]: aborting job:
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
not ok 3 [libs/Mumps] Fortran (double precision) runs under resource manager (slurm/gnu12/mvapich2)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 46)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job..9794
# Batch job 402 submitted
#  
# Job 402 failed...
# Reason=NonZeroExitCode
#  
# [prun] Master compute host = ohpc-docker
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./F_double null (family=mvapich2)
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] **********************WARNING***********************
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] Failed to automatically detect the CPU architecture.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] This may lead to subpar communication performance.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] ****************************************************
# [34793baeb383:mpi_rank_2][rdma_open_hca] No HCAs found on the system.
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
# [cli_2]: aborting job:
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
not ok 4 [libs/Mumps] Fortran (complex) runs under resource manager (slurm/gnu12/mvapich2)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 56)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job..3970
# Batch job 403 submitted
#  
# Job 403 failed...
# Reason=NonZeroExitCode
#  
# [prun] Master compute host = ohpc-docker
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./F_complex null (family=mvapich2)
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] **********************WARNING***********************
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] Failed to automatically detect the CPU architecture.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] This may lead to subpar communication performance.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] ****************************************************
# [34793baeb383:mpi_rank_0][rdma_open_hca] No HCAs found on the system.
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
# [cli_0]: aborting job:
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
not ok 5 [libs/Mumps] Fortran (double complex) runs under resource manager (slurm/gnu12/mvapich2)
# (from function `run_mpi_binary' in file ./common/functions, line 388,
#  in test file rm_execution, line 66)
#   `run_mpi_binary $EXE $ARGS $NODES $TASKS' failed
# job script = /tmp/job..8621
# Batch job 404 submitted
#  
# Job 404 failed...
# Reason=NonZeroExitCode
#  
# [prun] Master compute host = ohpc-docker
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./F_doublecomplex null (family=mvapich2)
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] **********************WARNING***********************
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] Failed to automatically detect the CPU architecture.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] This may lead to subpar communication performance.
# [34793baeb383:mpi_rank_0][mv2_get_arch_type] ****************************************************
# [34793baeb383:mpi_rank_0][rdma_open_hca] No HCAs found on the system.
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
# [cli_0]: aborting job:
# Fatal error in MPI_Init:
# Other MPI error, error stack:
# MPIR_Init_thread(493)............: 
# MPID_Init(419)...................: channel initialization failed
# MPIDI_CH3_Init(515)..............: rdma_get_control_parameters
# rdma_get_control_parameters(2023): rdma_open_hca
# rdma_open_hca(1091)..............: No IB device found
# 
FAIL rm_execution (exit status: 1)
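The mvapich2-family failures look different from the openmpi4 one: every rank aborts in MPI_Init because rdma_open_hca finds no HCA, which is expected inside a Docker container with no InfiniBand device, since the OpenHPC mvapich2 builds default to the IB channel. A minimal pre-flight check along these lines (a sketch, not part of the test suite) could decide whether the mvapich2 family should be exercised on a given host:

```sh
# Sketch: detect whether an InfiniBand HCA is visible before running the
# mvapich2 family; without one, MPI_Init aborts with "No IB device found"
# exactly as in the log above.
if [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    echo "IB device present: the mvapich2 tests can run"
else
    echo "no IB device: skip the mvapich2 family here (e.g. test only gnu12/openmpi4)"
fi
```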