Closed wzamazon closed 1 year ago
Thank you for taking the time to submit an issue!
Open MPI version: main branch
Installation method: through mtt
Output of git submodule status:
22fe51cb7a961b6060fc5c48e659237cbe162566 3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
error message
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] *** Process received signal ***
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Signal: Segmentation fault (11)
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Signal code: Address not mapped (1)
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Failing at address: 0x10001
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7ffad34d68e0]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 1] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x37695a)[0x7ffad3a5995a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 2] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x190493)[0x7ffad3873493]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 3] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_base_reduce_generic+0x51b)[0x7ffad3873c6f]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 4] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_base_reduce_intra_binomial+0x187)[0x7ffad387490b]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 5] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_do_this+0x1ca)[0x7ffad3899dec]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 6] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x46e)[0x7ffad3891c75]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 7] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x1eea63)[0x7ffad38d1a63]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 8] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x1ec73c)[0x7ffad38cf73c]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 9] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(mca_coll_han_reduce_intra+0x1a7c)[0x7ffad38d12ef]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [10] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(mca_coll_han_reduce_intra_dynamic+0x3f8)[0x7ffad38ecf67]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [11] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(MPI_Reduce+0x48d)[0x7ffad3805d10]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [12] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/TestGet_IBM/ompi-tests/ibm/collective/reduce_big_in_place[0x400ef8]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7ffad313913a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [14] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/TestGet_IBM/ompi-tests/ibm/collective/reduce_big_in_place[0x400d2a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 44025 on node queue-c5n18xlarge-st-c5n18xlarge-1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
mtt link:
https://mtt.open-mpi.org/index.php?limit=&wrap=&trial=&enable_drilldowns=&yaxis_scale=&xaxis_scale=&hide_subtitle=&split_graphs=&remote_go=&do_cookies=&phase=test_run&text_start_timestamp=2023-07-04+14%3A48%3A40+-+2023-07-05+14%3A48%3A40&text_platform_hardware=%5Ex86_64%24&show_platform_hardware=show&text_os_name=%5ELinux%24&show_os_name=show&text_mpi_name=%5Eompi-nightly-main%24&show_mpi_name=show&text_mpi_version=%5Emain-202306290241-1a73735%24&show_mpi_version=show&text_suite_name=%5ETestBuild%3AIBMInstalled%24&show_suite_name=show&text_test_name=reduce_big_in_place&show_test_name=hide&text_np=%5E144%24&show_np=show&text_full_command=&show_full_command=show&text_http_username=%5Eamazon%24&show_http_username=show&text_local_username=all&show_local_username=hide&text_platform_name=%5Eaws-amazonlinux-slurm-efa-installer%24&show_platform_name=show&click=Detail&phase=test_run&test_result=_run_f&text_compute_cluster_id=&text_os_version=&show_os_version=&text_platform_type=&show_platform_type=&text_submit_id=&text_hostname=&show_hostname=&text_mpi_get_id=&text_mpi_install_compiler_id=&text_mpi_install_configure_id=&text_mpi_install_id=&text_compiler_name=&show_compiler_name=&text_compiler_version=&show_compiler_version=&text_vpath_mode=&show_vpath_mode=&text_endian=&show_endian=&text_bitness=&show_bitness=&text_configure_arguments=&text_exit_value=&show_exit_value=&text_exit_signal=&show_exit_signal=&text_duration=&show_duration=&text_client_serial=&show_client_serial=&text_result_message=&text_result_stdout=&text_result_stderr=&text_environment=&text_description=&text_test_build_id=&text_test_build_compiler_id=&text_test_run_id=&text_test_name_id=&text_test_run_command_id=&text_test_suite_id=&text_performance_id=&text_launcher=&show_launcher=&text_resource_mgr=&show_resource_mgr=&text_network=&show_network=&text_parameters=&show_parameters=&lastgo=summary
https://github.com/open-mpi/ompi/pull/11800 fixes this.
The PR has been merged and backported.