open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Open MPI main branch fails reduce_big_in_place test #11799

Closed wzamazon closed 1 year ago

wzamazon commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

through mtt

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 22fe51cb7a961b6060fc5c48e659237cbe162566 3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
 ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

Error message:


[queue-c5n18xlarge-st-c5n18xlarge-1:44025] *** Process received signal ***
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Signal: Segmentation fault (11)
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Signal code: Address not mapped (1)
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] Failing at address: 0x10001
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7ffad34d68e0]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 1] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x37695a)[0x7ffad3a5995a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 2] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x190493)[0x7ffad3873493]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 3] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_base_reduce_generic+0x51b)[0x7ffad3873c6f]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 4] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_base_reduce_intra_binomial+0x187)[0x7ffad387490b]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 5] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_do_this+0x1ca)[0x7ffad3899dec]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 6] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x46e)[0x7ffad3891c75]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 7] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x1eea63)[0x7ffad38d1a63]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 8] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(+0x1ec73c)[0x7ffad38cf73c]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [ 9] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(mca_coll_han_reduce_intra+0x1a7c)[0x7ffad38d12ef]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [10] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(mca_coll_han_reduce_intra_dynamic+0x3f8)[0x7ffad38ecf67]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [11] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/MiddlewareBuild_OMPI/lib/libmpi.so.0(MPI_Reduce+0x48d)[0x7ffad3805d10]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [12] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/TestGet_IBM/ompi-tests/ibm/collective/reduce_big_in_place[0x400ef8]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7ffad313913a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] [14] /home/ec2-user/PortaFiducia/workloads/mtt/job/rizuka/scratch/TestGet_IBM/ompi-tests/ibm/collective/reduce_big_in_place[0x400d2a]
[queue-c5n18xlarge-st-c5n18xlarge-1:44025] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 44025 on node queue-c5n18xlarge-st-c5n18xlarge-1 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

wzamazon commented 1 year ago

MTT link:

https://mtt.open-mpi.org/index.php?limit=&wrap=&trial=&enable_drilldowns=&yaxis_scale=&xaxis_scale=&hide_subtitle=&split_graphs=&remote_go=&do_cookies=&phase=test_run&text_start_timestamp=2023-07-04+14%3A48%3A40+-+2023-07-05+14%3A48%3A40&text_platform_hardware=%5Ex86_64%24&show_platform_hardware=show&text_os_name=%5ELinux%24&show_os_name=show&text_mpi_name=%5Eompi-nightly-main%24&show_mpi_name=show&text_mpi_version=%5Emain-202306290241-1a73735%24&show_mpi_version=show&text_suite_name=%5ETestBuild%3AIBMInstalled%24&show_suite_name=show&text_test_name=reduce_big_in_place&show_test_name=hide&text_np=%5E144%24&show_np=show&text_full_command=&show_full_command=show&text_http_username=%5Eamazon%24&show_http_username=show&text_local_username=all&show_local_username=hide&text_platform_name=%5Eaws-amazonlinux-slurm-efa-installer%24&show_platform_name=show&click=Detail&phase=test_run&test_result=_run_f&text_compute_cluster_id=&text_os_version=&show_os_version=&text_platform_type=&show_platform_type=&text_submit_id=&text_hostname=&show_hostname=&text_mpi_get_id=&text_mpi_install_compiler_id=&text_mpi_install_configure_id=&text_mpi_install_id=&text_compiler_name=&show_compiler_name=&text_compiler_version=&show_compiler_version=&text_vpath_mode=&show_vpath_mode=&text_endian=&show_endian=&text_bitness=&show_bitness=&text_configure_arguments=&text_exit_value=&show_exit_value=&text_exit_signal=&show_exit_signal=&text_duration=&show_duration=&text_client_serial=&show_client_serial=&text_result_message=&text_result_stdout=&text_result_stderr=&text_environment=&text_description=&text_test_build_id=&text_test_build_compiler_id=&text_test_run_id=&text_test_name_id=&text_test_run_command_id=&text_test_suite_id=&text_performance_id=&text_launcher=&show_launcher=&text_resource_mgr=&show_resource_mgr=&text_network=&show_network=&text_parameters=&show_parameters=&lastgo=summary

wzamazon commented 1 year ago

https://github.com/open-mpi/ompi/pull/11800 fixes this.

wzamazon commented 1 year ago

The PR has been merged and backported.