open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Won't build openmpi-3.1.3 with no extra flags #6024

Closed ontheklaud closed 6 years ago

ontheklaud commented 6 years ago

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running

Details of the problem

[build procedure]
1. wget
2. tar xf openmpi-3.1.3.tar.gz
3. cd openmpi-3.1.3
4. ./configure (again, with no extra flags)
5. make -j16
6. Ta-da! with Segmentation Fault
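
For reference, the same procedure as a copy-pasteable shell session, with output captured to log files so the failing command is easy to find afterwards. The .tar.gz URL is an assumption on my part (the thread later quotes the matching .tar.bz2 location):

$ wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.gz
$ tar xf openmpi-3.1.3.tar.gz
$ cd openmpi-3.1.3
$ ./configure 2>&1 | tee configure.log     # no extra flags
$ make -j16 2>&1 | tee make.log            # fails; the first "Segmentation fault" line ends up in make.log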

[configure output]

config.status: executing libtool commands

Open MPI configuration:
/-----------------------
Version: 3.1.3
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
/-----------------------
CUDA support: no
PMIx support: internal

Transports
/-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel SCIF: no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
/-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
/-----------------------
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no

[tailed error]

Making all in mca/io/romio314
make[2]: Entering directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314'
Making all in romio
make[3]: Entering directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio'
make[4]: Entering directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio'
  CC       mpi-io/delete.lo
  CC       mpi-io/close.lo
  CC       mpi-io/fsync.lo
  CC       mpi-io/get_amode.lo
  CC       mpi-io/get_atom.lo
  CC       mpi-io/get_bytoff.lo
  CC       mpi-io/get_extent.lo
  CC       mpi-io/get_posn_sh.lo
  CC       mpi-io/get_info.lo
  CC       mpi-io/get_posn.lo
  CC       mpi-io/get_group.lo
  CC       mpi-io/get_size.lo
  CC       mpi-io/get_view.lo
  CC       mpi-io/iread.lo
  CC       mpi-io/iread_at.lo
  CC       mpi-io/iread_sh.lo
**/bin/sh: line 2: 20725 Segmentation fault      /bin/sh ./libtool --silent --tag=CC --mode=compile /home/xo/opt/gcc-8.2.0/bin/gcc -DHAVE_CONFIG_H** -I. -I./adio/include -DOMPI_BUILDING=1 -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/../../../../.. -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/../../../../../opal/include -I./../../../../../opal/include -I./../../../../../ompi/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/adio/include -I./include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/include -I./mpi-io -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/mpi-io -I./adio/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio/adio/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/hwloc/hwloc1117/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -D__EXTENSIONS__ -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -DHAVE_ROMIOCONF_H -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -I./include -MT mpi-io/delete.lo -MD -MP -MF $depbase.Tpo -c -o mpi-io/delete.lo mpi-io/delete.c
make[4]: *** [mpi-io/delete.lo] Error 139
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314/romio'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mca/io/romio314'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi'
make: *** [all-recursive] Error 1
jsquyres commented 6 years ago

Thanks for reporting the issue.

It looks like your compiler seg faulted while building the file ompi/mca/io/romio314/romio/mpi-io/delete.c.

This is not a problem with Open MPI per se, but rather a problem with your compiler.

Interestingly enough, this file did not change between the Open MPI 3.1.2 and 3.1.3 releases:

$ wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.bz2
...
$ tar xf openmpi-3.1.3.tar.bz2
$ wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.bz2
...
$ tar xf openmpi-3.1.2.tar.bz2
$ diff \
    openmpi-3.1.2/ompi/mca/io/romio314/romio/mpi-io/delete.c \
    openmpi-3.1.3/ompi/mca/io/romio314/romio/mpi-io/delete.c
$

That being said, it's possible/likely that some other header files that delete.c uses changed between 3.1.2 and 3.1.3. I.e., I'm sure something changed to make your compiler abort when building 3.1.3 and not when building 3.1.2.
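
One way to check that (a sketch, not something done in this thread): with both tarballs unpacked side by side as above, diff the whole romio314 component rather than just delete.c, and likewise the opal/ and ompi/ include directories that the failed compile line also pulls in:

$ diff -r openmpi-3.1.2/ompi/mca/io/romio314 \
          openmpi-3.1.3/ompi/mca/io/romio314 | less
$ diff -r openmpi-3.1.2/opal/include openmpi-3.1.3/opal/include | less
$ diff -r openmpi-3.1.2/ompi/include openmpi-3.1.3/ompi/include | less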

You might want to investigate why your compiler seg faulted.
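
A minimal sketch of that investigation, assuming GNU make with Automake silent rules (which is why the log only shows short "CC" lines) and the in-tree build visible in the paths above: re-run just the object that crashed with the full compiler command shown, and check whether the kernel recorded the compiler crash:

$ cd openmpi-3.1.3/ompi/mca/io/romio314/romio
$ make V=1 mpi-io/delete.lo                 # V=1 prints the full libtool/gcc invocation
$ dmesg | grep -iE 'segfault|cc1' | tail    # kernel-side record of the compiler crash, if any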

ontheklaud commented 6 years ago

Thanks for such a quick response. Here's some progress on building v3.1.3.

Miscellaneous

CUDA support: no
PMIx support: internal

Transports

Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel SCIF: no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers

Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems

Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no


* Milestone: Strangely, I managed to build the v3.1.3 source successfully, **but only with single-threaded make (-j1)**.

**[make -j1] Success**

(...)
  CC       monitoring_test.o
  CCLD     monitoring_test
  CC       test_pvar_access.o
  CCLD     test_pvar_access
  CC       test_overhead.o
  CCLD     test_overhead
  CC       check_monitoring.o
  CCLD     check_monitoring
  CC       example_reduce_count.o
  CCLD     example_reduce_count
make[2]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/test/monitoring'
make[2]: Entering directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/test'
make[2]: Nothing to be done for `all-am'.
make[2]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/test'
make[1]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/test'
make[1]: Entering directory `/home/xo/pyenv-ngraph/openmpi-3.1.3'
make[1]: Nothing to be done for `all-am'.
make[1]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3'
$


**[make -j16] Failed: first attempt (fresh source extract)**

(...)
  CC       pwin_lock_all_f.lo
  CC       pwin_post_f.lo
  CC       pwin_set_attr_f.lo
  CC       pwin_set_errhandler_f.lo
  CC       pwin_set_info_f.lo
  CC       pwin_set_name_f.lo
  CC       pwin_shared_query_f.lo
  CC       pwin_start_f.lo
  CC       pwin_sync_f.lo
  CC       pwin_test_f.lo
/bin/sh: line 2:  4100 Segmentation fault      /bin/sh ../../../../../libtool --silent --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../../../../../opal/include -I../../../../../ompi/include -I../../../../../oshmem/include -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/private/autogen -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/hwloc/autogen -I../../../../../ompi/mpiext/cuda/c -DOMPI_BUILD_MPI_PROFILING=1 -DOMPI_COMPILING_FORTRAN_WRAPPERS=1 -I../../../../.. -I../../../../../orte/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/hwloc/hwloc1117/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -MT pwin_start_f.lo -MD -MP -MF $depbase.Tpo -c -o pwin_start_f.lo pwin_start_f.c
make[3]: *** [pwin_start_f.lo] Error 139
make[3]: *** Waiting for unfinished jobs....
/bin/sh: line 2:  4085 Segmentation fault      /bin/sh ../../../../../libtool --silent --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../../../../../opal/include -I../../../../../ompi/include -I../../../../../oshmem/include -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/private/autogen -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/hwloc/autogen -I../../../../../ompi/mpiext/cuda/c -DOMPI_BUILD_MPI_PROFILING=1 -DOMPI_COMPILING_FORTRAN_WRAPPERS=1 -I../../../../.. -I../../../../../orte/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/hwloc/hwloc1117/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -MT pwin_shared_query_f.lo -MD -MP -MF $depbase.Tpo -c -o pwin_shared_query_f.lo pwin_shared_query_f.c
make[3]: *** [pwin_shared_query_f.lo] Error 139
make[3]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mpi/fortran/mpif-h/profile'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mpi/fortran/mpif-h'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi'


**[make -j16] Failed: second attempt (fresh source extract)**

(...)
  CC       ptype_get_contents_f.lo
  CC       ptype_get_envelope_f.lo
  CC       ptype_get_extent_f.lo
  CC       ptype_get_extent_x_f.lo
  CC       ptype_get_name_f.lo
  CC       ptype_get_true_extent_f.lo
  CC       ptype_get_true_extent_x_f.lo
  CC       ptype_hindexed_f.lo
  CC       ptype_hvector_f.lo
  CC       ptype_indexed_f.lo
  CC       ptype_lb_f.lo
  CC       ptype_match_size_f.lo
  CC       ptype_set_attr_f.lo
  CC       ptype_set_name_f.lo
  CC       ptype_size_f.lo
/bin/sh: line 2: 19489 Segmentation fault      /bin/sh ../../../../../libtool --silent --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../../../../../opal/include -I../../../../../ompi/include -I../../../../../oshmem/include -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/private/autogen -I../../../../../opal/mca/hwloc/hwloc1117/hwloc/include/hwloc/autogen -I../../../../../ompi/mpiext/cuda/c -DOMPI_BUILD_MPI_PROFILING=1 -DOMPI_COMPILING_FORTRAN_WRAPPERS=1 -I../../../../.. -I../../../../../orte/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/event/libevent2022/libevent/include -I/home/xo/pyenv-ngraph/openmpi-3.1.3/opal/mca/hwloc/hwloc1117/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -MT ptype_lb_f.lo -MD -MP -MF $depbase.Tpo -c -o ptype_lb_f.lo ptype_lb_f.c
make[3]: *** [ptype_lb_f.lo] Error 139
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mpi/fortran/mpif-h/profile'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi/mpi/fortran/mpif-h'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/xo/pyenv-ngraph/openmpi-3.1.3/ompi'
make: *** [all-recursive] Error 1



Here's my opinion:
- v3.1.3 builds without issue in my dev environment, **but only with single-threaded make (-j1)**
- my existing gcc is probably not the problem
- **multi-threaded make (e.g. -j16)** can trigger the segfault while building a fresh source extract (see the sketch below)
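
A sketch of how that hypothesis could be narrowed down (an assumption on my part, not a documented procedure): step -j up from 1 and note the first level at which the crash appears, keeping a log per level:

$ for j in 2 4 8 16; do
>   make clean > /dev/null
>   make -j"$j" > "make-j$j.log" 2>&1 || { echo "first failure at -j$j"; break; }
> done

When a parallel build does crash, simply re-running make (without cleaning) resumes where it stopped; if a different .lo file fails each time, the crash tracks system load rather than any particular source file.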

Is there any procedure I should look into?
Thanks!
jsquyres commented 6 years ago

There might be something else wrong with your OS installation -- e.g., do you have low memory and/or disk space? You might want to check places like /var/log/messages to see if any relevant error messages from gcc appeared there (e.g., gcc processes getting killed because the OS ran out of RAM or something).
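
A minimal sketch of those checks on a typical Linux box (the exact log file varies by distro; on systemd-based systems journalctl -k plays the role of /var/log/messages):

$ free -h          # RAM and swap actually available
$ df -h .          # free space on the filesystem holding the build tree
$ dmesg | grep -iE 'oom|killed process|segfault' | tail
$ sudo grep -iE 'oom|segfault' /var/log/messages | tail    # or: journalctl -k | grep -iE 'oom|segfault'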

There's not much Open MPI can do if the compiler seg faults. ☹️

ontheklaud commented 6 years ago

I agree with you 😿. I'll have to look into my OS installation at some point. RAM was sufficient (64 GiB) and so was disk space (Samsung 960 1TB, ~100 GB free). Anyway, v3.1.2 built fine with a multi-threaded make, while v3.1.3 could eventually only be built single-threaded.

If I hit an issue that is clearly related to ompi itself, I'll open a new issue; closing this one for now. Again, thanks for your comments while I looked for a workaround.