open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

nvcc undeclared builtins reporting failure -- configure test method for PMIx? #12277

Open olagarde opened 5 months ago

olagarde commented 5 months ago

This issue was also reported at the NVHPC forums. It’s unclear whether the fault lies with nvcc in NVHPC 23.1 and 23.11 or with the Open MPI 5.x configure method for getting the compiler to report undeclared builtins.


Background information

The 5.0.0 and 5.0.1 configure test for PMIx no longer works with NVHPC; the failure appears to be in the compiler check for undeclared builtins. The associated stub compiles and runs but does not produce output, which is interpreted as an inability to get the compiler to report undeclared builtins, and that halts configure since PMIx is required as of OMPI 5.x. GCC through 13.2.0 does not have this issue. The OMPI 4.x PMIx test stub differs (orted vs prrte?) and does not have this issue for GCC 12.2.0 / 13.2.0 or NVHPC 23.1 / 23.11. I am currently unable to try OMPI 5.0.2 or an external pmix/hwloc/libevent recent enough for pmix>=4.2 (policy). Can anyone please verify a working OMPI 5.x build with NVHPC 23.11 (CUDA 12.3) or 23.1 (CUDA 12.0)?

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

4.1.4 (no issue) 4.1.6 (no issue) 5.0.0 (has issue) 5.0.1 (has issue) 5.0.2rc1 TBD

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Source build using NVHPC 23.1 (CUDA 12.0) and 23.11 (CUDA 12.3), with CC=nvcc, FC=nvfortran, CXX=nvc++. Configure is scripted; the script block is:

            ...<archive copy/unroll, builddir create, set $distro and $basearch, etc>...
        module purge
        module load nvhpc/23.11
            ...<setup ./pbs-config for --with-tm>...
        export CFLAGS=''
        export FCFLAGS=''
        ../configure \
            --prefix=/opt/soft/$distro/$basearch/openmpi/5.0.1/nvhpc/23.11 \
            --x-includes=/usr/include \
            --x-libraries=/usr/lib64 \
            --enable-branch-probabilities \
            --enable-dependency-tracking \
            --enable-mpi-ext=all \
            --with-pmix=internal \
            --enable-pmix-timing \
            --with-package-string="Open MPI 5.0.1 with NVHPC 23.11" \
            --with-ident-string="NRLDC CCS" \
            --enable-ipv6 \
            --enable-heterogeneous \
            --enable-hwloc-pci \
            --with-hwloc=internal \
            --with-ofi \
            --with-verbs \
            --with-tm="$PBS_EXEC" \
            --enable-sparse-groups \
            --enable-peruse \
            --enable-mpi-fortran=all \
            CC=nvcc \
            FC=nvfortran \
            CXX=nvc++
            # --enable-mpi-cxx \ # (C++ bindings no longer supported)
            # --enable-mpi-cxx-seek \ # (C++ bindings no longer supported)

Please describe the system on which you are running


Details of the problem

Configure fails at PMIx checking:

checking for nvcc options needed to detect all undeclared functions... cannot detect
configure: error: in `/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix':
configure: error: cannot make nvcc report undeclared builtins

The outer config.log for this failure is:

configure:5795: *** Configuring PMIx
configure:63782: ===== configuring 3rd-party/openpmix =====
configure:63971: running /bin/sh ../../../3rd-party/openpmix/configure --disable-option-checking '--prefix=/opt/soft/el8/aarch64/openmpi/5.0.1/nvhpc/23.11' --without-tests-examples --enable-pmix-binaries --disable-pmix-backward-compatibility --disable-visibility --disable-hwloc-lib-checks --with-hwloc-extra-libs="/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/hwloc-2.7.1/hwloc/libhwloc.la" '--x-includes=/usr/include' '--x-libraries=/usr/lib64' '--enable-branch-probabilities' '--enable-dependency-tracking' '--enable-mpi-ext=all' '--enable-pmix-timing' '--with-package-string=Open MPI 5.0.1 with NVHPC 23.11' '--with-ident-string=NRLDC CCS' '--enable-ipv6' '--enable-heterogeneous' '--enable-hwloc-pci' '--with-ofi' '--with-verbs' '--with-tm=' '--enable-sparse-groups' '--enable-peruse' '--enable-mpi-fortran=all' 'CC=nvcc' 'CFLAGS=' 'CPPFLAGS=-I/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/hwloc-2.7.1/include -I/tmp/openmpi-nvhpc/openmpi-5.0.1/3rd-party/hwloc-2.7.1/include' 'CXX=nvc++' 'FC=nvfortran' 'FCFLAGS=' 'CPP=cpp' 'PKG_CONFIG_PATH=/opt/soft/el8/aarch64/ucx/1.13.1/lib/pkgconfig:/opt/soft/el8/aarch64/openssl/1.1.1s/lib/pkgconfig' --cache-file=/dev/null --srcdir=../../../3rd-party/openpmix
configure:63991: ===== done with 3rd-party/openpmix configure =====
configure:65532: error: Could not find viable pmix build.

The inner config.log (/3rd-party/openpmix/config.log) for this failure is:

...snip snip...
| /* end confdefs.h.  */
| #include <float.h>
| #include <limits.h>
| #include <stdarg.h>
| #include <stddef.h>
| extern void ac_decl (int, char *);
|
| int
| main (void)
| {
| (void) ac_decl (0, (char *) 0);
|   (void) ac_decl;
|
|   ;
|   return 0;
| }
configure:18028: result: cannot detect
configure:18032: error: in `/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix':
configure:18034: error: cannot make nvcc report undeclared builtins

The stub compiles and runs but does not produce output under nvcc in NVHPC 23.1 or 23.11.
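The check can be approximated by hand outside of configure. The following is a minimal sketch, not the actual configure code: it assumes the probe first expects a program that references an undeclared strchr to fail to compile, then expects the ac_decl program quoted above to compile, trying no extra option and then -fno-builtin. The function name probe_undeclared_builtins is hypothetical, and the result strings follow my understanding of Autoconf's behaviour. It has not been run against nvcc.

```shell
#!/bin/sh
# Hand-run approximation of the undeclared-builtins probe.
# probe_undeclared_builtins is a hypothetical helper, not part of Open MPI or PMIx.
probe_undeclared_builtins() (
  cc_cmd=${CC:-cc}
  dir=$(mktemp -d) || exit 1
  cd "$dir" || exit 1

  # Program 1: should NOT compile, because strchr is never declared.
  printf '%s\n' 'int' 'main (void)' '{' '  (void) strchr;' '  return 0;' '}' \
    > undeclared.c

  # Program 2: should compile, because ac_decl is declared before use.
  printf '%s\n' \
    '#include <float.h>' '#include <limits.h>' \
    '#include <stdarg.h>' '#include <stddef.h>' \
    'extern void ac_decl (int, char *);' \
    'int' 'main (void)' '{' \
    '  (void) ac_decl (0, (char *) 0);' '  (void) ac_decl;' \
    '  return 0;' '}' > declared.c

  result='cannot detect'
  for arg in '' -fno-builtin; do
    # If the undeclared reference is accepted, this option is not enough.
    if $cc_cmd $CFLAGS $arg -c undeclared.c -o undeclared.o >/dev/null 2>&1; then
      continue
    fi
    # The declared program must still compile with the same flags.
    if $cc_cmd $CFLAGS $arg -c declared.c -o declared.o >/dev/null 2>&1; then
      result="${arg:-none needed}"
      break
    fi
  done

  cd / && rm -rf "$dir"
  printf '%s\n' "$result"
)

echo "undeclared-builtin options for ${CC:-cc}: $(probe_undeclared_builtins)"
```

Running it once with CC=nvcc and the same CFLAGS that configure passes down, and once with CC=gcc, should show whether the difference is in the compiler or in the flags. Note that under this logic any flag in CFLAGS that the compiler rejects outright would make both compiles fail, which also yields "cannot detect".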

hppritcha commented 5 months ago

you may be able to work around this by adding --disable-devel-check to the configure line

rhc54 commented 5 months ago

I don't think that will help - they aren't getting thru configure, and that flag only impacts compile.

Might be an issue with the OAC m4's, and just hitting it in PMIx first. Not sure of the order of processing in OMPI. Do you know which configure test is failing?

olagarde commented 5 months ago

The toplevel output makes it through "Configuring PMIx" with one obvious error in that block:

checking whether -lc should be explicitly linked in... 
checking dynamic linker characteristics... nvcc fatal   : Unknown option '-print-search-dirs'
GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
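A quick way to see whether a given compiler driver accepts that option, independent of configure, is sketched below. The helper name supports_print_search_dirs is hypothetical, and the check only looks at the exit status of the driver.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given compiler driver accepts
# -print-search-dirs (judged only by its exit status).
supports_print_search_dirs() {
  "$1" -print-search-dirs >/dev/null 2>&1
}

cc_driver=${CC:-cc}
if supports_print_search_dirs "$cc_driver"; then
  echo "$cc_driver: accepts -print-search-dirs"
else
  echo "$cc_driver: does not accept -print-search-dirs"
fi
```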

Then at "C compiler and preprocessor" it gets down to:

    ...
checking for nvcc options needed to detect all undeclared functions... cannot detect
configure: error: in `/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix':
configure: error: cannot make nvcc report undeclared builtins
See `config.log' for more details
configure: ===== done with 3rd-party/openpmix configure =====
configure: error: Could not find viable pmix build.

The print-search-dirs is a GCC-ism so I'm assuming that's ok. If so, the caller immediately before the error is this block in configure, starting at line 18028:

# The cast to long int works around a bug in the HP C Compiler,
# see AC_CHECK_SIZEOF for more information.
{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking alignment of float" >&5
printf %s "checking alignment of float... " >&6; }
if test ${ac_cv_alignof_float+y}
then :
  printf %s "(cached) " >&6
else $as_nop
  if ac_fn_c_compute_int "$LINENO" "(long int) offsetof (ac__type_alignof_, y)" "ac_cv_alignof_float"        "$ac_includes_default
                            #include <stdbool.h>

That's the line reference in the 3rd-party/openpmix/config.log, anyway. As for the original 3rd-party/openpmix/configure, that starts at line 17962 with

{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for $CC options needed to detect all undeclared functions" >&5
printf %s "checking for $CC options needed to detect all undeclared functions... " >&6; }
if test ${ac_cv_c_undeclared_builtin_options+y}
then :
  printf %s "(cached) " >&6
else $as_nop
  ac_save_CFLAGS=$CFLAGS
   ac_cv_c_undeclared_builtin_options='cannot detect'
   for ac_arg in '' -fno-builtin; do
     CFLAGS="$ac_save_CFLAGS $ac_arg"
     # This test program should *not* compile successfully.
     cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h.  */

int
main (void)
{
(void) strchr;
  ;
  return 0;
}
_ACEOF
if ac_fn_c_try_compile "$LINENO"
then :

else $as_nop
  # This test program should compile successfully.
        # No library function is consistently available on
        # freestanding implementations, so test against a dummy
        # declaration.  Include always-available headers on the
        # off chance that they somehow elicit warnings.
        cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h.  */
#include <float.h>
#include <limits.h>
#include <stdarg.h>
#include <stddef.h>
extern void ac_decl (int, char *);

int
main (void)
{
(void) ac_decl (0, (char *) 0);
  (void) ac_decl;

  ;
  return 0;
}
_ACEOF

rhc54 commented 5 months ago

No ideas - I don't know where this hits in configure, but it certainly isn't in anything I'm familiar with 🤷‍♂️

olagarde commented 5 months ago

Toplevel configure starts at line 18028, 3rd-party/openpmix/configure starts at line 17962. Stub is 'ac_cv_c_undeclared_builtin_options'.
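For anyone trying to locate the same spot in their own build tree, a small grep over the inner config.log is usually enough. This is a sketch: the helper name find_undeclared_check is hypothetical, and the two search strings are taken from the configure messages quoted in this issue.

```shell
#!/bin/sh
# Hypothetical helper: print every line (with its line number) of a
# config.log that mentions the undeclared-functions/builtins check.
find_undeclared_check() {
  grep -n -e 'undeclared functions' -e 'undeclared builtins' "$1"
}

# Example call, run from the build directory:
#   find_undeclared_check 3rd-party/openpmix/config.log
```

The lines between the "checking for ... undeclared functions" message and the "cannot detect" result are where the compiler invocations and their diagnostics are recorded.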

hppritcha commented 5 months ago

I get a different problem on our gpu cluster

configure: WARNING: PMIx requires a C99 (or newer) compiler. C11 is recommended.
configure: error: Aborting.

this is with hpc_sdk 22.7.

What's the nvcc argument for requesting c11 or newer? the nvcc man page doesn't seem very useful for getting an answer to this question.

olagarde commented 5 months ago

@hppritcha, you can try -std=c++11, but I haven't tried OpenMPI 4.x or 5.x with that old of an NVHPC bundle. Nvidia release cadence is pretty fast and a lot changed between 22 and 23. Personally I'd start with the last OMPI 4.x and NVHPC 23.x.
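To see which C standard a compiler actually reports with a given set of flags, one option is to ask the preprocessor for __STDC_VERSION__. This is only a sketch: the helper name c_std_version is hypothetical, it assumes the driver accepts -E on a .c file, and I have not run it against nvcc. It also does not claim to be the test PMIx's configure performs.

```shell
#!/bin/sh
# Hypothetical helper: print the value of __STDC_VERSION__ as seen by
# ${CC:-cc} with $CFLAGS, or "unknown" if preprocessing fails.
c_std_version() {
  dir=$(mktemp -d) || return 1
  printf 'stdc_version=__STDC_VERSION__\n' > "$dir/probe.c"
  # -E stops after preprocessing, so the macro appears expanded in the output.
  value=$(${CC:-cc} $CFLAGS -E "$dir/probe.c" 2>/dev/null |
          sed -n 's/^stdc_version *= *//p')
  rm -rf "$dir"
  printf '%s\n' "${value:-unknown}"
}

c_std_version
```

A value of 199901L corresponds to C99 and 201112L to C11. If the macro name itself is printed back unexpanded, the compiler is not defining it in that mode.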

olagarde commented 5 months ago

Further updates are in the NVHPC forums ticket, ruling out NVIDIA's question of CC=nvc vs CC=nvcc as a possible culprit. The former is the NVHPC C compiler, but using it introduces a number of additional config, compile, or check errors for both 4.x and 5.x ompi. The latter is the CUDA C++ frontend; it was advised in a prior NVIDIA issue we had supporting ompi 4.x and nvhpc 22.x, and it still works with all nvhpc 23.x builds of ompi 4.x and 5.x [with the exception of the 5.x issue stated here].

olagarde commented 5 months ago

NVHPC [Employee] cparrot replicated the OP with -Wno-unused-parameter (unsupported by NVHPC compilers), see:

The configure option --enable-devel-check defaults to no; when it is "no", configure sets WANT_PICKY_COMPILER=0 and adds -Wno-unused-parameter to CFLAGS. Adding --enable-devel-check to the config in the OP gets configure to succeed [with pmix=internal]; the compile then fails with

make[3]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src'
Making all in include
make[4]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src/include'
make  all-am
make[5]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src/include'
  CC       pmix_globals.lo
nvcc fatal   : Value '-MT' is not defined for option 'Werror'
make[5]: *** [Makefile:808: pmix_globals.lo] Error 1

That's openmpi-5.0.1/build/3rd-party/openpmix/src/include/Makefile:808-811, the only use of $LTCOMPILE:

.c.lo:
        $(AM_V_CC)depbase=`echo $@ | sed 's|[^/]*$$|$(DEPDIR)/&|;s|\.lo$$||'`;\
        $(LTCOMPILE) -MT $@ -MD -MP -MF $$depbase.Tpo -c -o $@ $< &&\
        $(am__mv) $$depbase.Tpo $$depbase.Plo

The -MT arg is dependency-target-name for nvcc just like gcc, so ... expansion issue elsewhere upstream from this rule?
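One possible explanation, offered as an assumption rather than a confirmed diagnosis: the error text suggests nvcc treats -Werror as an option that takes a value, so a bare -Werror at the end of the compile flags would consume the -MT that the rule appends. The sketch below scans a flag string for that pattern; the helper name bare_werror_followed_by is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: if a bare -Werror is immediately followed by another
# option, print that option and succeed; otherwise fail.
bare_werror_followed_by() {
  prev=
  # Intentional word splitting: $1 is a whitespace-separated flag string.
  for tok in $1; do
    if [ "$prev" = "-Werror" ]; then
      case $tok in
        -*) printf '%s\n' "$tok"; return 0 ;;
      esac
    fi
    prev=$tok
  done
  return 1
}

# Example with a made-up flag string shaped like the failing rule:
bare_werror_followed_by "-g -Werror -MT pmix_globals.lo -MD -MP"
```

Feeding it the expanded $(LTCOMPILE) value from the generated Makefile (for instance copied from the output of make V=1) would show whether a bare -Werror sits directly in front of -MT.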

olagarde commented 5 months ago

Testing the other side of the --enable-devel-check block, using --disable-devel-check and manually editing to remove the addition of -Wno-unused-parameter at:

while leaving WANT_PICKY_COMPILER=0 results in configure running and setting up the internal pmix, then the build dies with several hundred errors like

NVFORTRAN-W-0031-Illegal data type length specifier for complex (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for x (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for complex (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for x (sizeof_f08.f90: 151)

and subsequent -lcudart failures.

olagarde commented 5 months ago

Here are two combinations that appear to work, at least as far as build, check, and simplistic hybrid mpi/mp batch jobs go (a homebrew Jacobian matrix calculation and stock HYCOM as benchmarks):

There are several hundred instances of things like

"../../../test/datatype/ddt_pack.c", line 250: warning: transfer of control bypasses initialization of: [branch_past_initialization]
            variable "type" (declared at line 413)
      if (ret != 0) goto cleanup;

and

"../../../test/datatype/ddt_raw2.c", line 234: warning: integer conversion resulted in a change of sign [integer_sign_change]
          { .loop = { { 16, 0}, 2, 3, -1, 16} },

for the ompi 5.x and nvhpc 23.x (nvc, forced -fPIC) builds that don't occur elsewhere. This can mask edge-condition errors, but (a) AFAICT there aren't any errors and the successful tests are true positives; (b) these only occur in the testcases so ... meh?

rhc54 commented 5 months ago

I have removed the -Wno-unused-parameter from PMIx when --disable-devel-check is in effect. Note that the devel-check is only automatically enabled when in a Git clone - it is not active in a tarball.
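To tell which of the two cases a given source tree falls into, a check for a .git entry at the top of the tree is a reasonable first guess. This is a sketch under the assumption that the presence of .git is what distinguishes a clone from a tarball; the helper name source_kind is hypothetical and is not how configure is claimed to decide.

```shell
#!/bin/sh
# Hypothetical helper: guess whether a source tree is a Git clone or an
# unpacked tarball by looking for a .git entry (directory or file).
source_kind() {
  if [ -e "$1/.git" ]; then
    echo "git clone"
  else
    echo "tarball"
  fi
}

source_kind "${1:-.}"
```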

Can't help with the other problems 🤷‍♂️

jsquyres commented 4 months ago

The OpenPMIx issue should be resolved when the Open MPI v5.0.x submodule pointer advances to v4.2.9 or beyond.
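Once the bundled OpenPMIx version is known, a quick comparison against 4.2.9 can be scripted. This is a sketch: the helper name version_ge is hypothetical, it relies on sort -V (available in GNU coreutils, not guaranteed by POSIX), and how the bundled version string is obtained is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on sort -V for version-aware ordering.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Example with a placeholder version string:
bundled=4.2.7
if version_ge "$bundled" 4.2.9; then
  echo "OpenPMIx $bundled should contain the fix"
else
  echo "OpenPMIx $bundled predates 4.2.9"
fi
```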

To continue this issue (it's gotten quite complicated), it would be good to see results from after the OpenPMIx submodule pointer is advanced -- i.e., see what that fixes and what is left to be addressed.

@wenduwan @janjust Is there a timeline for when the v5.0.x submodule pointers will be advanced?