Open olagarde opened 9 months ago
You may be able to work around this by adding --disable-devel-check to the configure line.
I don't think that will help - they aren't getting through configure, and that flag only impacts compile.
Might be an issue with the OAC m4's, and just hitting it in PMIx first. Not sure of the order of processing in OMPI. Do you know which configure test is failing?
The toplevel output makes it through "Configuring PMIx" with one obvious error in that block:
checking whether -lc should be explicitly linked in...
checking dynamic linker characteristics... nvcc fatal : Unknown option '-print-search-dirs'
GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
Then at "C compiler and preprocessor" it gets down to:
...
checking for nvcc options needed to detect all undeclared functions... cannot detect
configure: error: in `/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix':
configure: error: cannot make nvcc report undeclared builtins
See `config.log' for more details
configure: ===== done with 3rd-party/openpmix configure =====
configure: error: Could not find viable pmix build.
The print-search-dirs is a GCC-ism so I'm assuming that's ok. If so, the caller immediately before the error is this block in configure, starting at line 18028:
# The cast to long int works around a bug in the HP C Compiler,
# see AC_CHECK_SIZEOF for more information.
{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking alignment of float" >&5
printf %s "checking alignment of float... " >&6; }
if test ${ac_cv_alignof_float+y}
then :
printf %s "(cached) " >&6
else $as_nop
if ac_fn_c_compute_int "$LINENO" "(long int) offsetof (ac__type_alignof_, y)" "ac_cv_alignof_float" "$ac_includes_default
#include <stdbool.h>
That's the line reference in the 3rd-party/openpmix/config.log, anyway. As for the original 3rd-party/openpmix/configure, that starts at line 17962 with
{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for $CC options needed to detect all undeclared functions" >&5
printf %s "checking for $CC options needed to detect all undeclared functions... " >&6; }
if test ${ac_cv_c_undeclared_builtin_options+y}
then :
printf %s "(cached) " >&6
else $as_nop
ac_save_CFLAGS=$CFLAGS
ac_cv_c_undeclared_builtin_options='cannot detect'
for ac_arg in '' -fno-builtin; do
CFLAGS="$ac_save_CFLAGS $ac_arg"
# This test program should *not* compile successfully.
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
int
main (void)
{
(void) strchr;
;
return 0;
}
_ACEOF
if ac_fn_c_try_compile "$LINENO"
then :
else $as_nop
# This test program should compile successfully.
# No library function is consistently available on
# freestanding implementations, so test against a dummy
# declaration. Include always-available headers on the
# off chance that they somehow elicit warnings.
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <float.h>
#include <limits.h>
#include <stdarg.h>
#include <stddef.h>
extern void ac_decl (int, char *);
int
main (void)
{
(void) ac_decl (0, (char *) 0);
(void) ac_decl;
;
return 0;
}
_ACEOF
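The logic of that check can be reproduced outside configure. The following is a minimal sketch, not Autoconf's own code: `probe_undeclared_builtins` is a hypothetical helper name, the test programs are shortened versions of the ones quoted above, and the `none needed` result value is an assumption based on how Autoconf usually reports this check (the quoted fragment is cut off before that point).

```shell
# Hypothetical helper, not part of Autoconf or Open MPI.
# $1 is the compiler command to test (left unquoted below so it may carry options).
probe_undeclared_builtins() {
    cc=$1
    dir=$(mktemp -d) || return 2
    # Program 1 uses strchr without declaring it; the compiler must REJECT it.
    printf 'int main (void) { (void) strchr; return 0; }\n' > "$dir/bad.c"
    # Program 2 uses a declared dummy function; the compiler must ACCEPT it.
    printf 'extern void ac_decl (int, char *);\nint main (void) { (void) ac_decl; return 0; }\n' > "$dir/good.c"
    result='cannot detect'
    for arg in '' -fno-builtin; do
        if $cc $arg -c "$dir/bad.c" -o "$dir/bad.o" >/dev/null 2>&1; then
            continue    # undeclared use was accepted, so try the next option
        fi
        if $cc $arg -c "$dir/good.c" -o "$dir/good.o" >/dev/null 2>&1; then
            # Assumed result naming: empty option means no extra flag is required.
            if [ -z "$arg" ]; then result='none needed'; else result=$arg; fi
            break
        fi
    done
    rm -rf "$dir"
    printf '%s\n' "$result"
}
```

Running `probe_undeclared_builtins gcc` and `probe_undeclared_builtins nvcc` side by side should show whether the compiler accepts the undeclared `strchr` program with both option sets, which is the condition that leaves the result at `cannot detect`.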
No ideas - I don't know where this hits in configure, but it certainly isn't in anything I'm familiar with 🤷♂️
Toplevel configure starts at line 18028, 3rd-party/openpmix/configure starts at line 17962. Stub is 'ac_cv_c_undeclared_builtin_options'.
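To confirm which cached result configure ended up with, the cache variable can be pulled straight out of a config.log. This is a small sketch; `cache_value` is a hypothetical helper name, and it assumes the log contains the usual `name=value` lines in its cache-variables section.

```shell
# Hypothetical helper: print the last recorded value of a configure cache variable.
# $1 = path to a config.log, $2 = cache variable name.
cache_value() {
    # Keep only lines of the form NAME=value and strip the "NAME=" prefix.
    sed -n "s/^$2=//p" "$1" | tail -n 1
}
```

For example, `cache_value 3rd-party/openpmix/config.log ac_cv_c_undeclared_builtin_options` would print the quoted value as stored in the log, or nothing if the variable was never set.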
I get a different problem on our gpu cluster
configure: WARNING: PMIx requires a C99 (or newer) compiler. C11 is recommended.
configure: error: Aborting.
this is with hpc_sdk 22.7.
What's the nvcc argument for requesting C11 or newer? The nvcc man page doesn't seem very useful for getting an answer to this question.
@hppritcha, you can try -std=c++11, but I haven't tried OpenMPI 4.x or 5.x with that old of an NVHPC bundle. Nvidia release cadence is pretty fast and a lot changed between 22 and 23. Personally I'd start with the last OMPI 4.x and NVHPC 23.x.
Further updates are in the NVHPC forums ticket, ruling out NVIDIA's question of CC=nvc vs CC=nvcc as a possible culprit. The former is the NVHPC C compiler, but using it introduces a number of additional config, compile, or check errors for both 4.x and 5.x ompi. The latter is the CUDA C++ frontend; it was advised in a prior NVIDIA issue we had supporting ompi 4.x and nvhpc 22.x, and still works with all nvhpc 23.x builds of ompi 4.x and 5.x [with the exception of the 5.x issue stated here].
NVHPC [Employee] cparrot replicated the OP with -Wno-unused-parameter (unsupported by NVHPC compilers), see:
Configure option --enable-devel-check defaults to no; when it is "no", configure sets WANT_PICKY_COMPILER=0 and adds -Wno-unused-parameter to CFLAGS. Adding --enable-devel-check to the config in the OP gets configure to succeed [with pmix=internal], but compile then fails with
make[3]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src'
Making all in include
make[4]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src/include'
make all-am
make[5]: Entering directory '/tmp/openmpi-nvhpc/openmpi-5.0.1/build/3rd-party/openpmix/src/include'
CC pmix_globals.lo
nvcc fatal : Value '-MT' is not defined for option 'Werror'
make[5]: *** [Makefile:808: pmix_globals.lo] Error 1
That's openmpi-5.0.1/build/3rd-party/openpmix/src/include/Makefile:808-811, the only use of $LTCOMPILE:
.c.lo:
$(AM_V_CC)depbase=`echo $@ | sed 's|[^/]*$$|$(DEPDIR)/&|;s|\.lo$$||'`;\
$(LTCOMPILE) -MT $@ -MD -MP -MF $$depbase.Tpo -c -o $@ $< &&\
$(am__mv) $$depbase.Tpo $$depbase.Plo
The -MT arg is the dependency target name.
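The wording of the nvcc error suggests that a bare -Werror at the end of the compile flags is being read as an option that takes a value, so the -MT that the make rule appends gets consumed as that value. That is an inference from the message, not something confirmed in this thread. A minimal sketch for spotting the pattern in a flag list follows; `werror_next_token` is a hypothetical helper name.

```shell
# Hypothetical helper: print the token that directly follows a bare -Werror.
# Returns 0 if such a token exists, 1 otherwise.
werror_next_token() {
    prev=
    for tok in "$@"; do
        if [ "$prev" = "-Werror" ]; then
            printf '%s\n' "$tok"
            return 0
        fi
        prev=$tok
    done
    return 1
}
```

Calling it as `werror_next_token $CFLAGS -MT pmix_globals.lo -MD -MP` (with `$CFLAGS` deliberately unquoted so it splits into separate flags) shows which token would be taken as the value if the compiler expects one.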
Testing the other side of the --enable-devel-check block, using --disable-devel-check and manually editing to remove the addition of -Wno-unused-parameter at:
while leaving WANT_PICKY_COMPILER=0 results in configure running and setting up the internal pmix, then the build dies with several hundred errors like
NVFORTRAN-W-0031-Illegal data type length specifier for complex (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for x (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for complex (sizeof_f08.f90: 151)
NVFORTRAN-W-0031-Illegal data type length specifier for x (sizeof_f08.f90: 151)
and subsequent -lcudart failures.
Here are two combinations that appear to work, at least as far as build, check, and simplistic hybrid mpi/mp batch jobs go (homebrew Jacobian matrix calc and stock HYCOM as benchmarks):
There are several hundred instances of things like
"../../../test/datatype/ddt_pack.c", line 250: warning: transfer of control bypasses initialization of: [branch_past_initialization]
variable "type" (declared at line 413)
if (ret != 0) goto cleanup;
and
"../../../test/datatype/ddt_raw2.c", line 234: warning: integer conversion resulted in a change of sign [integer_sign_change]
{ .loop = { { 16, 0}, 2, 3, -1, 16} },
for the ompi 5.x and nvhpc 23.x (nvc, forced -fPIC) combination that don't occur elsewhere. This can mask edge-condition errors, but (a) AFAICT there aren't any errors and the successful tests are true positives; (b) these only occur in the testcases so ... meh?
I have removed the -Wno-unused-parameter from PMIx when --disable-devel-check is in effect. Note that the devel-check is only automatically enabled when in a Git clone - it is not active in a tarball.
Can't help with the other problems 🤷♂️
The OpenPMIx issue should be resolved when the Open MPI v5.0.x submodule pointer advances to v4.2.9 or beyond.
To continue this issue (it's gotten quite complicated), it would be good to see results from after the OpenPMIx submodule pointer is advanced -- i.e., see what that fixes and what is left to be addressed.
@wenduwan @janjust Is there a timeline for when the v5.0.x submodule pointers will be advanced?
This issue is also reported at the NVHPC forums. It's unclear whether this is nvcc in NVHPC 23.1 and 23.11 or the OpenMPI 5.x configuration method for getting the compiler to report undeclared builtins.
Background information
The 5.0.0 and 5.0.1 configure test for PMIx no longer works with NVHPC; the failure appears to be in the compiler check for undeclared builtins. The associated stub compiles and runs but does not produce output, which is interpreted as an inability to get the compiler to report undeclared builtins. This halts configure, since pmix is required as of OMPI 5.x. GCC through 13.2.0 does not have this issue. The OMPI 4.x pmix test stub differs (orted vs prrte?) and does not have this issue for GCC 12.2.0 / 13.2.0 or NVHPC 23.1 / 23.11. Currently unable to try OMPI 5.0.2 or an external pmix/hwloc/libevent recent enough for pmix>=4.2 (policy). Can anyone please verify a working OMPI 5.x build with NVHPC 23.11 (CUDA 12.3) or 23.1 (CUDA 12.0)?
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
4.1.4 (no issue) 4.1.6 (no issue) 5.0.0 (has issue) 5.0.1 (has issue) 5.0.2rc1 TBD
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Source build using NVHPC 23.1 (CUDA 12.0) and 23.11 (CUDA 12.3), CC=nvcc, FC=nvfortran, CXX=nvc++. Configure is scripted; the script block is:
Please describe the system on which you are running
Details of the problem
Configure fails at PMIx checking:
The outer config.log for this failure is:
The inner config.log (/3rd-party/openpmix/config.log) for this failure is:
The stub compiles and runs but does not produce output under nvcc in NVHPC 23.1 or 23.11.