open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

PGF90-S-0081-Illegal selector - KIND value must be non-negative with PGI 18.10 and OpenMPI master (v4.0.x) #6243

Open azrael417 opened 5 years ago

azrael417 commented 5 years ago

Thank you for taking the time to submit an issue!

Background information

I am trying to compile OpenMPI with UCX support, and `make install` fails when I build with the PGI compilers.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

The version I am using is the current master branch, commit 748d8b6b4bd644cfa9dc8ceb024b066d99858d73

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Here is the install script

#compiler settings
    #Download from Repository
    git clone https://github.com/open-mpi/ompi.git

    #Build Library
    #compiler settings
    export CC=gcc
    export CXX=g++
    export FC=gfortran
    export F90=gfortran
    #install
    export OMPI_PREFIX=$(pwd)/install_pgi/ompi
    pushd ompi
    make clean
    make distclean
    export MPI_HOME=${OMPI_PREFIX}
    export CXXFLAGS="-march=native"
    export CFLAGS="-march=native"
    export FCFLAGS="-march=native"
    ./autogen.pl
    ./configure --prefix=${OMPI_PREFIX} --enable-mpirun-prefix-by-default --with-cuda=${CUDA_HOME} --with-ucx=${UCX_HOME}
    make -j8 install
    popd

In this version I compiled with gcc and then hacked the compiler wrapper descriptors to work with PGI:

sed -i 's|compiler=gfortran|compiler=pgfortran|g;s|-pthread||g' ${OMPI_PREFIX}/share/openmpi/mpif90-wrapper-data.txt
sed -i 's|compiler=gcc|compiler=pgcc|g;s|-pthread||g' ${OMPI_PREFIX}/share/openmpi/mpicc-wrapper-data.txt
sed -i 's|compiler=g++|compiler=pg++|g;s|-pthread||g' ${OMPI_PREFIX}/share/openmpi/mpic++-wrapper-data.txt

In that case, the error specified above will occur when another app is compiled with pgi against this ompi.

Please describe the system on which you are running

This is the OS info

NAME="openSUSE Leap"
VERSION="42.3"
ID=opensuse
ID_LIKE="suse"
VERSION_ID="42.3"
PRETTY_NAME="openSUSE Leap 42.3"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:42.3"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

I guess the network type etc. is not very important for this bug, but the system is essentially a Cray CS Storm system.

https://www.cray.com/products/computing/cs-series/cs-storm


Details of the problem

As a reproducer, try to run the above script with the PGI 18.10 compiler and the following modifications:

#compiler settings
    #Download from Repository
    git clone https://github.com/open-mpi/ompi.git

    #Build Library
    #compiler settings
    export CC=pgcc
    export CXX=pg++
    export FC=pgfortran
    export F90=pgf90
    #install
    export OMPI_PREFIX=$(pwd)/install_pgi/ompi
    pushd ompi
    make clean
    make distclean
    export MPI_HOME=${OMPI_PREFIX}
    export CXXFLAGS="-march=native"
    export CFLAGS="-march=native"
    export FCFLAGS="-march=native"
    ./autogen.pl
    ./configure --prefix=${OMPI_PREFIX} --enable-mpirun-prefix-by-default --with-cuda=${CUDA_HOME} --with-ucx=${UCX_HOME}
    make -j8 install
    popd

It will fail in the install stage with the error mentioned above.

If you need more information, please let me know. Maybe I am missing an essential setting as well. I think that the issue is OpenMPI related and not UCX, thus I did not provide the UCX build info. I can add that upon request though.

Karl-JT commented 5 years ago

Hi, I have the same issue with my pgi compiler and openMPI. Is this already solved?

hjelmn commented 5 years ago

Wait, are you mixing fortran compilers? That is a big no-no.

Fortran is a total PIA and needs to go away. You must build Open MPI with each compiler and each version as there is no guarantee of compatibility even within compilers.

Also, why UCX on Cray? You will get much better performance with the native support in Open MPI.

azrael417 commented 5 years ago

On a DGX-1 we have seen that UCX gets much better performance than mvapich or mpich. For the Storm system we are looking at, it might actually help, especially since it is supposed to make better use of nvlink. Where am I mixing fortran compilers? The issue occurs in the install step if you compile OpenMPI with pgfortran/pgf90.

hjelmn commented 5 years ago

Ah, so you are using send/recv on GPU buffers? Haven't bothered with that for the native uGNI support as we don't have a GPU-enabled Cray to test on.

Looking at your install script I clearly see Open MPI built with gfortran not pgfortran. That would mean the bindings are built for gfortran not pgfortran.

hjelmn commented 5 years ago

Oh I see. You have two scripts.

azrael417 commented 5 years ago

Please look at the second script. In the first one I compiled with GNU and then changed the compilers in the wrapper txt files; in the second attempt I tried building natively. The second one works perfectly for 3.1.x, but not for the master branch.

hjelmn commented 5 years ago

Can you give the complete error?

hjelmn commented 5 years ago

@azrael417 Yeah, don't do the first one. That will not work. You can't mix pgfortran and gfortran. The second one should work, so there is definitely a problem there, though it could be in PGI or in Open MPI.
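
For readers following along, one way to see which underlying compiler an installed Open MPI wrapper actually invokes is the wrappers' `--showme:command` option. The sketch below assumes the wrappers from `${OMPI_PREFIX}/bin` are on `PATH`; `show_wrapper_compiler` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch: print the underlying compiler behind each Open MPI wrapper.
# show_wrapper_compiler is a hypothetical helper, not part of Open MPI.
show_wrapper_compiler() {
    wrapper="$1"
    if command -v "$wrapper" >/dev/null 2>&1; then
        # --showme:command prints only the compiler the wrapper would call.
        "$wrapper" --showme:command
    else
        echo "not found"
    fi
}

for w in mpicc mpic++ mpif90; do
    echo "$w: $(show_wrapper_compiler "$w")"
done
```

Note that this only reports what the wrapper calls. If the wrapper data files were edited after a gfortran build, as in the first script, the Fortran modules and bindings are still the ones gfortran produced.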

azrael417 commented 5 years ago

I will reproduce the complete error. That takes a little while, please hang on.

hjelmn commented 5 years ago

I can't debug directly as we no longer pay for PGI on our Cray systems.

azrael417 commented 5 years ago

Interesting, now I get a memkind error. I think I have seen that before but don't know how I worked around it.

  CC       mpool_memkind_component.lo
  CC       mpool_memkind_module.lo
PGC-S-0043-Redefinition of symbol, memkind_memtype_t (/usr/include/memkind.h: 44)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_DEFAULT (/usr/include/memkind.h: 49)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_DEFAULT (/usr/include/memkind.h: 49)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_HIGH_BANDWIDTH (/usr/include/memkind.h: 58)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_HIGH_BANDWIDTH (/usr/include/memkind.h: 58)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 58)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 58)
PGC-S-0043-Redefinition of symbol, memkind_policy_t (/usr/include/memkind.h: 64)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_LOCAL (/usr/include/memkind.h: 71)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_ALL (/usr/include/memkind.h: 78)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_ALL (/usr/include/memkind.h: 78)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_PREFERRED_LOCAL (/usr/include/memkind.h: 86)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_PREFERRED_LOCAL (/usr/include/memkind.h: 86)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_LOCAL (/usr/include/memkind.h: 93)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_LOCAL (/usr/include/memkind.h: 93)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_ALL (/usr/include/memkind.h: 100)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_ALL (/usr/include/memkind.h: 100)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_MAX_VALUE (/usr/include/memkind.h: 107)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_MAX_VALUE (/usr/include/memkind.h: 107)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 107)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 107)
PGC-S-0043-Redefinition of symbol, memkind_bits_t (/usr/include/memkind.h: 116)
PGC-S-0043-Redefinition of symbol, MEMKIND_MASK_PAGE_SIZE_2MB (/usr/include/memkind.h: 119)
PGC-S-0043-Redefinition of symbol, MEMKIND_MASK_PAGE_SIZE_2MB (/usr/include/memkind.h: 119)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 120)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 120)
PGC-W-0043-Redefinition of symbol, memkind_t (/usr/include/memkind.h: 123)
PGC-S-0043-Redefinition of symbol, memkind_const (/usr/include/memkind.h: 127)
PGC-S-0043-Redefinition of symbol, MEMKIND_MAX_KIND (/usr/include/memkind.h: 128)
PGC-S-0043-Redefinition of symbol, MEMKIND_ERROR_MESSAGE_SIZE (/usr/include/memkind.h: 130)
PGC-S-0043-Redefinition of symbol, MEMKIND_SUCCESS (/usr/include/memkind.h: 136)
PGC-S-0043-Redefinition of symbol, MEMKIND_ERROR_UNAVAILABLE (/usr/include/memkind.h: 137)
PGC-F-0008-Error limit exceeded (/usr/include/memkind.h: 137)
PGC/x86-64 Linux 18.10-1: compilation aborted
Makefile:1871: recipe for target 'mpool_memkind_component.lo' failed
make[2]: *** [mpool_memkind_component.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/global/u2/t/tkurth/src/openmpi_ucx_repro/ompi/opal/mca/mpool/memkind'
Makefile:2367: recipe for target 'install-recursive' failed
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory '/global/u2/t/tkurth/src/openmpi_ucx_repro/ompi/opal'
Makefile:1885: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

I can confirm that 3.1.x builds without problems using the same script. This one now uses master at commit 748d8b6b4bd644cfa9dc8ceb024b066d99858d73.

azrael417 commented 5 years ago

Any update on that?

hjelmn commented 5 years ago

No idea why memkind is failing there. Just configure with --with-memkind=no
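
Applied to the reproducer script from this thread, that suggestion might look like the sketch below. It reuses the configure options from the original script and only appends `--with-memkind=no`. `CONFIGURE_ARGS` is a variable name introduced here for illustration, and the fallback prefix mirrors the one in the script.

```shell
#!/bin/sh
# Sketch only: the configure line from the reproducer, with memkind disabled.
# OMPI_PREFIX, CUDA_HOME and UCX_HOME come from the environment, as in the
# original script; CONFIGURE_ARGS is a name made up for this example.
OMPI_PREFIX="${OMPI_PREFIX:-$(pwd)/install_pgi/ompi}"

CONFIGURE_ARGS="--prefix=${OMPI_PREFIX} --enable-mpirun-prefix-by-default --with-cuda=${CUDA_HOME:-} --with-ucx=${UCX_HOME:-} --with-memkind=no"

if [ -x ./configure ]; then
    # Word splitting of CONFIGURE_ARGS is intentional; paths containing
    # spaces would need a different approach.
    ./configure $CONFIGURE_ARGS
else
    # Not inside a prepared Open MPI source tree: just show the command.
    echo "./configure $CONFIGURE_ARGS"
fi
```

This only skips the memkind mpool component that failed to compile above. Whether the original KIND error from the title still appears afterwards is not established in this thread.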