Open azrael417 opened 5 years ago
Hi, I have the same issue with my PGI compiler and Open MPI. Has this been solved yet?
Wait, are you mixing fortran compilers? That is a big no-no.
Fortran is a total PIA and needs to go away. You must build Open MPI separately for each compiler and each compiler version, as there is no guarantee of compatibility even between versions of the same compiler.
Also, why UCX on Cray? You will get much better performance with the native support in Open MPI.
On a DGX-1 we have seen that UCX gets much better performance than MVAPICH or MPICH. For the Storm system we are looking at, it might actually help, especially since it is supposed to make better use of NVLink. Where am I mixing Fortran compilers? The issue occurs if you compile Open MPI with pgfortran/pgf90 in the install step.
Ah, so you are using send/recv on GPU buffers? Haven't bothered with that for the native uGNI support as we don't have a GPU-enabled Cray to test on.
Looking at your install script I clearly see Open MPI built with gfortran not pgfortran. That would mean the bindings are built for gfortran not pgfortran.
Oh I see. You have two scripts.
Please look at the second script. In the first one I compiled with GNU and then changed the compilers in the wrapper txt files. In the second attempt I tried building natively with PGI. The second one works perfectly for 3.1.x, but not for the master branch.
Can you give the complete error?
@azrael417 Yeah, don't do the first one. That will not work. You can't mix pgfortran and gfortran. The second one should work, so there is definitely a problem there, though it could be in either PGI or Open MPI.
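For anyone checking their own install for a compiler mismatch like the one discussed above, here is a minimal sketch. It assumes an Open MPI wrapper named mpifort is on the PATH and relies on its --showme:command option, which prints the underlying compiler. The helper names wrapper_fc and check_fc are hypothetical, made up for this illustration.

```shell
# Print the Fortran compiler that the Open MPI wrapper actually invokes.
# Assumes mpifort is on PATH; prints nothing if it is not.
wrapper_fc() {
    mpifort --showme:command 2>/dev/null
}

# Compare the wrapper's compiler (second argument) against the compiler
# you intend to build your application with (first argument).
check_fc() {
    expected="$1"
    actual="$2"
    if [ "$(basename "$actual")" = "$expected" ]; then
        echo "ok: wrapper uses $expected"
    else
        echo "mismatch: wrapper uses '$actual', expected '$expected'"
        return 1
    fi
}

# Usage (uncomment on a system with Open MPI installed):
# check_fc pgfortran "$(wrapper_fc)"
```

A mismatch here would suggest the Fortran bindings were built for a different compiler than the one used for the application, which is the situation the first install script creates.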
I will reproduce the complete error. That takes a little while, please hang on.
I can't debug directly as we no longer pay for PGI on our Cray systems.
Interesting, now I get a memkind error. I think I have seen that before but don't remember how I worked around it:
CC mpool_memkind_component.lo
CC mpool_memkind_module.lo
PGC-S-0043-Redefinition of symbol, memkind_memtype_t (/usr/include/memkind.h: 44)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_DEFAULT (/usr/include/memkind.h: 49)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_DEFAULT (/usr/include/memkind.h: 49)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_HIGH_BANDWIDTH (/usr/include/memkind.h: 58)
PGC-S-0043-Redefinition of symbol, MEMKIND_MEMTYPE_HIGH_BANDWIDTH (/usr/include/memkind.h: 58)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 58)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 58)
PGC-S-0043-Redefinition of symbol, memkind_policy_t (/usr/include/memkind.h: 64)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_LOCAL (/usr/include/memkind.h: 71)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_ALL (/usr/include/memkind.h: 78)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_BIND_ALL (/usr/include/memkind.h: 78)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_PREFERRED_LOCAL (/usr/include/memkind.h: 86)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_PREFERRED_LOCAL (/usr/include/memkind.h: 86)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_LOCAL (/usr/include/memkind.h: 93)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_LOCAL (/usr/include/memkind.h: 93)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_ALL (/usr/include/memkind.h: 100)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_INTERLEAVE_ALL (/usr/include/memkind.h: 100)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_MAX_VALUE (/usr/include/memkind.h: 107)
PGC-S-0043-Redefinition of symbol, MEMKIND_POLICY_MAX_VALUE (/usr/include/memkind.h: 107)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 107)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 107)
PGC-S-0043-Redefinition of symbol, memkind_bits_t (/usr/include/memkind.h: 116)
PGC-S-0043-Redefinition of symbol, MEMKIND_MASK_PAGE_SIZE_2MB (/usr/include/memkind.h: 119)
PGC-S-0043-Redefinition of symbol, MEMKIND_MASK_PAGE_SIZE_2MB (/usr/include/memkind.h: 119)
PGC-W-0114-More than one type specified (/usr/include/memkind.h: 120)
PGC-W-0143-Useless typedef declaration (no declarators present) (/usr/include/memkind.h: 120)
PGC-W-0043-Redefinition of symbol, memkind_t (/usr/include/memkind.h: 123)
PGC-S-0043-Redefinition of symbol, memkind_const (/usr/include/memkind.h: 127)
PGC-S-0043-Redefinition of symbol, MEMKIND_MAX_KIND (/usr/include/memkind.h: 128)
PGC-S-0043-Redefinition of symbol, MEMKIND_ERROR_MESSAGE_SIZE (/usr/include/memkind.h: 130)
PGC-S-0043-Redefinition of symbol, MEMKIND_SUCCESS (/usr/include/memkind.h: 136)
PGC-S-0043-Redefinition of symbol, MEMKIND_ERROR_UNAVAILABLE (/usr/include/memkind.h: 137)
PGC-F-0008-Error limit exceeded (/usr/include/memkind.h: 137)
PGC/x86-64 Linux 18.10-1: compilation aborted
Makefile:1871: recipe for target 'mpool_memkind_component.lo' failed
make[2]: *** [mpool_memkind_component.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/global/u2/t/tkurth/src/openmpi_ucx_repro/ompi/opal/mca/mpool/memkind'
Makefile:2367: recipe for target 'install-recursive' failed
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory '/global/u2/t/tkurth/src/openmpi_ucx_repro/ompi/opal'
Makefile:1885: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1
I can confirm that 3.1.x builds without problems using the same script. This one now uses master at commit 748d8b6b4bd644cfa9dc8ceb024b066d99858d73.
Any update on that?
No idea why memkind is failing there. Just configure with --with-memkind=no
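A rough sketch of what that workaround could look like in a native PGI build follows. Only --with-memkind=no comes from this thread. The install prefixes are hypothetical placeholders, the pgcc and pgc++ compiler names are assumed to match the local PGI install, and the run helper is made up here so the commands are only printed by default.

```shell
# Hypothetical install locations; adjust to your system.
UCX_PREFIX="${UCX_PREFIX:-$HOME/install/ucx}"
OMPI_PREFIX="${OMPI_PREFIX:-$HOME/install/ompi}"

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 inside an Open MPI source tree to actually build.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Build everything, including the Fortran bindings, with PGI and
# skip the memkind mpool component that fails to compile.
run ./configure CC=pgcc CXX=pgc++ FC=pgfortran \
    --prefix="$OMPI_PREFIX" \
    --with-ucx="$UCX_PREFIX" \
    --with-memkind=no
run make -j 8
run make install
```

This sidesteps the memkind compile failure rather than explaining it, so it is a workaround and not a fix.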
Background information
I am trying to build Open MPI with UCX support using the PGI compilers, and the build fails when I run make install.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
The version I am using is the current master branch, commit 748d8b6b4bd644cfa9dc8ceb024b066d99858d73
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Here is the install script
In this version I compiled with gcc and then hacked the compiler wrapper descriptors to work with PGI:
In that case, the error specified above occurs when another application is compiled with PGI against this Open MPI build.
Please describe the system on which you are running
This is the OS info
I guess the network type etc. is not very important for this bug, but the system is essentially a Cray CS Storm system.
https://www.cray.com/products/computing/cs-series/cs-storm
Details of the problem
As a reproducer, try to run the above script with the PGI 18.10 compiler and the following modifications:
It will fail in the install stage with the error mentioned above.
If you need more information, please let me know. Maybe I am missing an essential setting as well. I think that the issue is OpenMPI related and not UCX, thus I did not provide the UCX build info. I can add that upon request though.