Closed djfitzgerald closed 5 years ago
@mhoemmen did you encounter that issue previously?
Would it make sense to rename `*gemm` to `Tpetra_*gemm` or `Tpetra_*gemm_wrap` or something equivalent?
@trilinos/tpetra @nmhamster
@basicmanfitz Thanks for reporting! Btw were you able to build the BLAS and LAPACK wrappers in Teuchos?
@lucbv Good gods, I can't even take a day off ;-P Yes, it would absolutely make sense to rename that function. Ditto for the other BLAS and LAPACK wrappers.
@basicmanfitz The patch looks good -- thanks! Any chance you could submit the patch as a pull request against Trilinos' develop branch? That would give you documented credit for submission.
Gladly :-) And no, I have not tried the BLAS and LAPACK wrappers in Teuchos yet; I'll have to look into that today.
@basicmanfitz Thanks! :-D If you were able to build Tpetra you probably were able to build the Teuchos package, so I think it's OK. Teuchos' wrappers don't work quite the same way -- the wrapper is templated on the Scalar type, and specializations call directly into the extern "C" functions. Tpetra's wrapper more carefully isolates the implementation's header files, to avoid build issues with the cuBLAS version 1 vs. version 2 API conflict.
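A rough sketch of the templated-wrapper pattern described above may help. All names here are illustrative stand-ins, not the actual Teuchos identifiers; the real wrappers declare and link against the BLAS symbols (e.g. `dgemm_`) rather than defining a fake one:

```cpp
// Stand-in for a Fortran BLAS symbol. The real Teuchos wrappers declare
// this extern "C" and link against an actual BLAS library.
extern "C" void fake_dgemm_(const double* a, const double* b, double* c) {
  *c = (*a) * (*b);
}

// Teuchos-style wrapper: a class template on the Scalar type, with a
// full specialization per supported Scalar that calls directly into
// the extern "C" routine.
template <class Scalar>
struct BlasWrapper {
  static Scalar multiply(Scalar a, Scalar b);  // no generic definition
};

template <>
struct BlasWrapper<double> {
  static double multiply(double a, double b) {
    double c;
    fake_dgemm_(&a, &b, &c);  // pass-by-reference, Fortran style
    return c;
  }
};
```

The contrast with Tpetra's approach is that here the `extern "C"` declarations live in the same header as the template, whereas Tpetra hides the implementation headers behind a source file to avoid conflicts like the cuBLAS v1/v2 API clash mentioned above.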
@mhoemmen It looks like I did build with Teuchos, as Spack included that by default in my build. I've forked Trilinos and am testing my patch there now.
@basicmanfitz Awesome; thanks! :-D FYI we have pull request testing for GCC + x86, so as long as it works for XL, please feel free to submit the pull request.
@djfitzgerald May I close this issue? Thanks!
@mhoemmen close away :-) Or, well... I will.
@djfitzgerald Thanks for the PR fix! :D
I just realized that I closed this before I could unit-test a fix and submit a PR for it (I'd been held back by #2789). I'm reopening until I can do so.
I'm going to have to walk away from this problem. I've been unable to get a non-Spack Trilinos environment working for unit testing, and I really don't have any more time to invest in this. I have fixes already committed to Spack that apply patches to Trilinos 12.12.1 to correct this issue. Someone else should take this up and make the changes to Trilinos itself.
@nmhamster
@djfitzgerald I appreciate your help in working on this issue! I'm not sure whom you should contact to get help with this. @nmhamster might have some suggestions.
@mhoemmen @nmhamster My problems come when I try to build Zoltan. For the record, here are the steps I'm performing:

1. `cd` into my home directory, then download a `*.zip` archive of my working branch through GitHub with `wget https://github.com/djfitzgerald/Trilinos/archive/fix-2781.zip`
2. `unzip fix-2781.zip`
3. `mkdir ~/Trilinos`
4. Build and install the dependencies:
   - hdf5, installed to `/nfshome/fitzgerald/hdf5` and symlinked into `/nfshome/fitzgerald/hdf5-1.10.2/hdf5`
   - netcdf-4.6.1, configured with the command `CPPFLAGS=-I/nfshome/fitzgerald/hdf5/include LDFLAGS=-L/nfshome/fitzgerald/hdf5/lib ./configure --prefix=/nfshome/fitzgerald/netcdf`
5. `cd` into the new directory `Trilinos`
6. Run the following `cmake` command:
```shell
/$HOME/cmake-3.11.3/bin/cmake \
  -DCMAKE_C_COMPILER=/usr/bin/xlc_r \
  -DCMAKE_CXX_COMPILER=/usr/bin/xlc++_r \
  -DCMAKE_Fortran_COMPILER=/usr/bin/xlf2008_r \
  -DTPL_ENABLE_MPI=ON \
  -DTPL_BLAS_LIBRARIES='/opt/ibmmath/essl/6.1/lib64/libessl.so;/opt/ibmmath/essl/6.1/lib64/libessl6464.so;/opt/ibmmath/essl/6.1/lib64/libesslsmp6464.so;/opt/ibmmath/essl/6.1/lib64/libesslsmpcuda.so;/opt/ibmmath/essl/6.1/lib64/libesslsmp.so' \
  -DTPL_LAPACK_LIBRARIES='/usr/lib64/liblapacke.so.3.4.2;/usr/lib64/liblapack.so.3.4.2' \
  -DNetcdf_LIBRARY_DIRS='/nfshome/fitzgerald/netcdf/lib' \
  -DTPL_Netcdf_INCLUDE_DIRS='/nfshome/fitzgerald/netcdf/include' \
  -DMPI_BASE_DIR=/opt/ibm/spectrum_mpi/bin \
  -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DTPL_ENABLE_Matio=OFF \
  -DTPL_ENABLE_X11=OFF \
  -DTrilinos_ENABLE_TESTS:BOOL=ON \
  -DTrilinos_ENABLE_EXAMPLES:BOOL=ON \
  -DTrilinos_ENABLE_CXX11:BOOL=ON \
  -DTrilinos_CXX11_FLAGS:STRING=-std=gnu++11 \
  -DCMAKE_INSTALL_PREFIX=/nfshome/fitzgerald/Trilinos \
  /nfshome/fitzgerald/Trilinos-fix-2781
```

followed by:

```shell
export INCLUDE_PATH=$INCLUDE_PATH:/opt/ibm/spectrum_mpi/include/:/nfshome/fitzgerald/netcdf/include/
make -j install
```
And I end up getting a lot of errors of the type `fatal error: 'mpi.h' file not found` before my `make` fails:
```
. . . .
Error while processing /nfshome/fitzgerald/Trilinos-fix-2781/packages/zoltan/src/zz/zz_sort.c.
make[2]: *** [packages/zoltan/src/CMakeFiles/zoltan.dir/zz/zz_util.c.o] Error 1
[ 13%] Built target MueLu_ParameterList_Output_cp
make[2]: *** [packages/zoltan/src/CMakeFiles/zoltan.dir/zz/zz_sort.c.o] Error 1
1 error generated.
Error while processing /nfshome/fitzgerald/Trilinos-fix-2781/packages/zoltan/src/zz/zz_gen_files.c.
make[2]: *** [packages/zoltan/src/CMakeFiles/zoltan.dir/zz/zz_gen_files.c.o] Error 1
[ 15%] Built target Zoltan_hg_vwgt_copy_files
[ 16%] Built target exodus
1 error generated.
Error while processing /nfshome/fitzgerald/Trilinos-fix-2781/packages/zoltan/src/reftree/reftree_build.c.
make[2]: *** [packages/zoltan/src/CMakeFiles/zoltan.dir/reftree/reftree_build.c.o] Error 1
make[1]: *** [packages/zoltan/src/CMakeFiles/zoltan.dir/all] Error 2
[ 17%] Built target Zoltan_hg_felix_copy_files
[ 18%] Built target Zoltan_hg_simple_copy_files
[ 20%] Built target Zoltan_ch_bug_copy_files
[ 22%] Built target Zoltan_ch_nograph_copy_files
[ 25%] Built target Zoltan_ch_onedbug_copy_files
[ 29%] Built target Zoltan_hg_cage10_copy_files
[ 29%] Built target Zoltan_ch_vwgt_copy_files
[ 33%] Built target Zoltan_hg_ibm03_copy_files
[ 36%] Built target Zoltan_ch_ewgt_copy_files
[ 41%] Built target Zoltan_ch_grid20x19_copy_files
[ 46%] Built target Zoltan_ch_simple_copy_files
[ 57%] Built target Zoltan_ch_hammond_copy_files
make: *** [all] Error 2
[f8n10][/nfshome/fitzgerald/Trilinos]>
```
It's frustrating because this is a simple enough fix. The patch for Trilinos 12.12.1 that I wrote for Spack applies and fixes it, but I've been unable to get Trilinos working well enough for me to unit-test the fix on the latest `develop` branch code.
Can you try removing "/bin" from your MPI_BASE_DIR option when you compile?
This may allow CMake to find mpi.h in $MPI_BASE_DIR/include.
@kddevin is totally right. For many MPI implementations, if `mpiexec` etc. aren't in your PATH, it suffices to set `MPI_BASE_DIR` and nothing else (other than perhaps `TPL_ENABLE_MPI=ON`).
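Concretely, the suggestion amounts to something like the following fragment of the configure command (a sketch showing only the MPI-related options; all other options would stay as in the original command):

```shell
# Point MPI_BASE_DIR at the installation root, not its bin/ subdirectory.
# CMake can then locate $MPI_BASE_DIR/include/mpi.h and the compiler
# wrappers under $MPI_BASE_DIR/bin on its own.
cmake \
  -DTPL_ENABLE_MPI=ON \
  -DMPI_BASE_DIR=/opt/ibm/spectrum_mpi \
  ...
```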
@mhoemmen @kddevin Removing `/bin/` seemed to help somewhat, but not enough. I'm uploading the latest build log in case either of you can give me other ideas on what to do.
zoltan-failure-build.log
Thanks for trying the new MPI_BASE_DIR, @djfitzgerald. These errors are due to your configuration: other packages (not just Zoltan) are failing to find mpi.h. I am not familiar with the IBM compilers, so I'll ask a few questions.

1. In your CMakeCache.txt, is MPI_BASE_DIR set correctly (i.e., without the /bin)? If not, it may help to start fresh in a new, empty build directory with the correct `-DMPI_BASE_DIR=path_without_bin` CMake option.
2. Does the directory MPI_BASE_DIR/include exist, and does it contain the file mpi.h?
3. Does the xlc_r compiler automatically wrap MPI? That is, can you use it to build an MPI-enabled helloworld.c program (one that, say, prints the rank of each process) without having to specify include paths?
4. As a last resort, I think you can tell CMake in which directory mpi.h can be found. Usually it is MPI_BASE_DIR/include, but maybe the IBM installation is different. You'll want to be sure that the path you specify has the correct mpi.h for your MPI library: `-DTPL_MPI_INCLUDE_DIRS=path_to_mpi_include_files`
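The test program suggested in question 3 might look like the following (a minimal sketch using only standard MPI calls; compile with `mpicc helloworld.c`, or with `xlc_r` directly if it wraps MPI):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI "hello world": prints the rank of each process.
 * If xlc_r wraps MPI, this should compile without extra -I flags;
 * otherwise the same missing-mpi.h error seen in the Trilinos build
 * will show up here too, which localizes the problem to the
 * compiler/MPI setup rather than Trilinos. */
int main(int argc, char** argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello from rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}
```

Running it under `mpiexec -n 2` should print one line per rank; this requires a working MPI installation, so no output is claimed here.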
Hi @djfitzgerald, may I suggest trying the GNU Fortran compiler option "-fno-underscoring" or the XL Fortran compiler option "-qextname" to see if that makes things work more easily.
The kokkos-kernels package has taken over Tpetra's GEMM wrappers, so this issue is no longer valid. Thanks for reporting!
@trilinos/Tpetra
Expectations

An attempt to build `trilinos` with `tpetra` support using the IBM XL compiler succeeds.

Current Behavior

Using Spack, I attempted to build and install `trilinos` with `tpetra` support using the IBM XL compiler. The specific Spack command was `spack install --test=root trilinos%xl_r ~pnetcdf ^netlib-lapack+external-blas ^netcdf%gcc ^m4%gcc`, which attempts to build and install `trilinos` with the parallel IBM XL compiler, without `pnetcdf` support, with `netlib-lapack` support using an external BLAS provider (in my case, IBM ESSL), with the `netcdf` and `m4` dependencies built with gcc, and then performs any `cmake` automated testing for the `trilinos` package. Spack produces the following error output:
Motivation and Context
This prevented my team from being able to compile and install the Trilinos package with Tpetra enabled on an IBM Power9 system using the IBM XL compiler suite. Being able to do so was a hard requirement provided to us by our customer. My team had to debug the problem and develop a patch which may or may not be the optimal solution for this defect.
Definition of Done

Correct the mangling error in the C/Fortran interface code (see the "Possible Solution" section below). Spack must be able to build and install Trilinos with Tpetra support using the IBM XL compiler, and must be able to run Trilinos' automated `cmake` verification tests.

Possible Solution
The cause of this problem is that `gfortran` appends '_' to external subroutine names in `*.o` files but `xlf` does not. `dgemm` is apparently a Fortran subroutine, and `Tpetra_Details_libGemm.cpp` was developed as a C wrapper to the gemm interfaces. Because Fortran is pass-by-reference and C/C++ are pass-by-value, the wrappers in `Tpetra_Details_libGemm.cpp` allow C/C++ callers to call the gemm functions with values, and invoke the actual Fortran *gemm functions with pointers to those values. The wrapper function uses the `TPETRACORE_F77_BLAS_MANGLE` macro to determine the C-mangled name of the Fortran function. `TPETRACORE_F77_BLAS_MANGLE` is built at configure time, when `cmake` presumably figures out how the Fortran compiler builds external function names in `*.o` files. In the `gfortran` case, `TPETRACORE_DGEMM`, defined as `TPETRACORE_F77_BLAS_MANGLE(dgemm,DGEMM)`, becomes `dgemm_`, and all goes well. However, in the `xlf` case, this resolves to `dgemm` without the underbar, which happens to be the same name as the `dgemm` wrapper function defined in the `Tpetra_Details_libGemm.cpp` source file. Also, since the `TPETRACORE_DGEMM` function prototype is declared in the scope of `extern "C"`, mangling rules don't apply for the local definition either. So now you have a prototype that takes pointer arguments and a function definition that takes value arguments, which causes the error that I observed.

My fix was to redefine the C/C++ wrapper function names as `*gemm_wrap`, eliminating the possibility that the wrapper names could collide with the actual Fortran function names. This seems to get us past this problem, although I admit thorough testing has not been done, as I am not intimately familiar with Trilinos and its use cases.

Steps to Reproduce
- `. ~/spack/share/spack/setup-env.sh ; export PATH=$PATH:$HOME/spack`
- Issue `spack compilers` and verify that Spack has detected the IBM XL compilers and gcc > 4.9.0. If your gcc is older than 4.9.0, do the following:
  - 4a. `spack install gcc@5.1.0`
  - 4b. When Spack has installed gcc 5.1.0, it will display a line of output indicating where it was installed. Copy that path.
  - 4c. `spack compiler add GCCPATH`, where GCCPATH is the path you copied in step 4b.
  - 4d. Issue `spack compilers` and verify that Spack has detected gcc 5.1.0.
- `spack install trilinos%xl_r ~pnetcdf ^netlib-lapack+external-blas ^netcdf%gcc ^m4%gcc`
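The collision described under "Possible Solution" can be sketched in a few lines. The macro and function names below are hypothetical stand-ins (the real code uses `TPETRACORE_F77_BLAS_MANGLE` and the actual BLAS `dgemm`):

```cpp
// gfortran-style mangling appends an underscore; xlf-style leaves the
// name alone. These macros mimic the behavior of the configure-time
// mangling macro under each compiler.
#define MANGLE_GFORTRAN(name, NAME) name##_
#define MANGLE_XLF(name, NAME) name

// Stand-in for the Fortran routine: pass-by-reference, extern "C".
extern "C" void demo_gemm_(const double* a, const double* b, double* c) {
  *c = (*a) * (*b);
}

// The C++ wrapper takes values and forwards pointers to the Fortran
// symbol. Under MANGLE_XLF the call target would resolve to plain
// `demo_gemm` -- the same name an unrenamed wrapper would have -- so
// a pointer-argument prototype and a value-argument definition would
// collide exactly as described above. Renaming the wrapper with a
// *_wrap suffix breaks the collision under either mangling scheme.
extern "C" double demo_gemm_wrap(double a, double b) {
  double c;
  MANGLE_GFORTRAN(demo_gemm, DEMO_GEMM)(&a, &b, &c);
  return c;
}
```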
Your Environment
Related Issues
This was originally opened as Spack issue 7247.
Additional Information
My patch to fix this problem is attached, as is a version of Spack without my fix. xlf_tpetra.patch.txt spack-develop.zip