
Demo crash on Stampede2/KNL #2

Closed huttered40 closed 2 years ago

huttered40 commented 3 years ago

Hi, I have recently built GPTune successfully and tried to run the first demo. Unfortunately, the demo crashed. It is not apparent where the program crashed, but I suspect it has something to do with mpi4py, as I have included print statements within demo.py that are not being printed out.

I am running on the Stampede2 supercomputer at TACC on a single KNL node. I am using python3/3.7.0.

Here is the crash output:

c456-121[knl](1009)$ $MPIRUN -n 1 python3 ./demo.py -nrun 20 -ntask 5 -perfmodel 0 -optimization GPTune
TACC:  Starting up job 7531094 
TACC:  Starting parallel tasks... 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 175453 RUNNING AT c456-121
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 175453 RUNNING AT c456-121
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
TACC:  MPI job exited with code: 11 
TACC:  Shutdown complete. Exiting. 

This crash occurs regardless of the extra arguments I use (i.e., it still crashes if run with $MPIRUN -n 1 python3 ./demo.py).

Please advise how to fix this issue, if possible.

An additional thing to note:

  1. The README's demo instructions seem to be missing a sub-directory, as demo.py is located in GPTune/examples/GPTune-Demo/ rather than GPTune/examples/.
liuyangzhuan commented 3 years ago

It seems that you are using Intel MPI rather than OpenMPI. GPTune has two execution modes: MPI spawning mode (the default) and reverse communication interface (RCI) mode. The default mode requires OpenMPI (not Intel MPI, Cray MPICH, or Spectrum MPI). The RCI mode can indeed work with Intel MPI.

You can modify the script https://github.com/gptune/GPTune/blob/master/config_cori.sh to make it work on Stampede2 with OpenMPI.

If you cannot use OpenMPI, try RCI mode following the examples in Section 5.3 of the user guide: https://github.com/gptune/GPTune/blob/master/Doc/GPTune_UsersGuide.pdf
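
Also, whichever mode you end up using, it is worth double-checking which MPI the mpi4py you import was built against, since the default spawning mode goes through mpi4py. This is just a generic sanity check, not something your log shows directly:

# both commands should report the MPI installation you intend to run with,
# not another MPI stack that happens to be on the system
python3 -c "import mpi4py; print(mpi4py.get_config())"
python3 -c "from mpi4py import MPI; print(MPI.Get_library_version())"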

Thanks for the correction, I've updated the README.

huttered40 commented 3 years ago

Thanks. I am now using OpenMPI v4.1.0, which I built manually since Stampede2 does not provide an OpenMPI module. GPTune configures with the following output and builds correctly:

-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0
-- The Fortran compiler identification is GNU 6.3.0
-- Check for working C compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpicc
-- Check for working C compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpicc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpiCC
-- Check for working CXX compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpiCC -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working Fortran compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpif90
-- Check for working Fortran compiler: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpif90  -- works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpif90 supports Fortran 90
-- Checking whether /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpif90 supports Fortran 90 -- yes
-- Include /home1/05608/tg849075/GPTune/cmake/setup_external_macros.cmake
-- gptuneclcm will be built as a static library.
-- Performing Test qoptmatmul
-- Performing Test qoptmatmul - Failed
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP_Fortran: -fopenmp (found version "4.0") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for Fortran sgemm
-- Looking for Fortran sgemm - not found
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Looking for Fortran sgemm
-- Looking for Fortran sgemm - found
-- Found BLAS: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so;/opt/apps/gcc/6.3.0/lib64/libgomp.so;-lm;-ldl  
-- Using TPL_BLAS_LIBRARIES='/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so;/opt/apps/gcc/6.3.0/lib64/libgomp.so;-lm;-ldl'
-- Looking for Fortran cheev
-- Looking for Fortran cheev - found
-- A library with LAPACK API found.
-- Using TPL_LAPACK_LIBRARIES=''
-- Found MPI_C: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpicc (found version "3.1") 
-- Found MPI_CXX: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpiCC (found version "3.1") 
-- Found MPI_Fortran: /home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpif90 (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Configuring done
-- Generating done
-- Build files have been written to: /home1/05608/tg849075/GPTune/build

However, when I run the same demo $MPIRUN -n 1 python3 ./demo.py, I get the following crash, regardless of whether I use OpenMPI's mpirun or mpiexec binary:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 230771 on node c455-003 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I added print statements to demo.py and seem to have traced the crash to the line from gptune import * # import all.

Any idea what could cause this?

liuyangzhuan commented 3 years ago

I'm assuming that you've set -DTPL_SCALAPACK_LIBRARIES for the cmake call you posted. But you need to make sure that ScaLAPACK is also built as a shared library with the same OpenMPI/BLAS/LAPACK you are using. For example, you can use:

wget http://www.netlib.org/scalapack/scalapack-2.1.0.tgz
tar -xf scalapack-2.1.0.tgz
cd scalapack-2.1.0
rm -rf build
mkdir -p build
cd build
cmake .. \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_C_COMPILER=$MPICC \
    -DCMAKE_Fortran_COMPILER=$MPIF90 \
    -DCMAKE_INSTALL_PREFIX=. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=./install \
    -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
    -DCMAKE_Fortran_FLAGS="-fopenmp" \
    -DBLAS_LIBRARIES="${BLAS_LIB}" \
    -DLAPACK_LIBRARIES="${LAPACK_LIB}"
make -j32
make install
export SCALAPACK_LIB=$PWD/install/lib/libscalapack.so

and then you can pass that to build gptune.
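
As a quick sanity check before rebuilding GPTune (just a suggestion, not a step from the GPTune docs), you can confirm that the freshly built library resolves its MPI dependency against your OpenMPI install rather than Intel MPI:

# libmpi should resolve to your OpenMPI installation, and nothing should be reported as "not found"
ldd $SCALAPACK_LIB | grep -E "libmpi|not found"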

The runtime error is also due to this: from gptune import * will import lcm, which loads lib_gptuneclcm.so (which apparently failed because it could not find libscalapack.so).
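
If you want to reproduce that loading step in isolation (a sketch, assuming the library sits at $GPTUNEROOT/lib_gptuneclcm.so), you can try:

# dlopen the GPTune C library directly, bypassing the rest of the Python stack
python3 -c "import ctypes; ctypes.CDLL('$GPTUNEROOT/lib_gptuneclcm.so')"
# and list any dependencies that fail to resolve
ldd $GPTUNEROOT/lib_gptuneclcm.so | grep "not found"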

huttered40 commented 3 years ago

I should clarify: I did correctly set -DTPL_SCALAPACK_LIBRARIES, but only on a second try. On the first try there was an issue, and the configure output I posted above is from that first try, with the ScaLAPACK warnings removed.

As a sanity check, I rebuilt GPTune from scratch. However, I still hit the exact same issue at from gptune import *.

Here is the configure output when I run cmake again after a previous configure (which is why it is missing most of the diagnostics seen above):

login4.stampede2(1104)$ cmake ..     -DBUILD_SHARED_LIBS=ON     -DCMAKE_CXX_COMPILER=$MPICXX     -DCMAKE_C_COMPILER=$MPICC     -DCMAKE_Fortran_COMPILER=$MPIF90     -DCMAKE_BUILD_TYPE=Release     -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON     -DTPL_BLAS_LIBRARIES="$BLAS_LIB"     -DTPL_LAPACK_LIBRARIES="$LAPACK_LIB"     -DTPL_SCALAPACK_LIBRARIES=$SCALAPACK_LIB
-- Include /home1/05608/tg849075/GPTune/cmake/setup_external_macros.cmake
-- gptuneclcm will be built as a shared library.
-- Using TPL_BLAS_LIBRARIES='/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so;/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so;/opt/apps/gcc/6.3.0/lib64/libgomp.so;/home1/05608/tg849075/GPTune/build/-lm;/home1/05608/tg849075/GPTune/build/-ldl'
-- A library with LAPACK API found.
-- Using TPL_LAPACK_LIBRARIES=''
-- Using TPL_SCALAPACK_LIBRARIES='/home1/05608/tg849075/GPTune/scalapack-2.1.0/build/install/lib/libscalapack.so'
-- Configuring done
-- Generating done
-- Build files have been written to: /home1/05608/tg849075/GPTune/build

Is there a way I can step through the command from gptune import * so that rather than importing "*" I can import individual modules to isolate the issue? If so, can you give a list of the modules to try importing individually?

younghyunc commented 3 years ago

Hello,

Thank you for your interest in GPTune!

Regarding your issue: I would like some time to test GPTune with OpenMPI v4.1.0 (we have mostly used v4.0.1). @liuyangzhuan may have better insights on this issue.

Regarding your question: if you want, you can use the following imports instead of "from gptune import *":

from gptune import GPTune
from data import Data
from data import Categoricalnorm
from options import Options
from computer import Computer
from historydb import *
younghyunc commented 3 years ago

Also, according to your logs, the path to LAPACK (TPL_LAPACK_LIBRARIES) is missing. You can try setting it to the same value as TPL_BLAS_LIBRARIES.
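
For example, reusing the cmake invocation you posted with only the LAPACK value changed (this assumes MKL provides LAPACK through the same libraries as its BLAS, which you should confirm for your setup):

cmake .. \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_CXX_COMPILER=$MPICXX \
    -DCMAKE_C_COMPILER=$MPICC \
    -DCMAKE_Fortran_COMPILER=$MPIF90 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
    -DTPL_BLAS_LIBRARIES="$BLAS_LIB" \
    -DTPL_LAPACK_LIBRARIES="$BLAS_LIB" \
    -DTPL_SCALAPACK_LIBRARIES="$SCALAPACK_LIB"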

huttered40 commented 3 years ago

OK, thanks. I tried the separate imports and got the same crash as before on the first one: from gptune import GPTune. Regarding the missing LAPACK, I am almost positive that Intel MKL's LAPACK is provided by the same libraries as its BLAS.

@yhcho614 I am fine with waiting for further tests with OpenMPI 4.1.0, as my interests are research-oriented rather than production-oriented. Thanks!

liuyangzhuan commented 3 years ago

@huttered40 @yhcho614 I'm pretty sure the error comes from the chain from gptune import GPTune -> from model import * -> from lcm import LCM. Either you didn't install lib_gptuneclcm correctly, or some part of the runtime environment is missing or misconfigured.

  1. Your cmake configuration now looks correct. Can you check whether you have the GPTune library built at $GPTUNEROOT/lib_gptuneclcm.so? If so, what does ldd lib_gptuneclcm.so give you?
  2. Since you installed openmpi on TACC, make sure it doesn't conflict at runtime with other mpi and scalapack modules on TACC:

     export PATH=[openmpi dir]/bin:$PATH
     export MPIRUN=[openmpi dir]/bin/mpirun
     export LD_LIBRARY_PATH=[openmpi dir]/lib:$LD_LIBRARY_PATH
     export LD_LIBRARY_PATH=[lapack/blas dir]:$LD_LIBRARY_PATH
     export LD_LIBRARY_PATH=[scalapack dir]/build/install/lib/:$LD_LIBRARY_PATH  # this should precede your mkl blas/lapack path, otherwise it may use scalapack from mkl

Then, to be safe, use $MPIRUN instead of a bare mpirun to invoke the tests.

I've done a preliminary test with openmpi/4.1.0 and everything seems to work.
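
If the demo still crashes after fixing the environment, one way to confirm that the failure really is at the lcm import step (a sketch, assuming the same PYTHONPATH and environment you use for demo.py) is:

# runs just the import that sits at the bottom of the chain above
python3 -c "from lcm import LCM"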

huttered40 commented 3 years ago
login1.stampede2(1002)$ pwd
/home1/05608/tg849075/GPTune
login1.stampede2(1003)$ ldd lib_gptuneclcm.so 
    linux-vdso.so.1 =>  (0x00007ffeb8acd000)
    /opt/apps/xalt/xalt/lib64/libxalt_init.so (0x00002ae537dfb000)
    libmkl_gf_lp64.so => /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so (0x00002ae53803c000)
    libmkl_gnu_thread.so => /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so (0x00002ae538b21000)
    libmkl_core.so => /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00002ae53a234000)
    libgomp.so.1 => /opt/apps/gcc/6.3.0/lib64/libgomp.so.1 (0x00002ae53e23d000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ae53e46a000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ae53e76c000)
    libscalapack.so => /home1/05608/tg849075/GPTune/scalapack-2.1.0/build/install/lib/libscalapack.so (0x00002ae53e970000)
    libmpi.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libmpi.so.40 (0x00002ae53f0fe000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae53f42f000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ae53f64b000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ae5379d1000)
    libgfortran.so.3 => /opt/apps/gcc/6.3.0/lib64/libgfortran.so.3 (0x00002ae53fa18000)
    libmpi_usempif08.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libmpi_usempif08.so.40 (0x00002ae53fd3e000)
    libmpi_usempi_ignore_tkr.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libmpi_usempi_ignore_tkr.so.40 (0x00002ae53ff74000)
    libmpi_mpifh.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libmpi_mpifh.so.40 (0x00002ae54017c000)
    libgcc_s.so.1 => /opt/apps/gcc/6.3.0/lib64/libgcc_s.so.1 (0x00002ae5403d9000)
    libquadmath.so.0 => /opt/apps/gcc/6.3.0/lib64/libquadmath.so.0 (0x00002ae5405f0000)
    libopen-rte.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libopen-rte.so.40 (0x00002ae54082f000)
    libopen-pal.so.40 => /home1/05608/tg849075/openmpi-4.1.0/_install/lib/libopen-pal.so.40 (0x00002ae540ae7000)
    libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002ae540df9000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ae541003000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002ae54120b000)
    libz.so.1 => /lib64/libz.so.1 (0x00002ae54140e000)

Here are some of the other environment variables. Note that I have always been unloading the Intel MPI module.

login1.stampede2(1036)$ echo $MPIRUN
/home1/05608/tg849075/openmpi-4.1.0/_install/bin/mpiexec
login1.stampede2(1037)$ echo $PATH
/home1/05608/tg849075/openmpi-4.1.0/_install/lib:/opt/apps/xalt/xalt/bin:/opt/apps/intel18/python3/3.7.0/bin:/opt/apps/cmake/3.16.1/bin:/opt/apps/autotools/1.1/bin:/opt/apps/git/2.24.1/bin:/opt/intel/compilers_and_libraries_2018.2.199/linux/bin/intel64:/opt/apps/gcc/6.3.0/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/opt/apps/xsede/gsi-openssh-7.5p1b/bin:/opt/dell/srvadmin/bin:.
login1.stampede2(1038)$ echo $LD_LIBRARY_PATH
/home1/05608/tg849075/GPTune/scalapack-2.1.0/build/lib:/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin:/home1/05608/tg849075/openmpi-4.1.0/_install/lib:/opt/apps/intel18/python3/3.7.0/lib:/opt/intel/debugger_2018/libipt/intel64/lib:/opt/intel/debugger_2018/iga/lib:/opt/intel/compilers_and_libraries_2018.2.199/linux/daal/../tbb/lib/intel64_lin/gcc4.4:/opt/intel/compilers_and_libraries_2018.2.199/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.2.199/linux/tbb/lib/intel64/gcc4.7:/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.2.199/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64:/opt/apps/gcc/6.3.0/lib64:/opt/apps/gcc/6.3.0/lib:/opt/apps/xsede/gsi-openssh-7.5p1b/lib64:/opt/apps/xsede/gsi-openssh-7.5p1b/lib::

And here is the error again (regardless of whether I use OpenMPI's mpirun or mpiexec):

c455-123[knl](1001)$ $MPIRUN -n 1 python3 ./demo.py
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 245947 on node c455-123 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
liuyangzhuan commented 2 years ago

There has been no follow-up from the user for a very long time, so I'm closing this ticket now. In case anyone is interested, we suggest the lite mode of GPTune (see Section 2 of https://github.com/gptune/GPTune/blob/master/Doc/GPTune_UsersGuide.pdf) for an easier installation of GPTune.