t-sakashita / rokko

Integrated Interface for libraries of eigenvalue decomposition
Boost Software License 1.0

EigenExa-2.4b does not terminate on enaga. #399

Open t-sakashita opened 5 years ago

t-sakashita commented 5 years ago

It does not terminate even with a matrix size of 100. The diagonalization itself has finished.

t-sakashita commented 5 years ago

I tried minij_mpi_thread_single.

The following does not terminate either.

minij_mpi_thread_single eigenexa:penta 100

ELPA terminated normally.

t-sakashita commented 5 years ago

Try running main2, the benchmark program bundled with EigenExa.

t-sakashita commented 5 years ago

It ran partway through.

SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35

Node   MPI
------------
r1i6n0 4
r1i6n1 4
r1i6n3 4
r1i6n4 4

 INPUT FILE='IN'
 ======================================================
 ## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
 Solver = eigen_s  / via tri-diagonal format
 Block width =           48 /         128
 NUM.OF.PROCESS=          16 (           4           4 )
 NUM.OF.THREADS=           8
 Matrix dimension =           10
 Matrix type =            0  (Frank matrix)
 Internally required memory =                3355280  [Byte]
 The number of eigenvectors computed =           10
 mode 'A' :: all the eigenpairs
 Elapsed time =   0.331598039716482       [sec]
 FLOP         =    3333.33333333333     
 Performance  =   1.005233123869897E-005  [GFLOPS]
 * Since FLOPs on D&C could not be counted up correctly, above performance
   is lower than the actual, which could be 10-25 % higher :
  (  1.105756460223511E-005 -  1.256541404837370E-005 )
 -----------------------------------------------
 cond(A)=|w_max|/|w_min|=   44.7660686527145      /  0.255679562796436     
        =   175.086612958408     
 max|w(i)-w(i).true|/|w.true|=  1.222170994687292E-014   44.7660686527151     
 *** Eigenvalue Relative Error *** : PASSED
 max|w(i)-w(i).true|         =  5.471179065352771E-013   44.7660686527151     
 *** Eigenvalue Absolute Error *** : PASSED
 -----------------------------------------------
 |A|_{1}=   55.0000000000000     
 epsilon=  2.220446049250313E-016
 max|Ax-wx|_{1}/Ne|A|_{1}=  0.280000000000000               10
 *** Residual Error Test ***   : PASSED
 |ZZ-I|_{F}/sqrt(N)=  1.141308048004556E-015
 *** Orthogonality  Test ***   : PASSED
 ======================================================

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
{    0,    0}:  On entry to 
DSTEQR parameter number  -62 had an illegal value 
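
Because the same MKL message is repeated many times, it can help to collapse a log like this before reading it. A minimal sketch using only standard tools; the function name summarize_log and any log file name passed to it are hypothetical:

```shell
# Count each distinct non-empty line of a job log, most frequent first.
summarize_log() {
  grep -v '^[[:space:]]*$' "$1" | sort | uniq -c | sort -rn
}
```

Calling summarize_log on the job's output file would then show the DLASCL message once with its count, followed by the DSTEQR message.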
t-sakashita commented 5 years ago

I edited the file IN so that only its first line is active:

 10  10 48 128 1 0 1 1
-1
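
A reduced input like this can be produced mechanically. The following is a minimal sketch, assuming each test case occupies one line of IN and that a line containing -1 ends the list, as in the file shown above; the function name trim_in_file and the file names are hypothetical.

```shell
# Keep only the first test case of a benchmark input file and close the
# list with "-1" (assumption: one case per line, "-1" terminates the input).
trim_in_file() {
  src="$1"   # original input file, e.g. a backup copy of IN
  dst="$2"   # reduced file to write
  head -n 1 "$src" > "$dst"
  echo "-1" >> "$dst"
}
```

For example, trim_in_file IN.orig IN would leave main2 with a single case to run while keeping the full input in IN.orig.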

Output:

SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35

Node   MPI
------------
r1i5n8 4
r1i5n9 4
r1i5n10 4
r1i5n11 4

 INPUT FILE='IN'
 ======================================================
 ## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
 Solver = eigen_s  / via tri-diagonal format
 Block width =           48 /         128
 NUM.OF.PROCESS=          16 (           4           4 )
 NUM.OF.THREADS=           8
 Matrix dimension =           10
 Matrix type =            0  (Frank matrix)
 Internally required memory =                3355280  [Byte]
 The number of eigenvectors computed =           10
 mode 'A' :: all the eigenpairs
 Elapsed time =   6.523737916722894E-002  [sec]
 FLOP         =    3333.33333333333     
 Performance  =   5.109545134835498E-005  [GFLOPS]
 * Since FLOPs on D&C could not be counted up correctly, above performance
   is lower than the actual, which could be 10-25 % higher :
  (  5.620499770140097E-005 -  6.386931418544373E-005 )
 -----------------------------------------------
 cond(A)=|w_max|/|w_min|=   44.7660686527145      /  0.255679562796436     
        =   175.086612958408     
 max|w(i)-w(i).true|/|w.true|=  1.222170994687292E-014   44.7660686527151     
 *** Eigenvalue Relative Error *** : PASSED
 max|w(i)-w(i).true|         =  5.471179065352771E-013   44.7660686527151     
 *** Eigenvalue Absolute Error *** : PASSED
 -----------------------------------------------
 |A|_{1}=   55.0000000000000     
 epsilon=  2.220446049250313E-016
 max|Ax-wx|_{1}/Ne|A|_{1}=  0.280000000000000               10
 *** Residual Error Test ***   : PASSED
 |ZZ-I|_{F}/sqrt(N)=  1.141308048004556E-015
 *** Orthogonality  Test ***   : PASSED
 ======================================================

 Benchmark completed

It terminated normally.

t-sakashita commented 5 years ago

In no_rokko, try commenting out the call to the subroutine eigen_sx.

t-sakashita commented 5 years ago

In no_rokko/dense_minij_mpi/eigen_exa.f90, try commenting out the solver call:

  !call eigen_sx( n, n, a, nm, w, z, nm, 48, 128, 'A')

It terminated normally.

t-sakashita commented 5 years ago

Build in Debug mode so that the debugger gets invoked.

t-sakashita commented 5 years ago

With the following script, it terminates normally.

DIM=100

#QSUB -queue L4cpu                                                                                                                    
#QSUB -node 1                                                                                                                         
#QSUB -mpi 1                                                                                                                          
#QSUB -omp 8                                                                                                                          
#QSUB -place distribute

It also terminated normally with DIM=10.

t-sakashita commented 5 years ago

Try setting the number of MPI processes to 4.

DIM=100

#QSUB -queue L4cpu                                                                                                                    
#QSUB -node 1                                                                                                                         
#QSUB -mpi 1                                                                                                                          
#QSUB -omp 8                                                                                                                          
#QSUB -place distribute   

Runtime error:

mpijob CMD    : mpirun -f /tmp/mpiexec.params.k001007.77375
mpijob PARAMS : -d /home/k0010/k001007/jobscript/minij_mpi/no_rokko
        r1i5n10 4 omplace -nt 8 -c 0-15,20-35
        "/work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa"
        "100"

SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35

Node   MPI
------------
r1i5n10 4

[k001007@enaga1 no_rokko]$ cat minij_mpi_eigen_exa_100_distribute_4proc.sh.e176940 
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
    Process ID: 77522, Host: r1i5n10, Program: /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa
    MPT Version: HPE MPT 2.16  06/02/17 01:08:38

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/77522/exe, process 77522
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [New LWP 77550]
MPT: [New LWP 77546]
MPT: [New LWP 77542]
MPT: [New LWP 77537]
MPT: [New LWP 77534]
MPT: [New LWP 77530]
MPT: [New LWP 77526]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaac31651d9 in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT:    add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0010/k001007/.gdbinit".
MPT: To completely disable this security protection add
MPT:    set auto-load safe-path /
MPT: line to your configuration file "/home/k0010/k001007/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
MPT:    info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0  0x00002aaac31651d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaac388f806 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffffae80 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 77522, Host: r1i5n10, Program: /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa\n\tMPT Version: HPE MPT 2.16 "...) at sig.c:339
MPT: #3  0x00002aaac388fa08 in first_arriver_handler (signo=signo@entry=11, 
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaacf9a0500) at sig.c:488
MPT: #4  0x00002aaac388fdeb in slave_sig_handler (signo=11, 
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5  <signal handler called>
MPT: #6  load_with_acquire (location=<optimized out>)
MPT:     at ../../include/tbb/tbb_machine.h:611
MPT: #7  __TBB_load_with_acquire (location=<optimized out>)
MPT:     at ../../include/tbb/tbb_machine.h:714
MPT: #8  FencedLoad (location=<optimized out>)
MPT:     at ../../src/tbbmalloc/Customize.h:109
MPT: #9  tryLock (this=0xbcc366526ce601c8, state=<optimized out>)
MPT:     at ../../src/tbbmalloc/backend.cpp:220
MPT: #10 trySetLeftUsed (this=<optimized out>, s=<optimized out>)
MPT:     at ../../src/tbbmalloc/backend.cpp:284
MPT: #11 tryLockBlock (this=<optimized out>) at ../../src/tbbmalloc/backend.cpp:291
MPT: #12 rml::internal::Backend::IndexedBins::getFromBin (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, 
MPT:     binIdx=0, sync=0x2aaab27ade00 <nolibxml.1146.0.13>, size=46912627203680, 
MPT:     needAlignedRes=false, alignedBin=true, wait=false, 
MPT:     binLocked=0x7fffffffc118) at ../../src/tbbmalloc/backend.cpp:439
MPT: #13 0x00002aaab25229ed in rml::internal::Backend::IndexedBins::findBlock (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, 
MPT:     nativeBin=0, sync=0x2aaab27ade00 <nolibxml.1146.0.13>, 
MPT:     size=46912627203680, resSlabAligned=false, alignedBin=true, 
MPT:     numOfLockedBins=0xae000) at ../../src/tbbmalloc/backend.cpp:823
MPT: #14 0x00002aaab2522920 in rml::internal::Backend::genericGetBlock (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, num=0, 
MPT:     size=46912627203584, needAlignedBlock=96)
MPT:     at ../../src/tbbmalloc/backend.cpp:886
MPT: #15 0x00002aaab2523949 in rml::internal::Backend::getLargeBlock (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, size=0)
MPT:     at ../../src/tbbmalloc/backend.cpp:927
MPT: #16 0x00002aaab2524720 in rml::internal::ExtMemoryPool::mallocLargeObject (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, 
MPT:     pool=0x0, allocationSize=46912627203584)
MPT:     at ../../src/tbbmalloc/large_objects.cpp:915
MPT: #17 0x00002aaab251c603 in rml::internal::MemoryPool::getFromLLOCache (
MPT:     this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, tls=0x0, 
MPT:     size=46912627203584, alignment=46912627203680)
MPT:     at ../../src/tbbmalloc/frontend.cpp:2255
MPT: #18 0x00002aaab251cf8b in allocateAligned (memPool=<optimized out>, 
MPT:     size=<optimized out>, alignment=<optimized out>)
MPT:     at ../../src/tbbmalloc/frontend.cpp:2351
MPT: #19 scalable_aligned_malloc (size=46912627216192, alignment=0)
MPT:     at ../../src/tbbmalloc/frontend.cpp:3048
MPT: #20 0x00002aaabb8ddae6 in for_allocate ()
MPT:    from /work/k0010/k001007/rokko/petsc-3.12.0-1/Release/lib/libpetsc.so.3.12
MPT: #21 0x00002aaab387ceb3 in trbakwy4_mod_mp_eigen_common_trbakwy_ ()
MPT:    from /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #22 0x00002aaab3882b7b in eigen_sx_ ()
MPT:    from /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #23 0x000000000040703e in MAIN__ ()
MPT: #24 0x00000000004062ae in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT:    Inferior 1 [process 77522] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/77522/exe, process 77522

MPT: -----stack traceback ends-----
MPT: On host r1i5n10, Program /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa, Rank 0, Process 77522: Dumping core on signal SIGSEGV(11) into directory /home/k0010/k001007/jobscript/minij_mpi/no_rokko
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
    aborting job
MPT: Received signal 11
t-sakashita commented 5 years ago

Why is MPT: from /work/k0010/k001007/rokko/petsc-3.12.0-1/Release/lib/libpetsc.so.3.12 involved?

t-sakashita commented 5 years ago

Judging from the following, a segmentation fault appears to be occurring.

Dumping core on signal SIGSEGV(11) into directory /home/k0010/k001007/jobscript/minij_mpi/no_rokko
t-sakashita commented 5 years ago

Could the error be occurring when a Fortran optional argument is passed to the C binding?

t-sakashita commented 4 years ago

In Debug mode with 16 MPI processes, benchmark/use_rokko/dense_minij_mpi/minij_mpi terminated normally.

So the problem cannot be debugged this way.

t-sakashita commented 4 years ago

I tried with eigen_s.

I created benchmark/no_rokko/dense_minij_mpi/eigen_s.f90. As with eigen_sx, it hung.

t-sakashita commented 4 years ago

Do not use the install script bundled with Rokko.

Try the configure script bundled with EigenExa instead.

configure generates the following Makefile.

By default, the BLACS for Intel MPI gets linked (-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl). Rewrite all of these so that the SGI MPT libraries are linked instead, as follows:

#OPT_LD_LAPACK = -I/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/include -L/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl
OPT_LD_LAPACK = -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel

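I ran the resulting executable eigenexa_benchmark.

With 16 = 4x4 MPI processes, it terminated normally. The build mode should be Release. The input file IN was used as is, without shortening it.

The problem seems to be in Rokko's install script. Check the options related to SGI MPT.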

t-sakashita commented 4 years ago

Try removing the following options.

      -DCMAKE_C_FLAGS="-mt" -DCMAKE_Fortran_FLAGS="-mt" \
      -DMPI_C_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" -DMPI_Fortran_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" \
      -DMPI_C_LIBRARIES="-mt" -DMPI_Fortran_LIBRARIES="-mt" \
t-sakashita commented 4 years ago

The serial compiler was being used.

t-sakashita commented 4 years ago

I removed the options so that the parallel compilers (mpif90 etc.) are used. The following error occurs at the second line of the input file IN.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
{    0,    0}:  On entry to 
DSTEQR parameter number  -62 had an illegal value 
t-sakashita commented 4 years ago

Build the static library libEigenExa.a instead of a shared one.

  option(BUILD_SHARED_LIBS "Build shared libraries" OFF)
t-sakashita commented 4 years ago

The compile options added when using configure are as follows:

-qopenmp -Ofast -xHOST -fpp  -fp-model strict -fPIC
t-sakashita commented 4 years ago

The ScaLAPACK detected by Rokko is as follows:

ScaLAPACK libraries: /home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.so

Just in case, try explicitly specifying the generic options -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel.

https://github.com/t-sakashita/rokko/wiki/IntelRokkoInstall

t-sakashita commented 4 years ago

Following the configure case, I reduced the preprocessor definitions to only the following.

add_definitions(-DTIMER_PRINT=0)
t-sakashita commented 4 years ago

Is the cause the use of the BLAS detected by Rokko?

Found BLAS: /home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_core.so;/home/app/freeware/gcc/6.3.0/lib64/libgomp.so;-lm;-ldl  
t-sakashita commented 4 years ago

Independently of Rokko, running standalone EigenExa (EigenExa-2.4b-build-Release/benchmark/main2) produced the same error.

mpijob CMD    : mpirun -f /tmp/mpiexec.params.k001007.193406
mpijob PARAMS : -d /work/k0010/k001007/build/EigenExa-2.4b-build-Release/benchmark
        r1i6n0 4, r1i6n1 4, r1i6n5 4, r1i6n8 4 omplace -nt 8 -c 0-15,20-35
        "/work/k0010/k001007/build/EigenExa-2.4b-build-Release/benchmark/main2"

SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35

Node   MPI
------------
r1i6n0 4
r1i6n1 4
r1i6n5 4
r1i6n8 4

 INPUT FILE='IN'
 ======================================================
 ## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
 Solver = eigen_s  / via tri-diagonal format
 Block width =           48 /         128
 NUM.OF.PROCESS=          16 (           4           4 )
 NUM.OF.THREADS=           8
 Matrix dimension =           10
 Matrix type =            0  (Frank matrix)
 Internally required memory =             1661106696  [Byte]
 The number of eigenvectors computed =           10
 mode 'A' :: all the eigenpairs
 Elapsed time =   7.078798103611916E-002  [sec]
 FLOP         =    3333.33333333333     
 Performance  =   4.708897307909543E-005  [GFLOPS]
 * Since FLOPs on D&C could not be counted up correctly, above performance
   is lower than the actual, which could be 10-25 % higher :
  (  5.179787150969358E-005 -  5.886121634886928E-005 )
 -----------------------------------------------
 cond(A)=|w_max|/|w_min|=   44.7660686527145      /  0.255679562796436     
        =   175.086612958408     
 max|w(i)-w(i).true|/|w.true|=  1.222170994687292E-014   44.7660686527151     
 *** Eigenvalue Relative Error *** : PASSED
 max|w(i)-w(i).true|         =  5.471179065352771E-013   44.7660686527151     
 *** Eigenvalue Absolute Error *** : PASSED
 -----------------------------------------------
 |A|_{1}=   55.0000000000000     
 epsilon=  2.220446049250313E-016
 max|Ax-wx|_{1}/Ne|A|_{1}=  0.280000000000000               10
 *** Residual Error Test ***   : PASSED
 |ZZ-I|_{F}/sqrt(N)=  1.141308048004556E-015
 *** Orthogonality  Test ***   : PASSED
 ======================================================

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
{    0,    0}:  On entry to 
DSTEQR parameter number  -62 had an illegal value 
[k001007@enaga1 no_rokko]$ cat main2_build.e178798 
=>> PBS: job killed: walltime 2423 exceeded limit 2400
MPT: Received signal 15
t-sakashita commented 4 years ago

Libraries linked for ScaLAPACK

Try matching them to what configure uses.

t-sakashita commented 4 years ago
/home/app/hpe/mpt/2.16/bin/mpif90    -qopenmp -O3 CMakeFiles/main2.dir/main2.f.o CMakeFiles/main2.dir/mat_set.f.o CMakeFiles/main2.dir/ev_test.f.o CMakeFiles/main2.dir/w_test.f.o  -o main2 ../src/libEigenExa.a -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel -lirng -ldecimal -lcilkrts -lstdc++ 

Looking at the above, -lstdc++ is being linked.

Is the linker mpif90 or mpicc?

EigenExa-2.4/benchmark/CMakeLists.txt contains the following.

if(USE_C_LINKER)
  set_target_properties(main2 PROPERTIES LINKER_LANGUAGE C)
endif(USE_C_LINKER)

mpif90 was used at link time.

t-sakashita commented 4 years ago

Try tweaking the install script shipped with Rokko.

diff --git a/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh b/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
index 9585911e..ba176b0d 100644
--- a/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
+++ b/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
@@ -17,13 +17,10 @@ for build_type in $BUILD_TYPES; do
   cd EigenExa-$EIGENEXA_VERSION-build-$build_type
   check cmake -DCMAKE_BUILD_TYPE=$build_type -DCMAKE_INSTALL_PREFIX=$PREFIX \
       -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 \
-      -DCMAKE_C_FLAGS="-mt" -DCMAKE_Fortran_FLAGS="-mt" \
       -DMPI_C_COMPILER=mpicc -DMPI_Fortran_COMPILER=mpif90 \
-      -DMPI_C_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" -DMPI_Fortran_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" \
-      -DMPI_C_LIBRARIES="-mt" -DMPI_Fortran_LIBRARIES="-mt" \
       -DSCALAPACK_LIB="-lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel" \
       $BUILD_DIR/EigenExa-$EIGENEXA_VERSION
-  check make -j2
+  check make -j2 VERBOSE=1
   $SUDO make install
 done

I changed it so that ScaLAPACK is linked only into main2.

--- a/3rd-party/install/EigenExa/EigenExa-2.4b.patch
+++ b/3rd-party/install/EigenExa/EigenExa-2.4b.patch
@@ -196,7 +196,7 @@ diff -crN EigenExa-2.4b.orig/benchmark/CMakeLists.txt EigenExa-2.4b/benchmark/CM
 + 
 + add_executable(main2 main2.f mat_set.f ev_test.f w_test.f)
 + target_include_directories(main2 PUBLIC ${CMAKE_BINARY_DIR}/src/modules)
-+ target_link_libraries(main2 EigenExa)
++ target_link_libraries(main2 EigenExa ${SCALAPACK_LIBRARIES})
 + 
 + if(USE_C_LINKER)
 +   set_target_properties(main2 PROPERTIES LINKER_LANGUAGE C)
@@ -226,7 +226,7 @@ diff -crN EigenExa-2.4b.orig/src/CMakeLists.txt EigenExa-2.4b/src/CMakeLists.txt
 + trbakwy4_body.F trbakwy4.F
 + eigen_scaling.F eigen_sx.F eigen_s.F)
 + add_library(EigenExa ${SOURCES})
-+ target_link_libraries(EigenExa ${SCALAPACK_LIBRARIES} ${MPI_Fortran_LIBRARIES})
++ !!target_link_libraries(EigenExa ${MPI_Fortran_LIBRARIES})
 + set_target_properties(EigenExa PROPERTIES Fortran_MODULE_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/modules)
 + install(TARGETS EigenExa ARCHIVE DESTINATION lib LIBRARY DESTINATION lib RUNTIME DESTINATION bin)
 + install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/modules/ DESTINATION include)

As a result, the same error occurred.

t-sakashita commented 4 years ago

Try disabling OpenMP.

t-sakashita commented 4 years ago

With configure, the compile option -fp-model strict was being added.

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-fp-model-fp

I added the following to the CMake command-line arguments.

      -DCMAKE_Fortran_FLAGS="-fp-model strict" \

The run of EigenExa-2.4b-build-Release/benchmark/main2 terminated normally. The cause turned out to be the missing compile option -fp-model strict.

t-sakashita commented 4 years ago

Added -DCMAKE_Fortran_FLAGS="-fp-model strict" to the install script. 96b97b36b6e59bccd1d7077f18933c394cc79307

t-sakashita commented 4 years ago

Compile error in Rokko

ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_libs.F.o): relocation R_X86_64_32 against symbol `eigen_libs_mod_mp_eigen_version_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_sx.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_s.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_blacs.F.o): relocation R_X86_64_32 against undefined symbol `eigen_blacs_mod_mp_blacs_icontxt_for_eigen_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_devel.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_time_bcast__' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(comm.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc_redist1.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc_redist2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dcx.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(bisect.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(bisect2.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd.F.o): relocation R_X86_64_32 against symbol `eigen_devel_mod_mp_trd_comm_world_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd.F.o): relocation R_X86_64_32 against symbol `eigen_devel_mod_mp_trd_comm_world_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(trbakwy4.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_scaling.F.o): relocation R_X86_64_32 against `__STRLITPACK_83' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dlaed6_init.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdsxedc.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdstedc.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_t1.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t2.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t4.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t5.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t5x.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t6_3.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t7.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_inod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t8.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t4x.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t5.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t6_3.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t7.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t8.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(trbakwy4_body.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed0.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlasrt.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed0.F.o): relocation R_X86_64_32 against `__STRLITPACK_84' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dlacpy.F.o): relocation R_X86_64_32 against `__STRLITPACK_1' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaedz.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed1.F.o): relocation R_X86_64_32 against `__STRLITPACK_89' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed1.F.o): relocation R_X86_64_32 against `__STRLITPACK_87' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed3.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaedz.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed3.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(CSTAB.F.o): relocation R_X86_64_PC32 against symbol `get_delta_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: final link failed: Bad value
t-sakashita commented 4 years ago

https://stackoverflow.com/questions/38296756/what-is-the-idiomatic-way-in-cmake-to-add-the-fpic-compiler-option

set_property(TARGET EigenExa PROPERTY POSITION_INDEPENDENT_CODE ON)

Changes tried:

-+ add_library(EigenExa ${SOURCES})
++ add_library(EigenExa SHARED ${SOURCES})
++ set_property(TARGET EigenExa PROPERTY POSITION_INDEPENDENT_CODE ON)
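To check whether a rebuild actually clears the errors, the ld messages quoted above can be condensed into a list of the objects that were rejected. The following is only a sketch: the function name `list_non_pic_objects` is made up for illustration, and it assumes the log keeps the `libEigenExa.a(object): relocation TYPE against ...` layout seen in this issue.

```shell
#!/bin/sh
# Hypothetical helper: read an ld log on stdin and print one
# "object relocation-type" pair per rejected object file.
list_non_pic_objects() {
  # Capture the archive member name inside (...) and the relocation type,
  # then drop duplicates (an object may be reported more than once).
  sed -n 's/.*(\([^)]*\)): relocation \([A-Z0-9_]*\) against .*/\1 \2/p' | sort -u
}
```

Usage would look like `list_non_pic_objects < build.log`, where `build.log` is a placeholder for wherever the link output was saved. An empty result after rebuilding suggests that no object was rejected for lacking -fPIC.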
t-sakashita commented 4 years ago

https://cmake.org/cmake/help/v3.0/command/add_library.html

Could the MODULE option of add_library be used?

t-sakashita commented 4 years ago

The -fPIC problem above occurred because a static library, libEigenExa.a, was being built instead of a shared one.

This was resolved by removing the following.

t-sakashita commented 4 years ago

Confirmed that rokko/benchmark/use_rokko/dense_minij_mpi/minij_mpi.cpp terminates normally with eigenexa at matrix size 100. submit_dense.py was used.

This was in Debug mode.

t-sakashita commented 4 years ago

In Release mode, EigenExa does not terminate.

./submit_dense.py eigenexa 100
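Since the symptom here is a run that never finishes, it can help to put a wall-clock limit around the launch command so that a hang is reported instead of using up the whole job allocation. This is a sketch only: `run_with_limit` is a hypothetical name, and it relies on the coreutils `timeout` command, which exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit and report
# whether it finished or had to be stopped.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
  limit=$1
  shift
  status=0
  timeout "$limit" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    # coreutils timeout returns 124 when the command was still running
    echo "TIMEOUT after ${limit}s: $*"
  else
    echo "exit status $status: $*"
  fi
  return "$status"
}
```

Inside a job script, the launch line would be wrapped as `run_with_limit 600 <launch command>`. The limit of 600 seconds is an arbitrary example value.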
t-sakashita commented 4 years ago

With tri, the run terminates.

t-sakashita commented 4 years ago

rokko/example/eigenexa/eigen_sx_f.f90 also fails with an error.

mpijob CMD    : mpirun -f /tmp/mpiexec.params.k051500.67125
mpijob PARAMS : -d /home/k0515/k051500/jobscript/dense/example
        r1i5n20 4 omplace -nt 1 -c 0-1,20-21
        "/work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f"

SGI MPT Placement option
--------------------------
omplace -nt 1 -c 0-1,20-21

Node   MPI
------------
r1i5n20 4

 n =           8
 nprocs =           4
 nprow =           2
 npcol =           2
[k051500@enaga1 example]$ cat f_minij_mpi_eigenexa_100_proc4.e203857
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
    Process ID: 67240, Host: r1i5n20, Program: /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f
    MPT Version: HPE MPT 2.16  06/02/17 01:08:38

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/67240/exe, process 67240
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaac3e8f18c in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT:    add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: To completely disable this security protection add
MPT:    set auto-load safe-path /
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
MPT:    info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0  0x00002aaac3e8f18c in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaac45b9806 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffffa440 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 67240, Host: r1i5n20, Program: /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f\n\tMPT Version: HPE MPT 2.16  06/02/17 01:08:3"...) at sig.c:339
MPT: #3  0x00002aaac45b9a08 in first_arriver_handler (signo=signo@entry=11, 
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaad1b00500) at sig.c:488
MPT: #4  0x00002aaac45b9deb in slave_sig_handler (signo=11, 
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5  <signal handler called>
MPT: #6  0x00002aaab38c6c3e in eigen_prd_t2_mod_mp_eigen_prd_au_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #7  0x00002aaab38d3daa in eigen_prd_mod_mp_eigen_prd_body_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #8  0x00002aaab38d3649 in eigen_prd_mod_mp_eigen_prd_stub_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #9  0x00002aaab2536a43 in __kmp_invoke_microtask ()
MPT:    from /home/app/intel/compilers_and_libraries_2018.5.274/linux/compiler/lib/intel64/libiomp5.so
MPT: #10 0x00002aaab24fa2c6 in __kmp_fork_call (loc=0x0, gtid=0, 
MPT:     call_context=fork_context_gnu, argc=0, microtask=0x0, invoker=0x0, 
MPT:     ap=0x7fffffffc230) at ../../src/kmp_runtime.cpp:2113
MPT: #11 0x00002aaab24b9bb0 in __kmpc_fork_call (loc=0x0, argc=0, microtask=0x0)
MPT:     at ../../src/kmp_csupport.cpp:365
MPT: #12 0x00002aaab38d2ebf in eigen_prd_mod_mp_eigen_prd_stub_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #13 0x00002aaab38d1e1c in eigen_prd_mod_mp_eigen_prd_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #14 0x00002aaab38dc091 in eigen_sx_ ()
MPT:    from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #15 0x00000000004060ef in MAIN__ ()
MPT: #16 0x000000000040596e in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT:    Inferior 1 [process 67240] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/67240/exe, process 67240

MPT: -----stack traceback ends-----
MPT: On host r1i5n20, Program /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f, Rank 0, Process 67240: Dumping core on signal SIGSEGV(11) into directory /home/k0515/k051500/jobscript/dense/example
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
    aborting job
MPT: Received signal 11
t-sakashita commented 4 years ago

rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa.f90 terminated normally.

I will compare the two.

t-sakashita commented 4 years ago

Ran it in Debug mode.

MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
    Process ID: 69842, Host: r1i5n4, Program: /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f
    MPT Version: HPE MPT 2.16  06/02/17 01:08:38

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/69842/exe, process 69842
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaacaa3718c in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT:    add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: To completely disable this security protection add
MPT:    set auto-load safe-path /
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
MPT:    info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0  0x00002aaacaa3718c in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaacb161806 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffff9740 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 69842, Host: r1i5n4, Program: /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f\n\tMPT Version: HPE MPT 2.16  06/02/17 01"...) at sig.c:339
MPT: #3  0x00002aaacb161a08 in first_arriver_handler (signo=signo@entry=11, 
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaadd660500) at sig.c:488
MPT: #4  0x00002aaacb161deb in slave_sig_handler (signo=11, 
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5  <signal handler called>
MPT: #6  0x00002aaab40e31af in eigen_prd_t2_mod::eigen_prd_au (a=..., nm=80, 
MPT:     u_x=..., u_y=..., v_x=..., nv=48, u_t=..., v_t=..., d_t=..., i=8, 
MPT:     i_base=2, m=6)
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd_t2.F:174
MPT: #7  0x00002aaab40f74dd in eigen_prd_mod::eigen_prd_body (a=..., nm=80, 
MPT:     d_out=..., e_out=..., ne=8, n=8, m_orig=48, w=..., u_x=..., u_y=..., 
MPT:     v_x=..., v_y=..., u_t=..., v_t=..., d_t=..., nv=48)
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:493
MPT: #8  0x00002aaab40f655c in eigen_prd_mod::L_eigen_prd_mod_mp_eigen_prd_stub__280__par_region0_2_0 ()
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:281
MPT: #9  0x00002aaab2d2ea43 in __kmp_invoke_microtask ()
MPT:    from /home/app/intel/compilers_and_libraries_2018.5.274/linux/compiler/lib/intel64/libiomp5.so
MPT: #10 0x00002aaab2cf22c6 in __kmp_fork_call (loc=0x7fffffffa8a8, 
MPT:     gtid=-1271736648, call_context=fork_context_gnu, argc=2, 
MPT:     microtask=0x2aaaaab11090, invoker=0x7fffffffbe10, ap=0x7fffffffbc90)
MPT:     at ../../src/kmp_runtime.cpp:2113
MPT: #11 0x00002aaab2cb1bb0 in __kmpc_fork_call (loc=0x7fffffffa8a8, 
MPT:     argc=-1271736648, microtask=0x0) at ../../src/kmp_csupport.cpp:365
MPT: #12 0x00002aaab40f575c in eigen_prd_mod::eigen_prd_stub (a=..., nm=80, 
MPT:     d_out=..., e_out=..., ne=8, n=8, m_orig=48)
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:280
MPT: #13 0x00002aaab40f365c in eigen_prd_mod::eigen_prd (n=8, a=..., nma0=80, 
MPT:     d_out=..., e_out=..., nme0=8, m_orig=48)
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:134
MPT: #14 0x00002aaab41043fe in eigen_sx (n=8, nvec=8, a=..., lda=80, w=..., z=..., 
MPT:     ldz=80, m_forward=48, m_backward=128, mode='A', .tmp.MODE.len_V$a5=1)
MPT:     at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_sx.F:181
MPT: #15 0x000000000040626b in main ()
MPT:     at /home/k0515/k051500/development/rokko/example/eigenexa/eigen_sx_f.f90:41
MPT: #16 0x000000000040562e in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT:    Inferior 1 [process 69842] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/69842/exe, process 69842

MPT: -----stack traceback ends-----
MPT: On host r1i5n4, Program /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f, Rank 0, Process 69842: Dumping core on signal SIGSEGV(11) into directory /home/k0515/k051500/jobscript/dense/example
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
    aborting job
MPT: Received signal 11
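The Debug traceback above contains file and line information, but it is buried in the gdb output. A small filter can pull out just the source locations, innermost frame first. This is a sketch: `list_source_frames` is a hypothetical name, and it only matches locations printed as `at /absolute/path:line`, so frames from the MPT and OpenMP runtimes (printed with relative paths such as `sig.c` or `../../src/...`) are skipped.

```shell
#!/bin/sh
# Hypothetical helper: read an MPT/gdb traceback on stdin and print the
# absolute-path source locations (file:line) in the order they appear.
list_source_frames() {
  grep -o 'at /[^ ]*:[0-9][0-9]*' | sed 's/^at //'
}
```

Applied to the traceback in this comment, the first location printed would be the `eigen_prd_t2.F:174` frame, which is where the SIGSEGV was raised.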
t-sakashita commented 4 years ago

I ran main2, the benchmark program bundled with EigenExa. The input file IN was rewritten as follows.

!   N  nvec bx  by m t s e
 10  10 48 128 1 0 0 1
 100  100 48 128 1 0 0 1
-1

Changes made above:

The run on enaga terminated normally.
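To repeat the main2 run for other matrix sizes, the IN file can be generated rather than edited by hand. The sketch below reproduces the layout shown in this comment: a header line, one line per run, and a closing `-1`. The name `write_in_file` is hypothetical. It assumes nvec equals N, as in the lines above, and copies the remaining columns (`48 128 1 0 0 1`) unchanged from this issue, because their meaning is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: write an EigenExa benchmark input file.
# Usage: write_in_file OUTPUT_PATH N [N...]
write_in_file() {
  out=$1
  shift
  {
    printf '%s\n' '!   N  nvec bx  by m t s e'
    for n in "$@"; do
      # nvec is set equal to N; the other columns are copied from the issue
      printf ' %s  %s 48 128 1 0 0 1\n' "$n" "$n"
    done
    printf '%s\n' '-1'
  } > "$out"
}
```

For example, `write_in_file IN 10 100` reproduces the file shown in this comment.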

t-sakashita commented 4 years ago

Bundled program main2

It ends in a seg fault.

t-sakashita commented 4 years ago

Try with Intel MPI and the latest Intel compiler.

module list
Currently Loaded Modulefiles:
  1) intel/20.0.1      2) intel-mpi/5.1.3
./configure CC=mpiicc FC=mpiifort CFLAGS="-g" FFLAGS="-g"

Added the following to the job script.

cd ${PBS_O_WORKDIR}
. /etc/profile.d/modules.sh
module unload gcc
module unload intel/18.0.5
module unload mpt
module load intel/20.0.1
module load intel-mpi
t-sakashita commented 4 years ago

Seg fault with Intel as well.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
eigenexa_benchmar  00000000004981EA  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AAAB30CE630  Unknown               Unknown  Unknown
eigenexa_benchmar  0000000000459588  Unknown               Unknown  Unknown
eigenexa_benchmar  000000000043FEB9  Unknown               Unknown  Unknown
eigenexa_benchmar  000000000043E83B  Unknown               Unknown  Unknown
eigenexa_benchmar  000000000043CB8E  Unknown               Unknown  Unknown
eigenexa_benchmar  000000000042CA92  Unknown               Unknown  Unknown
eigenexa_benchmar  000000000041A79D  Unknown               Unknown  Unknown
eigenexa_benchmar  0000000000406E62  Unknown               Unknown  Unknown
libc-2.17.so       00002AAAB4578545  __libc_start_main     Unknown  Unknown
eigenexa_benchmar  0000000000406D69  Unknown               Unknown  Unknown
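The forrtl traceback above shows only `Unknown` for routine and source, but the PC column can still be mapped back to source lines when the binary carries debug information. As a sketch, the filter below extracts the PC values for one image from such a traceback. `extract_pcs` is a hypothetical name, and the image name is passed exactly as it appears in the first column, which forrtl truncates.

```shell
#!/bin/sh
# Hypothetical helper: read a forrtl traceback on stdin and print the PC
# values (as 0x-prefixed hex) for frames belonging to the given image.
# Usage: extract_pcs IMAGE_NAME_AS_PRINTED
extract_pcs() {
  image=$1
  awk -v img="$image" '$1 == img && $2 ~ /^[0-9A-Fa-f]+$/ { print "0x" $2 }'
}
```

The addresses can then be piped to binutils `addr2line`, for instance `extract_pcs eigenexa_benchmar < job.err | addr2line -f -e /path/to/benchmark_binary`. Both `job.err` and the binary path are placeholders. This is only informative if the binary was built with `-g`.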
t-sakashita commented 4 years ago

I will report this to the developers.

t-sakashita commented 4 years ago

Is SGI MPT required in order to get debugging information in the output?

Use SGI MPT. Try adding the compile option -traceback.

./configure CFLAGS="-g -traceback" FFLAGS="-g -traceback"