Open t-sakashita opened 5 years ago
Tried minij_mpi_thread_single.
The following does not terminate either:
minij_mpi_thread_single eigenexa:penta 100
ELPA terminated normally.
Next, run main2, the benchmark program bundled with EigenExa.
It ran only partway through.
SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35
Node MPI
------------
r1i6n0 4
r1i6n1 4
r1i6n3 4
r1i6n4 4
INPUT FILE='IN'
======================================================
## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
Solver = eigen_s / via tri-diagonal format
Block width = 48 / 128
NUM.OF.PROCESS= 16 ( 4 4 )
NUM.OF.THREADS= 8
Matrix dimension = 10
Matrix type = 0 (Frank matrix)
Internally required memory = 3355280 [Byte]
The number of eigenvectors computed = 10
mode 'A' :: all the eigenpairs
Elapsed time = 0.331598039716482 [sec]
FLOP = 3333.33333333333
Performance = 1.005233123869897E-005 [GFLOPS]
* Since FLOPs on D&C could not be counted up correctly, above performance
is lower than the actual, which could be 10-25 % higher :
( 1.105756460223511E-005 - 1.256541404837370E-005 )
-----------------------------------------------
cond(A)=|w_max|/|w_min|= 44.7660686527145 / 0.255679562796436
= 175.086612958408
max|w(i)-w(i).true|/|w.true|= 1.222170994687292E-014 44.7660686527151
*** Eigenvalue Relative Error *** : PASSED
max|w(i)-w(i).true| = 5.471179065352771E-013 44.7660686527151
*** Eigenvalue Absolute Error *** : PASSED
-----------------------------------------------
|A|_{1}= 55.0000000000000
epsilon= 2.220446049250313E-016
max|Ax-wx|_{1}/Ne|A|_{1}= 0.280000000000000 10
*** Residual Error Test *** : PASSED
|ZZ-I|_{F}/sqrt(N)= 1.141308048004556E-015
*** Orthogonality Test *** : PASSED
======================================================
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
[the line above was printed 32 times in total]
{ 0, 0}: On entry to
DSTEQR parameter number -62 had an illegal value
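In the reference LAPACK interface DLASCL(TYPE, KL, KU, CFROM, CTO, M, N, A, LDA, INFO), parameter 4 is CFROM, and the reference implementation reports INFO = -4 when CFROM is zero or NaN. The Python sketch below mirrors only that one check, on the assumption that MKL follows the reference behaviour; dlascl_cfrom_check is a hypothetical helper name, not an MKL or LAPACK routine.

```python
import math


def dlascl_cfrom_check(cfrom: float) -> int:
    """Mimic the reference-LAPACK test on DLASCL's 4th argument (CFROM).

    Returns -4 when CFROM would be rejected (zero or NaN), otherwise 0.
    Assumption: MKL applies the same test as reference LAPACK.
    """
    if cfrom == 0.0 or math.isnan(cfrom):
        return -4
    return 0


if __name__ == "__main__":
    for value in (1.0, 0.0, float("nan")):
        print(value, dlascl_cfrom_check(value))
```

If this reading is right, a zero or NaN scaling factor is reaching DLASCL, which would point at a numerical problem upstream of the call rather than at the call itself.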
Edited the input file IN so that only its first line is active:
10 10 48 128 1 0 1 1
-1
Output:
SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35
Node MPI
------------
r1i5n8 4
r1i5n9 4
r1i5n10 4
r1i5n11 4
INPUT FILE='IN'
======================================================
## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
Solver = eigen_s / via tri-diagonal format
Block width = 48 / 128
NUM.OF.PROCESS= 16 ( 4 4 )
NUM.OF.THREADS= 8
Matrix dimension = 10
Matrix type = 0 (Frank matrix)
Internally required memory = 3355280 [Byte]
The number of eigenvectors computed = 10
mode 'A' :: all the eigenpairs
Elapsed time = 6.523737916722894E-002 [sec]
FLOP = 3333.33333333333
Performance = 5.109545134835498E-005 [GFLOPS]
* Since FLOPs on D&C could not be counted up correctly, above performance
is lower than the actual, which could be 10-25 % higher :
( 5.620499770140097E-005 - 6.386931418544373E-005 )
-----------------------------------------------
cond(A)=|w_max|/|w_min|= 44.7660686527145 / 0.255679562796436
= 175.086612958408
max|w(i)-w(i).true|/|w.true|= 1.222170994687292E-014 44.7660686527151
*** Eigenvalue Relative Error *** : PASSED
max|w(i)-w(i).true| = 5.471179065352771E-013 44.7660686527151
*** Eigenvalue Absolute Error *** : PASSED
-----------------------------------------------
|A|_{1}= 55.0000000000000
epsilon= 2.220446049250313E-016
max|Ax-wx|_{1}/Ne|A|_{1}= 0.280000000000000 10
*** Residual Error Test *** : PASSED
|ZZ-I|_{F}/sqrt(N)= 1.141308048004556E-015
*** Orthogonality Test *** : PASSED
======================================================
Benchmark completed
It terminated normally.
In no_rokko, try commenting out the call to the subroutine eigen_sx, that is, the solver call in no_rokko/dense_minij_mpi/eigen_exa.f90:
!call eigen_sx( n, n, a, nm, w, z, nm, 48, 128, 'A')
It terminated normally.
Build in Debug mode so that the debugger gets invoked.
With the following script, the run terminates normally.
DIM=100
#QSUB -queue L4cpu
#QSUB -node 1
#QSUB -mpi 1
#QSUB -omp 8
#QSUB -place distribute
It also terminated normally with DIM=10.
Next, try 4 MPI processes.
DIM=100
#QSUB -queue L4cpu
#QSUB -node 1
#QSUB -mpi 4
#QSUB -omp 8
#QSUB -place distribute
Runtime error:
mpijob CMD : mpirun -f /tmp/mpiexec.params.k001007.77375
mpijob PARAMS : -d /home/k0010/k001007/jobscript/minij_mpi/no_rokko
r1i5n10 4 omplace -nt 8 -c 0-15,20-35
"/work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa"
"100"
SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35
Node MPI
------------
r1i5n10 4
[k001007@enaga1 no_rokko]$ cat minij_mpi_eigen_exa_100_distribute_4proc.sh.e176940
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 77522, Host: r1i5n10, Program: /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa
MPT Version: HPE MPT 2.16 06/02/17 01:08:38
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/77522/exe, process 77522
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [New LWP 77550]
MPT: [New LWP 77546]
MPT: [New LWP 77542]
MPT: [New LWP 77537]
MPT: [New LWP 77534]
MPT: [New LWP 77530]
MPT: [New LWP 77526]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaac31651d9 in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT: add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0010/k001007/.gdbinit".
MPT: To completely disable this security protection add
MPT: set auto-load safe-path /
MPT: line to your configuration file "/home/k0010/k001007/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
MPT: info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0 0x00002aaac31651d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaac388f806 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffffae80 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 77522, Host: r1i5n10, Program: /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa\n\tMPT Version: HPE MPT 2.16 "...) at sig.c:339
MPT: #3 0x00002aaac388fa08 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaacf9a0500) at sig.c:488
MPT: #4 0x00002aaac388fdeb in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5 <signal handler called>
MPT: #6 load_with_acquire (location=<optimized out>)
MPT: at ../../include/tbb/tbb_machine.h:611
MPT: #7 __TBB_load_with_acquire (location=<optimized out>)
MPT: at ../../include/tbb/tbb_machine.h:714
MPT: #8 FencedLoad (location=<optimized out>)
MPT: at ../../src/tbbmalloc/Customize.h:109
MPT: #9 tryLock (this=0xbcc366526ce601c8, state=<optimized out>)
MPT: at ../../src/tbbmalloc/backend.cpp:220
MPT: #10 trySetLeftUsed (this=<optimized out>, s=<optimized out>)
MPT: at ../../src/tbbmalloc/backend.cpp:284
MPT: #11 tryLockBlock (this=<optimized out>) at ../../src/tbbmalloc/backend.cpp:291
MPT: #12 rml::internal::Backend::IndexedBins::getFromBin (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>,
MPT: binIdx=0, sync=0x2aaab27ade00 <nolibxml.1146.0.13>, size=46912627203680,
MPT: needAlignedRes=false, alignedBin=true, wait=false,
MPT: binLocked=0x7fffffffc118) at ../../src/tbbmalloc/backend.cpp:439
MPT: #13 0x00002aaab25229ed in rml::internal::Backend::IndexedBins::findBlock (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>,
MPT: nativeBin=0, sync=0x2aaab27ade00 <nolibxml.1146.0.13>,
MPT: size=46912627203680, resSlabAligned=false, alignedBin=true,
MPT: numOfLockedBins=0xae000) at ../../src/tbbmalloc/backend.cpp:823
MPT: #14 0x00002aaab2522920 in rml::internal::Backend::genericGetBlock (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, num=0,
MPT: size=46912627203584, needAlignedBlock=96)
MPT: at ../../src/tbbmalloc/backend.cpp:886
MPT: #15 0x00002aaab2523949 in rml::internal::Backend::getLargeBlock (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, size=0)
MPT: at ../../src/tbbmalloc/backend.cpp:927
MPT: #16 0x00002aaab2524720 in rml::internal::ExtMemoryPool::mallocLargeObject (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>,
MPT: pool=0x0, allocationSize=46912627203584)
MPT: at ../../src/tbbmalloc/large_objects.cpp:915
MPT: #17 0x00002aaab251c603 in rml::internal::MemoryPool::getFromLLOCache (
MPT: this=0x2aaab27b0f40 <rml::internal::defaultMemPool_space+12576>, tls=0x0,
MPT: size=46912627203584, alignment=46912627203680)
MPT: at ../../src/tbbmalloc/frontend.cpp:2255
MPT: #18 0x00002aaab251cf8b in allocateAligned (memPool=<optimized out>,
MPT: size=<optimized out>, alignment=<optimized out>)
MPT: at ../../src/tbbmalloc/frontend.cpp:2351
MPT: #19 scalable_aligned_malloc (size=46912627216192, alignment=0)
MPT: at ../../src/tbbmalloc/frontend.cpp:3048
MPT: #20 0x00002aaabb8ddae6 in for_allocate ()
MPT: from /work/k0010/k001007/rokko/petsc-3.12.0-1/Release/lib/libpetsc.so.3.12
MPT: #21 0x00002aaab387ceb3 in trbakwy4_mod_mp_eigen_common_trbakwy_ ()
MPT: from /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #22 0x00002aaab3882b7b in eigen_sx_ ()
MPT: from /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #23 0x000000000040703e in MAIN__ ()
MPT: #24 0x00000000004062ae in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 77522] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/77522/exe, process 77522
MPT: -----stack traceback ends-----
MPT: On host r1i5n10, Program /work/k0010/k001007/build/rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa, Rank 0, Process 77522: Dumping core on signal SIGSEGV(11) into directory /home/k0010/k001007/jobscript/minij_mpi/no_rokko
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
Why is libpetsc involved (MPT: from /work/k0010/k001007/rokko/petsc-3.12.0-1/Release/lib/libpetsc.so.3.12)?
According to the following line, a segmentation fault appears to have occurred:
Dumping core on signal SIGSEGV(11) into directory /home/k0010/k001007/jobscript/minij_mpi/no_rokko
Could the error be occurring when a Fortran optional argument is passed to the C binding?
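The size arguments shown in the tbbmalloc frames of the traceback are implausibly large for a problem of dimension 100. A quick check, sketched below in Python, is to print them in hexadecimal and compare them with the addresses that appear elsewhere in the same trace (this=0x2aaab27b0f40, sync=0x2aaab27ade00). The helper name as_hex and the dictionary labels are illustrative only.

```python
def as_hex(value: int) -> str:
    """Format an integer the way gdb prints addresses."""
    return f"{value:#x}"


# Decimal values copied from frames #14, #17 and #19 of the traceback.
frame_values = {
    "#19 scalable_aligned_malloc size": 46912627216192,
    "#14 genericGetBlock size": 46912627203584,
    "#17 getFromLLOCache alignment": 46912627203680,
}

if __name__ == "__main__":
    for label, value in frame_values.items():
        print(f"{label}: {as_hex(value)}")
```

The values come out as addresses in the same range as the this and sync pointers. One possible reading, not confirmed here, is that gdb is displaying stale register contents for these optimized frames, in which case the printed sizes should not be taken as the real allocation requests.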
In Debug mode with 16 MPI processes, benchmark/use_rokko/dense_minij_mpi/minij_mpi terminated normally.
So the failure cannot be debugged.
Tried eigen_s.
Created benchmark/no_rokko/dense_minij_mpi/eigen_s.f90. As with eigen_sx, it hung.
Do not use the install script bundled with Rokko.
Instead, try the configure script bundled with EigenExa.
configure generates a Makefile with the following settings.
By default, the BLACS for Intel MPI gets linked (-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl).
Rewrite all of these so that the libraries for SGI MPT are linked, as follows:
#OPT_LD_LAPACK = -I/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/include -L/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl
OPT_LD_LAPACK = -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel
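Ran the resulting executable eigenexa_benchmark.
With 16 MPI processes (4x4), it terminated normally.
The build mode should be Release.
The input file IN was used as is, without shortening it.
So the problem seems to lie in Rokko's install script. Check the options related to SGI MPT.
Try removing the following options: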
-DCMAKE_C_FLAGS="-mt" -DCMAKE_Fortran_FLAGS="-mt" \
-DMPI_C_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" -DMPI_Fortran_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" \
-DMPI_C_LIBRARIES="-mt" -DMPI_Fortran_LIBRARIES="-mt" \
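The serial compiler was being used.
Removed the options so that the parallel compiler wrappers such as mpif90 are used.
On the second line of the input file IN, the following error occurs: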
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
[the line above was printed 32 times in total]
{ 0, 0}: On entry to
DSTEQR parameter number -62 had an illegal value
Build the static library libEigenExa.a instead of a shared one.
option(BUILD_SHARED_LIBS "Build shared libraries" OFF)
The compile options added when configure was used are:
-qopenmp -Ofast -xHOST -fpp -fp-model strict -fPIC
The ScaLAPACK detected by Rokko is:
ScaLAPACK libraries: /home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.so
To be safe, explicitly specify the generic options -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel.
Following the configure case, the preprocessor definitions were reduced to the following only:
add_definitions(-DTIMER_PRINT=0)
Is using the BLAS detected by Rokko the cause?
Found BLAS: /home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so;/home/app/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_core.so;/home/app/freeware/gcc/6.3.0/lib64/libgomp.so;-lm;-ldl
Independently of Rokko, running standalone EigenExa (EigenExa-2.4b-build-Release/benchmark/main2) produced the same error.
mpijob CMD : mpirun -f /tmp/mpiexec.params.k001007.193406
mpijob PARAMS : -d /work/k0010/k001007/build/EigenExa-2.4b-build-Release/benchmark
r1i6n0 4, r1i6n1 4, r1i6n5 4, r1i6n8 4 omplace -nt 8 -c 0-15,20-35
"/work/k0010/k001007/build/EigenExa-2.4b-build-Release/benchmark/main2"
SGI MPT Placement option
--------------------------
omplace -nt 8 -c 0-15,20-35
Node MPI
------------
r1i6n0 4
r1i6n1 4
r1i6n5 4
r1i6n8 4
INPUT FILE='IN'
======================================================
## EigenExa version (2.4b) / (August 20, 2018) / (Akashi)
Solver = eigen_s / via tri-diagonal format
Block width = 48 / 128
NUM.OF.PROCESS= 16 ( 4 4 )
NUM.OF.THREADS= 8
Matrix dimension = 10
Matrix type = 0 (Frank matrix)
Internally required memory = 1661106696 [Byte]
The number of eigenvectors computed = 10
mode 'A' :: all the eigenpairs
Elapsed time = 7.078798103611916E-002 [sec]
FLOP = 3333.33333333333
Performance = 4.708897307909543E-005 [GFLOPS]
* Since FLOPs on D&C could not be counted up correctly, above performance
is lower than the actual, which could be 10-25 % higher :
( 5.179787150969358E-005 - 5.886121634886928E-005 )
-----------------------------------------------
cond(A)=|w_max|/|w_min|= 44.7660686527145 / 0.255679562796436
= 175.086612958408
max|w(i)-w(i).true|/|w.true|= 1.222170994687292E-014 44.7660686527151
*** Eigenvalue Relative Error *** : PASSED
max|w(i)-w(i).true| = 5.471179065352771E-013 44.7660686527151
*** Eigenvalue Absolute Error *** : PASSED
-----------------------------------------------
|A|_{1}= 55.0000000000000
epsilon= 2.220446049250313E-016
max|Ax-wx|_{1}/Ne|A|_{1}= 0.280000000000000 10
*** Residual Error Test *** : PASSED
|ZZ-I|_{F}/sqrt(N)= 1.141308048004556E-015
*** Orthogonality Test *** : PASSED
======================================================
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
[the line above was printed 32 times in total]
{ 0, 0}: On entry to
DSTEQR parameter number -62 had an illegal value
[k001007@enaga1 no_rokko]$ cat main2_build.e178798
=>> PBS: job killed: walltime 2423 exceeded limit 2400
MPT: Received signal 15
Libraries linked for ScaLAPACK
Try matching them to what configure uses.
/home/app/hpe/mpt/2.16/bin/mpif90 -qopenmp -O3 CMakeFiles/main2.dir/main2.f.o CMakeFiles/main2.dir/mat_set.f.o CMakeFiles/main2.dir/ev_test.f.o CMakeFiles/main2.dir/w_test.f.o -o main2 ../src/libEigenExa.a -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel -lirng -ldecimal -lcilkrts -lstdc++
The command above shows that -lstdc++ is linked.
Is the linker mpif90 or mpicc?
EigenExa-2.4/benchmark/CMakeLists.txt contains the following:
if(USE_C_LINKER)
set_target_properties(main2 PROPERTIES LINKER_LANGUAGE C)
endif(USE_C_LINKER)
mpif90 was used at link time.
Try modifying the install script bundled with Rokko.
diff --git a/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh b/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
index 9585911e..ba176b0d 100644
--- a/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
+++ b/3rd-party/install/EigenExa/intel-mkl-sgimpt.sh
@@ -17,13 +17,10 @@ for build_type in $BUILD_TYPES; do
cd EigenExa-$EIGENEXA_VERSION-build-$build_type
check cmake -DCMAKE_BUILD_TYPE=$build_type -DCMAKE_INSTALL_PREFIX=$PREFIX \
-DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 \
- -DCMAKE_C_FLAGS="-mt" -DCMAKE_Fortran_FLAGS="-mt" \
-DMPI_C_COMPILER=mpicc -DMPI_Fortran_COMPILER=mpif90 \
- -DMPI_C_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" -DMPI_Fortran_INCLUDE_PATH="/home/app/mpt/mpt-2.14-p11333/include" \
- -DMPI_C_LIBRARIES="-mt" -DMPI_Fortran_LIBRARIES="-mt" \
-DSCALAPACK_LIB="-lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 -mkl=parallel" \
$BUILD_DIR/EigenExa-$EIGENEXA_VERSION
- check make -j2
+ check make -j2 VERBOSE=1
$SUDO make install
done
Tried linking ScaLAPACK only into main2.
--- a/3rd-party/install/EigenExa/EigenExa-2.4b.patch
+++ b/3rd-party/install/EigenExa/EigenExa-2.4b.patch
@@ -196,7 +196,7 @@ diff -crN EigenExa-2.4b.orig/benchmark/CMakeLists.txt EigenExa-2.4b/benchmark/CM
+
+ add_executable(main2 main2.f mat_set.f ev_test.f w_test.f)
+ target_include_directories(main2 PUBLIC ${CMAKE_BINARY_DIR}/src/modules)
-+ target_link_libraries(main2 EigenExa)
++ target_link_libraries(main2 EigenExa ${SCALAPACK_LIBRARIES})
+
+ if(USE_C_LINKER)
+ set_target_properties(main2 PROPERTIES LINKER_LANGUAGE C)
@@ -226,7 +226,7 @@ diff -crN EigenExa-2.4b.orig/src/CMakeLists.txt EigenExa-2.4b/src/CMakeLists.txt
+ trbakwy4_body.F trbakwy4.F
+ eigen_scaling.F eigen_sx.F eigen_s.F)
+ add_library(EigenExa ${SOURCES})
-+ target_link_libraries(EigenExa ${SCALAPACK_LIBRARIES} ${MPI_Fortran_LIBRARIES})
++ !!target_link_libraries(EigenExa ${MPI_Fortran_LIBRARIES})
+ set_target_properties(EigenExa PROPERTIES Fortran_MODULE_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/modules)
+ install(TARGETS EigenExa ARCHIVE DESTINATION lib LIBRARY DESTINATION lib RUNTIME DESTINATION bin)
+ install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/modules/ DESTINATION include)
The result was the same error.
Try disabling OpenMP.
The configure build had added the compile option -fp-model strict.
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-fp-model-fp
Added the following to the CMake command-line arguments:
-DCMAKE_Fortran_FLAGS="-fp-model strict" \
The run of EigenExa-2.4b-build-Release/benchmark/main2 terminated normally.
The cause turned out to be the missing compile option -fp-model strict.
Added -DCMAKE_Fortran_FLAGS="-fp-model strict" to the install script.
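For context, -fp-model strict tells the Intel compiler not to apply value-changing floating-point optimizations, such as reassociating sums. Which specific transformation broke EigenExa in the CMake build has not been identified here. The Python sketch below only illustrates the general effect: evaluating the same sum in a different order can give a different double-precision result.

```python
# Three doubles chosen so that the order of evaluation matters.
a, b, c = 1.0e16, -1.0e16, 1.0

# Source order: the large terms cancel first, then 1.0 is added.
left_to_right = (a + b) + c

# Reassociated order, as an optimizer might evaluate it without strict
# semantics: -1e16 + 1 rounds back to -1e16, so the 1.0 is lost.
reassociated = a + (b + c)

if __name__ == "__main__":
    print("left to right:", left_to_right)
    print("reassociated: ", reassociated)
```

Iterative eigenvalue routines that rely on convergence tests or scaling factors can be sensitive to this kind of difference, which is consistent with the DLASCL and DSTEQR errors disappearing once the option was added.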
96b97b36b6e59bccd1d7077f18933c394cc79307
Compile error in Rokko
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_libs.F.o): relocation R_X86_64_32 against symbol `eigen_libs_mod_mp_eigen_version_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_sx.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_s.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_blacs.F.o): relocation R_X86_64_32 against undefined symbol `eigen_blacs_mod_mp_blacs_icontxt_for_eigen_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_devel.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_time_bcast__' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(comm.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc_redist1.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc_redist2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dc2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dcx.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(bisect.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(bisect2.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd.F.o): relocation R_X86_64_32 against symbol `eigen_devel_mod_mp_trd_comm_world_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd.F.o): relocation R_X86_64_32 against symbol `eigen_devel_mod_mp_trd_comm_world_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(trbakwy4.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_scaling.F.o): relocation R_X86_64_32 against `__STRLITPACK_83' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dlaed6_init.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdsxedc.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdstedc.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_t1.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t2.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t4.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t5.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t5x.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t6_3.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t7.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_inod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_trd_t8.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t4x.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t5.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t6_3.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_x_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t7.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(eigen_prd_t8.F.o): relocation R_X86_64_32 against undefined symbol `eigen_devel_mod_mp_y_nnod_' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(trbakwy4_body.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed0.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlasrt.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed0.F.o): relocation R_X86_64_32 against `__STRLITPACK_84' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(dlacpy.F.o): relocation R_X86_64_32 against `__STRLITPACK_1' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaedz.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed1.F.o): relocation R_X86_64_32 against `__STRLITPACK_89' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed1.F.o): relocation R_X86_64_32 against `__STRLITPACK_87' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed3.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(my_pdlaed2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaedz.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed3.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(mx_pdlaed2.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: /work/k0010/k001007/rokko/eigenexa/eigenexa-2.4b-12/Release/lib/libEigenExa.a(CSTAB.F.o): relocation R_X86_64_PC32 against symbol `get_delta_' can not be used when making a shared object。 -fPIC を付けて再コンパイルしてください。
ld: 最終リンクに失敗しました: 不正な値です
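The linker output above repeats the same relocation error for many objects inside libEigenExa.a. As a rough sketch (not part of the original build), the affected object files could be listed from a saved link log like this. The function name `pic_offenders`, the temporary log file and the `/path/to` prefix are illustrative placeholders, and the sample lines are shortened copies of the messages above.

```shell
# Hypothetical helper: list the distinct object files that ld flagged
# in a saved link log.
pic_offenders() {
  # Each message has the form "ld: <archive>(<object>): relocation ..."
  sed -n 's/^ld: .*(\([^()]*\.o\)): relocation .*/\1/p' "$1" | sort -u
}

# Sample input: shortened copies of the ld messages (placeholder paths).
log=$(mktemp)
cat > "$log" <<'EOF'
ld: /path/to/libEigenExa.a(eigen_trd_t5x.F.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object
ld: /path/to/libEigenExa.a(my_pdlaed0.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object
ld: /path/to/libEigenExa.a(my_pdlaed0.F.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object
EOF

pic_offenders "$log"
```

If every object in the archive shows up, the whole library was presumably compiled without -fPIC, rather than a single source file having been missed.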
set_property(TARGET EigenExa PROPERTY POSITION_INDEPENDENT_CODE ON)
Changes tried:
-+ add_library(EigenExa ${SOURCES})
++ add_library(EigenExa SHARED ${SOURCES})
++ set_property(TARGET EigenExa PROPERTY POSITION_INDEPENDENT_CODE ON)
https://cmake.org/cmake/help/v3.0/command/add_library.html
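Can the MODULE option of add_library be used?
The -fPIC problem above was caused by the static library libEigenExa.a being built rather than a shared one.
This was resolved by removing the following.
I confirmed that rokko/benchmark/use_rokko/dense_minij_mpi/minij_mpi.cpp finishes normally with eigenexa at size 100. submit_dense.py was used.
That run was in Debug mode.
In Release mode, EigenExa does not finish.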
./submit_dense.py eigenexa 100
tri does finish.
rokko/example/eigenexa/eigen_sx_f.f90 also fails with an error:
mpijob CMD : mpirun -f /tmp/mpiexec.params.k051500.67125
mpijob PARAMS : -d /home/k0515/k051500/jobscript/dense/example
r1i5n20 4 omplace -nt 1 -c 0-1,20-21
"/work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f"
SGI MPT Placement option
--------------------------
omplace -nt 1 -c 0-1,20-21
Node MPI
------------
r1i5n20 4
n = 8
nprocs = 4
nprow = 2
npcol = 2
[k051500@enaga1 example]$ cat f_minij_mpi_eigenexa_100_proc4.e203857
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 67240, Host: r1i5n20, Program: /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f
MPT Version: HPE MPT 2.16 06/02/17 01:08:38
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/67240/exe, process 67240
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaac3e8f18c in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT: add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: To completely disable this security protection add
MPT: set auto-load safe-path /
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
MPT: info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0 0x00002aaac3e8f18c in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaac45b9806 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffffa440 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 67240, Host: r1i5n20, Program: /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f\n\tMPT Version: HPE MPT 2.16 06/02/17 01:08:3"...) at sig.c:339
MPT: #3 0x00002aaac45b9a08 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaad1b00500) at sig.c:488
MPT: #4 0x00002aaac45b9deb in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5 <signal handler called>
MPT: #6 0x00002aaab38c6c3e in eigen_prd_t2_mod_mp_eigen_prd_au_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #7 0x00002aaab38d3daa in eigen_prd_mod_mp_eigen_prd_body_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #8 0x00002aaab38d3649 in eigen_prd_mod_mp_eigen_prd_stub_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #9 0x00002aaab2536a43 in __kmp_invoke_microtask ()
MPT: from /home/app/intel/compilers_and_libraries_2018.5.274/linux/compiler/lib/intel64/libiomp5.so
MPT: #10 0x00002aaab24fa2c6 in __kmp_fork_call (loc=0x0, gtid=0,
MPT: call_context=fork_context_gnu, argc=0, microtask=0x0, invoker=0x0,
MPT: ap=0x7fffffffc230) at ../../src/kmp_runtime.cpp:2113
MPT: #11 0x00002aaab24b9bb0 in __kmpc_fork_call (loc=0x0, argc=0, microtask=0x0)
MPT: at ../../src/kmp_csupport.cpp:365
MPT: #12 0x00002aaab38d2ebf in eigen_prd_mod_mp_eigen_prd_stub_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #13 0x00002aaab38d1e1c in eigen_prd_mod_mp_eigen_prd_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #14 0x00002aaab38dc091 in eigen_sx_ ()
MPT: from /work/k0515/k051500/rokko/eigenexa/eigenexa-2.4b-2/Release/lib/libEigenExa.so
MPT: #15 0x00000000004060ef in MAIN__ ()
MPT: #16 0x000000000040596e in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 67240] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/67240/exe, process 67240
MPT: -----stack traceback ends-----
MPT: On host r1i5n20, Program /work/k0515/k051500/build/rokko/example/eigenexa/eigen_sx_f, Rank 0, Process 67240: Dumping core on signal SIGSEGV(11) into directory /home/k0515/k051500/jobscript/dense/example
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
rokko/benchmark/no_rokko/dense_minij_mpi/eigen_exa.f90 finished normally.
I will compare the two.
Ran eigen_sx_f in Debug mode:
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 69842, Host: r1i5n4, Program: /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f
MPT Version: HPE MPT 2.16 06/02/17 01:08:38
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/69842/exe, process 69842
MPT: (no debugging symbols found)...done.
MPT: done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaacaa3718c in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
MPT: To enable execution of this file add
MPT: add-auto-load-safe-path /home/app/freeware/gcc/6.3.0/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: To completely disable this security protection add
MPT: set auto-load safe-path /
MPT: line to your configuration file "/home/k0515/k051500/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
MPT: info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libbitmask-2.0-sgi717r2.rhel74.x86_64 libcpuset-1.0-sgi717r3.rhel74.x86_64 libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.47100.x86_64 libmlx5-41mlnx1-OFED.4.7.0.3.3.47100.x86_64 libnl3-3.2.28-4.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libxcb-1.13-1.el7.x86_64 numatools-2.0-sgi717r6.rhel74.x86_64
MPT: (gdb) #0 0x00002aaacaa3718c in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaacb161806 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffff9740 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 69842, Host: r1i5n4, Program: /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f\n\tMPT Version: HPE MPT 2.16 06/02/17 01"...) at sig.c:339
MPT: #3 0x00002aaacb161a08 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaadd660500) at sig.c:488
MPT: #4 0x00002aaacb161deb in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5 <signal handler called>
MPT: #6 0x00002aaab40e31af in eigen_prd_t2_mod::eigen_prd_au (a=..., nm=80,
MPT: u_x=..., u_y=..., v_x=..., nv=48, u_t=..., v_t=..., d_t=..., i=8,
MPT: i_base=2, m=6)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd_t2.F:174
MPT: #7 0x00002aaab40f74dd in eigen_prd_mod::eigen_prd_body (a=..., nm=80,
MPT: d_out=..., e_out=..., ne=8, n=8, m_orig=48, w=..., u_x=..., u_y=...,
MPT: v_x=..., v_y=..., u_t=..., v_t=..., d_t=..., nv=48)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:493
MPT: #8 0x00002aaab40f655c in eigen_prd_mod::L_eigen_prd_mod_mp_eigen_prd_stub__280__par_region0_2_0 ()
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:281
MPT: #9 0x00002aaab2d2ea43 in __kmp_invoke_microtask ()
MPT: from /home/app/intel/compilers_and_libraries_2018.5.274/linux/compiler/lib/intel64/libiomp5.so
MPT: #10 0x00002aaab2cf22c6 in __kmp_fork_call (loc=0x7fffffffa8a8,
MPT: gtid=-1271736648, call_context=fork_context_gnu, argc=2,
MPT: microtask=0x2aaaaab11090, invoker=0x7fffffffbe10, ap=0x7fffffffbc90)
MPT: at ../../src/kmp_runtime.cpp:2113
MPT: #11 0x00002aaab2cb1bb0 in __kmpc_fork_call (loc=0x7fffffffa8a8,
MPT: argc=-1271736648, microtask=0x0) at ../../src/kmp_csupport.cpp:365
MPT: #12 0x00002aaab40f575c in eigen_prd_mod::eigen_prd_stub (a=..., nm=80,
MPT: d_out=..., e_out=..., ne=8, n=8, m_orig=48)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:280
MPT: #13 0x00002aaab40f365c in eigen_prd_mod::eigen_prd (n=8, a=..., nma0=80,
MPT: d_out=..., e_out=..., nme0=8, m_orig=48)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:134
MPT: #14 0x00002aaab41043fe in eigen_sx (n=8, nvec=8, a=..., lda=80, w=..., z=...,
MPT: ldz=80, m_forward=48, m_backward=128, mode='A', .tmp.MODE.len_V$a5=1)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_sx.F:181
MPT: #15 0x000000000040626b in main ()
MPT: at /home/k0515/k051500/development/rokko/example/eigenexa/eigen_sx_f.f90:41
MPT: #16 0x000000000040562e in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 69842] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/69842/exe, process 69842
MPT: -----stack traceback ends-----
MPT: On host r1i5n4, Program /work/k0515/k051500/build/rokko_debug/example/eigenexa/eigen_sx_f, Rank 0, Process 69842: Dumping core on signal SIGSEGV(11) into directory /home/k0515/k051500/jobscript/dense/example
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
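Because the Debug build carries source locations, the faulting line can be pulled out of a saved traceback instead of being read by eye. The following is a minimal sketch under that assumption. `first_fault_location` is a hypothetical helper, and the sample input is an excerpt of the Debug traceback above.

```shell
# Hypothetical helper: print the source location of the first frame
# below "<signal handler called>" in a saved MPT/gdb traceback.
first_fault_location() {
  awk '
    /<signal handler called>/ { seen = 1; next }
    seen && match($0, /at [^ ]+:[0-9]+$/) {
      # Drop the leading "at " and keep "<file>:<line>".
      print substr($0, RSTART + 3, RLENGTH - 3)
      exit
    }
  ' "$1"
}

# Sample input: excerpt of the Debug-mode traceback.
trace=$(mktemp)
cat > "$trace" <<'EOF'
MPT: #4 0x00002aaacb161deb in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:563
MPT: #5 <signal handler called>
MPT: #6 0x00002aaab40e31af in eigen_prd_t2_mod::eigen_prd_au (a=..., nm=80,
MPT: u_x=..., u_y=..., v_x=..., nv=48, u_t=..., v_t=..., d_t=..., i=8,
MPT: i_base=2, m=6)
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd_t2.F:174
MPT: #7 0x00002aaab40f74dd in eigen_prd_mod::eigen_prd_body (a=..., nm=80,
MPT: at /work/k0515/k051500/build/EigenExa-2.4b/src/eigen_prd.F:493
EOF

first_fault_location "$trace"
```

On this excerpt the helper reports eigen_prd_t2.F:174, the frame in eigen_prd_au. It is only meaningful for tracebacks with debug symbols. On the Release traceback the first "at" line belongs to the OpenMP runtime, not to EigenExa.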
I tried running main2, the benchmark program bundled with EigenExa.
The input file IN was rewritten as follows.
! N nvec bx by m t s e
10 10 48 128 1 0 0 1
100 100 48 128 1 0 0 1
-1
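For repeated runs with different sizes, an input file of this shape could be generated rather than edited by hand. This is only a sketch. `make_in` is a hypothetical helper, it assumes nvec equals N as in the two rows above, and it reuses the remaining column values (48 128 1 0 0 1) verbatim without interpreting them.

```shell
# Hypothetical helper: emit an IN file with one row per matrix size.
# Assumes nvec = N and copies the other columns from the rows above.
make_in() {
  echo '! N nvec bx by m t s e'
  for n in "$@"; do
    echo "$n $n 48 128 1 0 0 1"
  done
  # The list of problems is terminated by -1, as in the file above.
  echo '-1'
}

make_in 10 100
```

Redirecting the output, for example `make_in 10 100 > IN`, would reproduce the file shown above.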
Changes made to IN:
The run on enaga finished normally.
The bundled program main2 ends in a segfault.
Next, try Intel MPI with the latest Intel compiler.
module list
Currently Loaded Modulefiles:
1) intel/20.0.1 2) intel-mpi/5.1.3
./configure CC=mpiicc FC=mpiifort CFLAGS="-g" FFLAGS="-g"
Added the following to the job script.
cd ${PBS_O_WORKDIR}
. /etc/profile.d/modules.sh
module unload gcc
module unload intel/18.0.5
module unload mpt
module load intel/20.0.1
module load intel-mpi
It segfaults with Intel MPI as well.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
eigenexa_benchmar 00000000004981EA Unknown Unknown Unknown
libpthread-2.17.s 00002AAAB30CE630 Unknown Unknown Unknown
eigenexa_benchmar 0000000000459588 Unknown Unknown Unknown
eigenexa_benchmar 000000000043FEB9 Unknown Unknown Unknown
eigenexa_benchmar 000000000043E83B Unknown Unknown Unknown
eigenexa_benchmar 000000000043CB8E Unknown Unknown Unknown
eigenexa_benchmar 000000000042CA92 Unknown Unknown Unknown
eigenexa_benchmar 000000000041A79D Unknown Unknown Unknown
eigenexa_benchmar 0000000000406E62 Unknown Unknown Unknown
libc-2.17.so 00002AAAB4578545 __libc_start_main Unknown Unknown
eigenexa_benchmar 0000000000406D69 Unknown Unknown Unknown
I will report this to the developers.
Is SGI MPT required in order to get debugging information in the output?
Use SGI MPT.
Try adding the compile option -traceback.
./configure CFLAGS="-g -traceback" FFLAGS="-g -traceback"
Even with matrix size 100 the run does not finish, although the diagonalization itself has completed.