underworldcode / UWGeodynamics

Underworld Geodynamics

Errors on Raijin #135

Closed btokay closed 5 years ago

btokay commented 5 years ago

Hi,

I completed the installation of underworld and UWGeodynamics on Raijin, and both python3 -c "import underworld" and python3 -c "import UWGeodynamics" ran without errors. But after logging out and back in, running a UWGeodynamics job produced the following errors:

Traceback (most recent call last):
  File "./3DLithosphericModel.py", line 9, in <module>
    import UWGeodynamics as GEO
  File "/home/565/bt1752/python-3.6.2-venv/lib/python3.6/site-packages/UWGeodynamics/__init__.py", line 8, in <module>
    raise ImportError("Can not find Underworld, please check your installation")
ImportError: Can not find Underworld, please check your installation

Traceback (most recent call last):
  File "/short/n69/bt1752/Underworld/libUnderworld/libUnderworldPy/StGermain.py", line 18, in swig_import_helper
    return importlib.import_module(mname)
  File "/home/565/bt1752/python-3.6.2-venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 948, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'libUnderworld.libUnderworldPy._StGermain'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/565/bt1752/python-3.6.2-venv/lib/python3.6/site-packages/UWGeodynamics/__init__.py", line 5, in <module>
    import underworld
  File "/short/n69/bt1752/Underworld/underworld/__init__.py", line 91, in <module>
    import libUnderworld
  File "/short/n69/bt1752/Underworld/libUnderworld/__init__.py", line 1, in <module>
    from .libUnderworldPy import *
  File "/short/n69/bt1752/Underworld/libUnderworld/libUnderworldPy/__init__.py", line 2, in <module>
    from . import StGermain
  File "/short/n69/bt1752/Underworld/libUnderworld/libUnderworldPy/StGermain.py", line 21, in <module>
    _StGermain = swig_import_helper()
  File "/short/n69/bt1752/Underworld/libUnderworld/libUnderworldPy/StGermain.py", line 20, in swig_import_helper
    return importlib.import_module('_StGermain')
  File "/home/565/bt1752/python-3.6.2-venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /apps/openmpi/3.1.2/lib/Intel/libmpi_mpifh.so.40: undefined symbol: ompi_mpi_state

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./3DLithosphericModel.py", line 9, in <module>
    import UWGeodynamics as GEO
  File "/home/565/bt1752/python-3.6.2-venv/lib/python3.6/site-packages/UWGeodynamics/__init__.py", line 8, in <module>
    raise ImportError("Can not find Underworld, please check your installation")
ImportError: Can not find Underworld, please check your installation

mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[4283,1],7]
Exit code: 1

What am I missing? Do you have any suggestions?

Best regards, Bülent

julesghub commented 5 years ago

If you logged out then you'll need the LD_PRELOAD env var set again. Have you done that?

btokay commented 5 years ago

Yes, I had set LD_PRELOAD beforehand, since I needed it to check whether import underworld and import UWGeodynamics raised any errors in Python 3. Unfortunately, unless LD_PRELOAD is set before running a job, neither import works.

julesghub commented 5 years ago

Yes, that's correct. Set the LD_PRELOAD env var in your PBS script or prepend it to your execution command.
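
A quick way to confirm, from inside the job itself, that LD_PRELOAD actually reached the Python process and that both imports succeed is a small diagnostic like the sketch below. The function name `check_imports` and the report layout are illustrative assumptions, not part of Underworld or UWGeodynamics.

```python
import importlib
import os


def check_imports(modules=("underworld", "UWGeodynamics")):
    """Report the LD_PRELOAD value seen by this process and whether each module imports.

    `check_imports` is a hypothetical helper written for this thread.
    """
    # An empty string means LD_PRELOAD was not set in this environment.
    report = {"LD_PRELOAD": os.environ.get("LD_PRELOAD", "")}
    for name in modules:
        try:
            importlib.import_module(name)
            report[name] = "ok"
        except ImportError as exc:
            # ModuleNotFoundError is a subclass of ImportError, so both are caught here.
            report[name] = "failed: {}".format(exc)
    return report


if __name__ == "__main__":
    for key, value in check_imports().items():
        print("{}: {}".format(key, value))
```

Running this with the same launch command as the model script (for example as the first step of the PBS job) shows whether the environment seen by the MPI processes matches the one in the login shell.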

btokay commented 5 years ago

Hi Julian, Despite setting the LD_PRELOAD env var, I frequently got the same errors. For that reason, I reinstalled UWGeodynamics with newer versions of Python 3 (3.6.7) and PETSc (3.9.4), but now different errors appear:

[46]PETSC ERROR: ------------------------------------------------------------------------
[46]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[46]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[46]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[46]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[46]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[46]PETSC ERROR: to get more information on the crash.
[46]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[46]PETSC ERROR: Signal received
[46]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[46]PETSC ERROR: Petsc Release Version 3.9.4, Sep, 11, 2018
[46]PETSC ERROR: ./3DExtension.py on a named r2358 by bt1752 Sat Sep 7 23:29:00 2019
[46]PETSC ERROR: Configure options --with-shared-libraries=1 --prefix=/apps/petsc/3.9.4 --with-blaslapack-lib="-L/apps/intel-ct/2018.3.222/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread" --with-hypre --download-hypre=yes --with-ptscotch=1 --download-ptscotch=yes --download-ml=yes --download-mumps=yes --with-mumps=1 --with-scalapack=1 --with-scalapack-lib="/apps/intel-ct/2018.3.222/mkl/lib/intel64/libmkl_scalapack_lp64.a /apps/intel-ct/2018.3.222/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a" --with-debugging=0 --with-mpi=1 --with-parmetis=1 --download-parmetis=yes --with-metis=1 --download-metis=yes --with-superlu=1 --download-superlu=yes --with-superlu_dist=1 --download-superlu_dist=yes --with-necdf=1 --with-hdf5=1 --with-fftw=1 --with-fftw-include=/apps/intel-ct/2018.3.222/mkl/include/fftw --with-fftw-lib="/apps/fftw3-mkl/2018.3.222/lib/libfftw3x_cdft_openmpi3_lp64.a /apps/fftw3-mkl/2018.3.222/lib/libfftw3xc_intel.a" CC=mpicc CXX=mpicxx FC=mpif90 F77=mpif77 F90=mpif90 --CFLAGS=-mkl --CXXFLAGS=-mkl --download-ctetgen=yes --download-triangle=yes
[46]PETSC ERROR: #1 User provided function() line 0 in unknown file
[r2358:12107] [0] func:/apps/openmpi/3.1.3/lib/libopen-pal.so.40(opal_backtrace_buffer+0x35) [0x2afded25aefd]
[r2358:12107] [1] func:/apps/openmpi/3.1.3/lib/libmpi.so(ompi_mpi_abort+0xa9) [0x2afdeb8dedf1]
[r2358:12107] [2] func:/apps/openmpi/3.1.3/lib/libmpi.so(ompi_errhandler_callback+0x5b) [0x2afdeb8cd2fe]
[r2358:12107] [3] func:/apps/openmpi/3.1.3/lib/openmpi/mca_pmix_ext2x.so(+0x503e) [0x2afdf0b0503e]
[r2358:12107] [4] func:/apps/libevent/2.1.8/lib/libevent-2.1.so.6(+0x2122b) [0x2afdee08022b]
[r2358:12107] [5] func:/apps/libevent/2.1.8/lib/libevent-2.1.so.6(event_base_loop+0x34f) [0x2afdee0809af]
[r2358:12107] [6] func:/apps/openmpi/3.1.3/lib/libopen-pal.so.40(+0x324c2) [0x2afded2104c2]
[r2358:12107] [7] func:/lib64/libpthread.so.0(+0x7aa1) [0x2afdec0d2aa1]
[r2358:12107] [8] func:/lib64/libc.so.6(clone+0x6d) [0x2afdecc63c4d]
[47]PETSC ERROR: ------------------------------------------------------------------------
[47]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[47]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[47]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[47]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[47]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[47]PETSC ERROR: to get more information on the crash.
[47]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[47]PETSC ERROR: Signal received
[47]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[47]PETSC ERROR: Petsc Release Version 3.9.4, Sep, 11, 2018
[47]PETSC ERROR: ./3DExtension.py on a named r2358 by bt1752 Sat Sep 7 23:29:00 2019
[47]PETSC ERROR: Configure options --with-shared-libraries=1 --prefix=/apps/petsc/3.9.4 --with-blaslapack-lib="-L/apps/intel-ct/2018.3.222/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread" --with-hypre --download-hypre=yes --with-ptscotch=1 --download-ptscotch=yes --download-ml=yes --download-mumps=yes --with-mumps=1 --with-scalapack=1 --with-scalapack-lib="/apps/intel-ct/2018.3.222/mkl/lib/intel64/libmkl_scalapack_lp64.a /apps/intel-ct/2018.3.222/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a" --with-debugging=0 --with-mpi=1 --with-parmetis=1 --download-parmetis=yes --with-metis=1 --download-metis=yes --with-superlu=1 --download-superlu=yes --with-superlu_dist=1 --download-superlu_dist=yes --with-necdf=1 --with-hdf5=1 --with-fftw=1 --with-fftw-include=/apps/intel-ct/2018.3.222/mkl/include/fftw --with-fftw-lib="/apps/fftw3-mkl/2018.3.222/lib/libfftw3x_cdft_openmpi3_lp64.a /apps/fftw3-mkl/2018.3.222/lib/libfftw3xc_intel.a" CC=mpicc CXX=mpicxx FC=mpif90 F77=mpif77 F90=mpif90 --CFLAGS=-mkl --CXXFLAGS=-mkl --download-ctetgen=yes --download-triangle=yes
[47]PETSC ERROR: #1 User provided function() line 0 in unknown file
[r2358:12108] [0] func:/apps/openmpi/3.1.3/lib/libopen-pal.so.40(opal_backtrace_buffer+0x35) [0x2b814a2fcefd]
[r2358:12108] [1] func:/apps/openmpi/3.1.3/lib/libmpi.so(ompi_mpi_abort+0xa9) [0x2b8148980df1]
[r2358:12108] [2] func:/apps/openmpi/3.1.3/lib/libmpi.so(ompi_errhandler_callback+0x5b) [0x2b814896f2fe]
[r2358:12108] [3] func:/apps/openmpi/3.1.3/lib/openmpi/mca_pmix_ext2x.so(+0x503e) [0x2b814dba703e]
[r2358:12108] [4] func:/apps/libevent/2.1.8/lib/libevent-2.1.so.6(+0x2122b) [0x2b814b12222b]
[r2358:12108] [5] func:/apps/libevent/2.1.8/lib/libevent-2.1.so.6(event_base_loop+0x34f) [0x2b814b1229af]
[r2358:12108] [6] func:/apps/openmpi/3.1.3/lib/libopen-pal.so.40(+0x324c2) [0x2b814a2b24c2]
[r2358:12108] [7] func:/lib64/libpthread.so.0(+0x7aa1) [0x2b8149174aa1]
[r2358:12108] [8] func:/lib64/libc.so.6(clone+0x6d) [0x2b8149d05c4d]

mpiexec noticed that process rank 53 with PID 3542 on node r2393 exited on signal 9 (Killed).

[r1286:31581] 119 more processes have sent help message help-mpi-api.txt / mpi-abort
[r1286:31581] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[r1286:31581] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I don't think this is a problem with the UWGeodynamics installation itself. Am I right? If so, what should I do? ¯\_(ツ)_/¯

Cheers,

rbeucher commented 5 years ago

OK. I will update the installation guide in the documentation. It currently says to use a virtual environment, but that causes problems with PETSc and the new UW 2.8 installation.

rbeucher commented 5 years ago

Hi @btokay @julesghub

Have you solved the problem?

Romain

btokay commented 5 years ago

Hi Romain, Apologies for the late reply! Yes, solved! The problem was caused by the "mumps" solver type. Other solver options should be used when running a UWG job on Raijin. Thanks @jmansour for your help! Bülent
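
For readers landing here with the same crash, the change described above can be sketched as follows. This assumes the model exposes Underworld's Stokes solver as `Model.solver` and that it provides `set_inner_method`; the wrapper `use_inner_method` is a hypothetical helper, and "mg" is only an example value, since the thread does not say which alternative was chosen.

```python
def use_inner_method(model, method="mg"):
    """Select the inner solve method on `model.solver` and return the name used.

    Hypothetical helper: it only assumes `model.solver.set_inner_method(name)` exists.
    """
    if method == "mumps":
        # "mumps" is the solver type reported in this thread to fail on Raijin.
        raise ValueError("'mumps' was reported to fail on Raijin; pick another method")
    model.solver.set_inner_method(method)
    return method


# Intended use inside a model script (not executed here, requires UWGeodynamics):
#   import UWGeodynamics as GEO
#   Model = GEO.Model(...)
#   use_inner_method(Model, "mg")
#   Model.run_for(...)
```

Which alternative works best depends on the problem size and on how PETSc was built on the cluster, so it is worth trying a short run before submitting a long job.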