Hi all,
I will list my Titan environment here. In my ~/.bashrc file, I load the following modules:
module unload PrgEnv-cray
module unload PrgEnv-pgi
module unload PrgEnv-ifort
module load PrgEnv-gnu
module load cudatoolkit
module unload gcc
module load gcc/4.9.3
module swap cray-libsci cray-libsci/13.2.0
module swap cray-mpich cray-mpich/7.2.5
module load cray-netcdf-hdf5parallel/4.3.3.1
module load cray-hdf5-parallel/1.8.14
module load szip/2.1
module load mxml/2.9
module load adios/1.9.0
module load rca
module load cmake
module load boost
The main modules we use here are PrgEnv-gnu, gcc and cudatoolkit. We picked a specific version of gcc (cray-libsci and cray-mpich are also adjusted to match that gcc version) because we have used it before; I believe you should be able to compile the code with any recent version of gcc on Titan.
The mxml and adios modules are for storing model files in ADIOS format. The hdf5-related packages are for writing seismograms in ASDF format. cmake and boost are used for compiling the ASDF library.
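As a quick sanity check after loading these modules (just a sketch; the versions are simply what my environment reports):
gcc --version      # should report 4.9.3 with the modules above
which nvcc         # nvcc should be on the PATH once cudatoolkit is loaded
module list        # the full list from my session is shown below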
All the modules I have loaded on Titan:
[lei@titan-ext6 specfem3d_globe_11af69]$ module list
Currently Loaded Modulefiles:
1) eswrap/1.3.3-1.020200.1278.0 18) xpmem/0.1-2.0502.64982.5.3.gem
2) craype-network-gemini 19) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
3) craype/2.5.9 20) alps/5.2.4-2.0502.9774.31.12.gem
4) cray-mpich/7.2.5 21) rca/1.0.0-2.0502.60530.1.63.gem
5) craype-interlagos 22) atp/2.0.5
6) lustredu/1.4 23) PrgEnv-gnu/5.2.82
7) xalt/0.7.5 24) cudatoolkit/7.5.18-1.0502.10743.2.1
8) module_msg/0.1 25) cray-netcdf-hdf5parallel/4.3.3.1
9) modulator/1.2.0 26) cray-hdf5-parallel/1.8.14
10) hsi/5.0.2.p1 27) szip/2.1
11) DefApps 28) mxml/2.9
12) cray-libsci/13.2.0 29) adios/1.9.0
13) udreg/2.3.2-1.0502.10518.2.17.gem 30) git/2.3.2
14) ugni/6.0-1.0502.10863.8.28.gem 31) cmake/2.8.10.2
15) pmi/5.0.11 32) boost/1.57.0
16) dmapp/7.0.1-1.0502.11080.8.74.gem 33) vim/7.4
17) gni-headers/4.0-1.0502.10859.7.8.gem 34) gcc/4.9.3
*** There is an example of SPECFEM in the directory:
/lustre/atlas/world-shared/geo111/wenjie/DATA_RADICAL/specfem3d_globe_11af69
Configure
Run the configure step with ./configure.titan.sh. You can omit flags in configure.titan.sh if you want to leave out ASDF, for example.
Attention: To output seismograms in ASDF format, you need to compile and link the external ASDF library. The installation instructions are here. I included a pre-compiled library in the example specfem directory, but you may need to recompile it for your environment.
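For reference, building the ASDF library yourself looks roughly like the sketch below (the install prefix is a placeholder; the asdf-library README is the authoritative source):
git clone https://github.com/SeismicData/asdf-library.git
cd asdf-library
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/asdf-install   # prefix is a placeholder
make && make install
# point ASDF_LIBS in configure.titan.sh at the installed libasdf.a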
Compile
Run the compilation with ./compile.titan.sh. In compile.titan.sh, the mesher and solver are compiled separately.
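In essence it runs something like the following (a simplified sketch using the standard SPECFEM make targets; the actual script may add extra steps):
make xmeshfem3D    # mesher
make xspecfem3D    # solver; the CUDA kernels are built here when configured with --with-cuda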
Job submission (current)
The current job submission scripts for the mesher and solver are job_mesher.bash and job_solver.bash.
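For orientation, a Titan PBS script for the solver looks roughly like the sketch below (account, walltime, node count and rank layout are placeholders, not the actual contents of job_solver.bash):
#!/bin/bash
#PBS -A GEO111                 # project allocation (placeholder)
#PBS -l walltime=01:00:00
#PBS -l nodes=384
cd $PBS_O_WORKDIR
# one MPI rank per node so each rank drives that node's GPU (layout is an assumption)
aprun -n 384 -N 1 ./bin/xspecfem3D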
@vivek-bala The feature/gpu branches from RP and Saga are ready for testing on Titan.
Hey @wjlei1990, I gave the instructions a try. No issues during configuration and compilation. Thanks.
My job, however, failed. I think some data is missing (DATA/GLL/) in the location you provided above. I have attached the entire output log at https://gist.github.com/vivek-bala/bccd9f07ad99c9e3594939b656cdb15f.
Also, from that log it seems MPI is also required. Just to be sure, does the job use both CPU+GPU, or GPU only?
Thanks
Oops, I think in the example, I used our own model. I will modify the example and let you know later.
Also, you can modify DATA/Par_file by changing the parameter:
MODEL = GLL
to another model, for example:
MODEL = 1D_isotropic_prem
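For example, something like this switches it in place (a sketch; double-check the exact spacing your Par_file uses):
sed -i 's/^MODEL *=.*/MODEL                           = 1D_isotropic_prem/' DATA/Par_file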
I put a clean GPU-compiled version at:
/lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe_GPU
The GPU version uses both CPU and GPU; MPI is used for the communication between nodes (on the CPU side).
The configure, compile and job submission scripts are located in the same directory.
A clean CPU-compiled version is located at:
/lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe
Both copies are the newest version, obtained with git clone from the SPECFEM website, and don't need extra files...
The compilation instructions and PBS scripts worked for me as well. The next thing to do is to compile these binaries against the openmpi on Titan. I will confirm with the RP dev team which version of openmpi is to be used and get back to PU.
@vivek-bala Please see and update the RP ticket with documentation about RP and OpenMPI
Hi @wjlei1990, I was able to compile the CPU version against the RP openmpi on Titan. I need to test it with the same job scripts as you did. Could you please give me read permissions to all files in /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe and /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe_GPU?
Thanks
I believe my last comment had an error. I tried to use the Par_file from the GPU example for the CPU one. Although it didn't seem to complain, if I remember correctly this is not expected to work.
My attempt of compiling the GPU version has been unsuccessful so far.
I have the following modules loaded:
Currently Loaded Modulefiles:
1) eswrap/1.3.3-1.020200.1278.0 12) cray-libsci/13.2.0 23) PrgEnv-gnu/5.2.82
2) craype-network-gemini 13) udreg/2.3.2-1.0502.10518.2.17.gem 24) cmake/2.8.10.2
3) gcc/4.9.3 14) ugni/6.0-1.0502.10863.8.28.gem 25) boost/1.57.0
4) craype/2.5.9 15) pmi/5.0.11 26) fftw/3.3.4.11
5) craype-interlagos 16) dmapp/7.0.1-1.0502.11080.8.74.gem 27) cudatoolkit/7.5.18-1.0502.10743.2.1
6) lustredu/1.4 17) gni-headers/4.0-1.0502.10859.7.8.gem 28) /openmpi/2017_05_04_539f71d
7) xalt/0.7.5 18) xpmem/0.1-2.0502.64982.5.3.gem 29) szip/2.1
8) module_msg/0.1 19) dvs/2.5_0.9.0-1.0502.2188.1.113.gem 30) mxml/2.9
9) modulator/1.2.0 20) alps/5.2.4-2.0502.9774.31.12.gem 31) adios/1.9.0
10) hsi/5.0.2.p1 21) rca/1.0.0-2.0502.60530.1.63.gem 32) cray-hdf5/1.10.0.1
11) DefApps 22) atp/2.0.5
I installed HDF5-parallel and set the following env variables:
HDF5_INC=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/include
HDF5_LIB=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/lib
HDF5_DIR=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/
HDF5_ROOT=/lustre/atlas/scratch/vivekb/bip149/hdf5-paralle
My configure.titan.sh file looks as follows:
#!/bin/bash
#see: ~/.bashrc
source /etc/profile
module list
mpif90=mpif90
mpicc=mpicc
f90=mpif90
cc=mpicc
## gnu compilers
warn="-Wunused -Waliasing -Wampersand -Wcharacter-truncation -Wline-truncation -Wsurprising -Wno-tabs -Wunderflow"
flags="-O3 -mcmodel=medium $warn"
cflags=""
##################################################
# with asdf, adios and cuda5 support
# 1. Make sure you have downloaded and compiled the asdf library (https://github.com/SeismicData/asdf-library).
# 2. Load the adios and cuda libraries from the Titan system modules.
./configure --with-asdf ASDF_LIBS="/lustre/atlas/scratch/vivekb/bip149/asdf-install/lib64/libasdf.a" --with-adios --with-cuda=cuda5 --host=x86_64-unknown-linux-gnu MPIF90=$mpif90 F90=$f90 CC=$cc FLAGS_CHECK="$flags" FCFLAGS="" CFLAGS="$cflags" CUDA_INC="$CUDATOOLKIT_HOME/include" CUDA_LIB="$CUDATOOLKIT_HOME/lib64" MPI_INC="/lustre/atlas2/csc230/world-shared/openmpi/installed/2017_05_04_539f71d/include"
##################################################
# with adios and cuda
# load adios and cuda library from titan system module
#./configure --with-adios --with-cuda=cuda5 --host=x86_64-unknown-linux-gnu MPIF90=$mpif90 F90=$f90 CC=$cc FLAGS_CHECK="$flags" FCFLAGS="" CFLAGS="$cflags" CUDA_INC="$CUDATOOLKIT_HOME/include" CUDA_LIB="$CUDATOOLKIT_HOME/lib64" MPI_INC="$CRAY_MPICH2_DIR/include"
##
## setup
##
echo
echo "modifying mesh_constants_cuda.h..."
sed -i "/ENABLE_VERY_SLOW_ERROR_CHECKING/ c\#undef ENABLE_VERY_SLOW_ERROR_CHECKING" src/gpu/mesh_constants_cuda.h
echo
echo "done"
echo
The configure step ended successfully, but the compilation ended with the following error:
/usr/bin/ld: cannot find -lhdf5hl_fortran
/usr/bin/ld: cannot find -lhdf5_hl
/usr/bin/ld: cannot find -lhdf5
/usr/bin/sha1sum: ./bin/xspecfem3D: No such file or directory
collect2: error: ld returned 1 exit status
Just to test with the system module, I loaded cray-hdf5 and redid the steps (configure and compile). The compile ended with the same error.
Hi Vivek,
Do you need to install the hdf5 library yourself (because of the MPI you are using on Titan)?
I am currently using the system modules (on Titan):
25) cray-netcdf-hdf5parallel/4.3.3.1
26) cray-hdf5-parallel/1.8.14
I think the cray-hdf5 module you loaded may not work properly in parallel.
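If you want to stay on the system modules, swapping to the parallel variant would be something like (assuming it does not conflict with your other modules):
module unload cray-hdf5
module load cray-hdf5-parallel/1.8.14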
Hey Wenjie,
Yes, I tried with a custom hdf5-parallel installation (compiled against the openmpi). I set the following environment variables:
HDF5_INC=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/include
HDF5_LIB=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/lib
HDF5_DIR=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/
HDF5_ROOT=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel
But it gave me the same error. I'll give it another go with setting LD_LIBRARY_PATH as well.
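Something along these lines (a sketch using my custom install prefix):
export HDF5_DIR=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel
export LD_LIBRARY_PATH=$HDF5_DIR/lib:$LD_LIBRARY_PATH
ls $HDF5_DIR/lib/libhdf5*   # check that the libraries the linker asks for were actually built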
Ok. I have not tried to compile and install hdf5 on Titan myself.
I think the error comes from some configuration issue. Did you add the Fortran flag when compiling hdf5? Have you tried writing a test program to check that your library works?
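Just as a sketch of what I mean (the prefix is a placeholder, and the flags should be checked against the hdf5 install notes), a parallel build with the Fortran interface would look roughly like:
CC=mpicc FC=mpif90 ./configure --enable-parallel --enable-fortran --prefix=/path/to/hdf5-parallel
make && make install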
Keep me posted...
We have a working installation of specfem that uses GPUs. This is built against the Titan MPI modules and not the RP openmpi. We can use RP in the aprun mode that does not have the restriction of using the RP openmpi.
Currently, blocked by https://github.com/radical-cybertools/radical.pilot/issues/1365.
Undergoing testing. Blocked by https://github.com/radical-cybertools/radical.pilot/issues/1441
Hey @wjlei1990, with reference to the latest email exchanges: if you agree the output files are consistent with what you expected, please close this ticket.
Forgot to close it before. It seems good.
Is it possible to provide some GPU workload for testing? It would be great to have some (short, informal) documentation on how to distinguish correct from incorrect execution (in cases where that is not obvious). Thanks!