radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

Test GPU support for RP on Titan #11

Closed mturilli closed 7 years ago

andre-merzky commented 7 years ago

Is it possible to provide some GPU workload for testing? It would be great to have some (short, informal) documentation on how to distinguish correct from incorrect execution (in cases where that is not obvious). Thanks!

vivek-bala commented 7 years ago

#10 is now done. The executable currently being used is CPU-only, but there is also a GPU version. I will work on compiling that on Titan; maybe we can work with that.

wjlei1990 commented 7 years ago

Hi all,

I will list my Titan environment here.

1. System Modules

In my ~/.bashrc file, I load the following modules:

  module unload PrgEnv-cray
  module unload PrgEnv-pgi
  module unload PrgEnv-ifort
  module load PrgEnv-gnu

  module load cudatoolkit

  module unload gcc
  module load gcc/4.9.3
  module swap cray-libsci cray-libsci/13.2.0
  module swap cray-mpich cray-mpich/7.2.5

  module load cray-netcdf-hdf5parallel/4.3.3.1
  module load cray-hdf5-parallel/1.8.14

  module load szip/2.1
  module load mxml/2.9
  module load adios/1.9.0
  module load rca

  module load cmake
  module load boost

The main modules used here are PrgEnv-gnu, gcc, and cudatoolkit. We picked a specific version of gcc (cray-libsci and cray-mpich are adjusted to match that gcc version) because we have used it before; you should be able to compile the code with any recent version of gcc available on Titan.

The mxml and adios modules are for storing model files in ADIOS format; the HDF5-related packages are for writing seismograms in ASDF format. cmake and boost are used to compile the ASDF library.

All the modules I have loaded on Titan:

[lei@titan-ext6 specfem3d_globe_11af69]$ module list
Currently Loaded Modulefiles:
  1) eswrap/1.3.3-1.020200.1278.0          18) xpmem/0.1-2.0502.64982.5.3.gem
  2) craype-network-gemini                 19) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
  3) craype/2.5.9                          20) alps/5.2.4-2.0502.9774.31.12.gem
  4) cray-mpich/7.2.5                      21) rca/1.0.0-2.0502.60530.1.63.gem
  5) craype-interlagos                     22) atp/2.0.5
  6) lustredu/1.4                          23) PrgEnv-gnu/5.2.82
  7) xalt/0.7.5                            24) cudatoolkit/7.5.18-1.0502.10743.2.1
  8) module_msg/0.1                        25) cray-netcdf-hdf5parallel/4.3.3.1
  9) modulator/1.2.0                       26) cray-hdf5-parallel/1.8.14
 10) hsi/5.0.2.p1                          27) szip/2.1
 11) DefApps                               28) mxml/2.9
 12) cray-libsci/13.2.0                    29) adios/1.9.0
 13) udreg/2.3.2-1.0502.10518.2.17.gem     30) git/2.3.2
 14) ugni/6.0-1.0502.10863.8.28.gem        31) cmake/2.8.10.2
 15) pmi/5.0.11                            32) boost/1.57.0
 16) dmapp/7.0.1-1.0502.11080.8.74.gem     33) vim/7.4
 17) gni-headers/4.0-1.0502.10859.7.8.gem  34) gcc/4.9.3

2. Configure and Compile SPECFEM3D-GLOBE

*** There is an example of SPECFEM in the directory: /lustre/atlas/world-shared/geo111/wenjie/DATA_RADICAL/specfem3d_globe_11af69

  1. Configure: Run the configuration with ./configure.titan.sh. You can omit flags in configure.titan.sh if, for example, you want to leave out ASDF. Attention: to output seismograms in ASDF format, you need to compile and link the external ASDF library. The installation instructions are here. I included a pre-compiled library in the example specfem directory, but you may need to recompile it for your environment.

  2. Compile: Run the compilation with ./compile.titan.sh; the mesher and the solver are compiled separately in that script.

  3. Job submission (current): The current job submission scripts for the mesher and the solver are job_mesher.bash and job_solver.bash (see the sketch below).
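
For reference, a minimal sketch of what such a submission script might contain (project ID, node count, rank count, and walltime are placeholders, not copied from the actual job_solver.bash):

  #!/bin/bash
  #PBS -A GEO111              # project allocation (placeholder)
  #PBS -l nodes=24            # Titan nodes, 16 cores each (placeholder)
  #PBS -l walltime=01:00:00

  cd $PBS_O_WORKDIR

  # the total rank count must match NPROC_XI * NPROC_ETA * NCHUNKS in DATA/Par_file
  aprun -n 384 ./bin/xspecfem3D

The mesher script launches ./bin/xmeshfem3D in the same way.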

mturilli commented 7 years ago

@vivek-bala The feature/gpu branches of RP and Saga are ready for testing on Titan.

vivek-bala commented 7 years ago

Hey @wjlei1990, I gave the instructions a try. No issues during configuration and compilation. Thanks.

My job, however, failed. I think some data is missing in DATA/GLL/ at the location you provided above. I have attached the entire output log at https://gist.github.com/vivek-bala/bccd9f07ad99c9e3594939b656cdb15f.

Also, from that log it seems MPI is required as well. Just to be sure, is the job using both CPU+GPU, or GPU only?

Thanks

wjlei1990 commented 7 years ago

Oops, I think I used our own model in the example. I will modify the example and let you know later.

Also, you can modify DATA/Par_file by changing the parameter:

MODEL                           = GLL

to other models, like:

MODEL                           = 1D_isotropic_prem

wjlei1990 commented 7 years ago

I put a clean GPU-compiled version at: /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe_GPU. The GPU version uses both CPU and GPU; MPI is used for communication between nodes (on the CPU side).

The configure, compile, and job submission scripts are located in the same directory.

A clean CPU-compiled version is located at: /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe

Both are the newest version, cloned with git from the SPECFEM website, and don't need extra files...
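
For context, a sketch of how the launch line typically differs between the two builds (rank counts are placeholders; the total must still match the decomposition in DATA/Par_file). Each Titan node has a single K20X GPU, so the GPU build is usually run with one MPI rank per node:

  # CPU build: pack 16 ranks per node
  aprun -n 96 -N 16 ./bin/xspecfem3D

  # GPU build: one rank per node, each rank driving that node's GPU
  aprun -n 96 -N 1 ./bin/xspecfem3D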

vivek-bala commented 7 years ago

The compilation instructions and PBS scripts worked for me as well. The next thing to do is to compile these binaries against the OpenMPI on Titan. I will confirm with the RP dev team which version of OpenMPI is to be used and get back to PU.

mturilli commented 7 years ago

@vivek-bala Please see and update the RP ticket for/with documentation about RP and OpenMPI.

vivek-bala commented 7 years ago

Hi @wjlei1990, I was able to compile the CPU version against the RP OpenMPI on Titan. I need to test it with the same job scripts as you did. Could you please give me read permissions to all files in /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe and /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe_GPU?

Thanks
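
For reference, one way to grant that kind of access (a sketch; the capital X keeps directories traversable without marking regular files executable):

  chmod -R o+rX /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe \
                /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe_GPU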

vivek-bala commented 7 years ago

I believe my last comment contains an error. I tried to use the Par_file from the GPU example for the CPU one. Although it didn't seem to complain, if I remember correctly this is not expected to work.

vivek-bala commented 7 years ago

My attempt at compiling the GPU version has been unsuccessful so far.

I have the following modules loaded:

Currently Loaded Modulefiles:
  1) eswrap/1.3.3-1.020200.1278.0          12) cray-libsci/13.2.0                    23) PrgEnv-gnu/5.2.82
  2) craype-network-gemini                 13) udreg/2.3.2-1.0502.10518.2.17.gem     24) cmake/2.8.10.2
  3) gcc/4.9.3                             14) ugni/6.0-1.0502.10863.8.28.gem        25) boost/1.57.0
  4) craype/2.5.9                          15) pmi/5.0.11                            26) fftw/3.3.4.11
  5) craype-interlagos                     16) dmapp/7.0.1-1.0502.11080.8.74.gem     27) cudatoolkit/7.5.18-1.0502.10743.2.1
  6) lustredu/1.4                          17) gni-headers/4.0-1.0502.10859.7.8.gem  28) /openmpi/2017_05_04_539f71d
  7) xalt/0.7.5                            18) xpmem/0.1-2.0502.64982.5.3.gem        29) szip/2.1
  8) module_msg/0.1                        19) dvs/2.5_0.9.0-1.0502.2188.1.113.gem   30) mxml/2.9
  9) modulator/1.2.0                       20) alps/5.2.4-2.0502.9774.31.12.gem      31) adios/1.9.0
 10) hsi/5.0.2.p1                          21) rca/1.0.0-2.0502.60530.1.63.gem       32) cray-hdf5/1.10.0.1
 11) DefApps                               22) atp/2.0.5

I installed HDF5-parallel and set the following env variables:

HDF5_INC=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/include
HDF5_LIB=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/lib
HDF5_DIR=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/
HDF5_ROOT=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel

My configure.titan.sh file looks as follows:

#!/bin/bash

#see: ~/.bashrc
source /etc/profile
module list

mpif90=mpif90
mpicc=mpicc
f90=mpif90
cc=mpicc

## gnu compilers
warn="-Wunused -Waliasing -Wampersand -Wcharacter-truncation -Wline-truncation -Wsurprising -Wno-tabs -Wunderflow"
flags="-O3 -mcmodel=medium $warn"
cflags=""

##################################################
# with asdf, adios and cuda5 support
# 1. Make sure you have downloaded and compiled the ASDF library (https://github.com/SeismicData/asdf-library).
# 2. Load the adios and cuda libraries from the Titan system modules.

./configure --with-asdf ASDF_LIBS="/lustre/atlas/scratch/vivekb/bip149/asdf-install/lib64/libasdf.a" --with-adios --with-cuda=cuda5 --host=x86_64-unknown-linux-gnu MPIF90=$mpif90 F90=$f90 CC=$cc FLAGS_CHECK="$flags" FCFLAGS="" CFLAGS="$cflags" CUDA_INC="$CUDATOOLKIT_HOME/include" CUDA_LIB="$CUDATOOLKIT_HOME/lib64" MPI_INC="/lustre/atlas2/csc230/world-shared/openmpi/installed/2017_05_04_539f71d/include"

##################################################
# with adios and cuda
# load adios and cuda library from titan system module

#./configure --with-adios --with-cuda=cuda5 --host=x86_64-unknown-linux-gnu MPIF90=$mpif90 F90=$f90 CC=$cc FLAGS_CHECK="$flags" FCFLAGS="" CFLAGS="$cflags" CUDA_INC="$CUDATOOLKIT_HOME/include" CUDA_LIB="$CUDATOOLKIT_HOME/lib64" MPI_INC="$CRAY_MPICH2_DIR/include"

##
## setup
##
echo
echo "modifying mesh_constants_cuda.h..."
sed -i "/ENABLE_VERY_SLOW_ERROR_CHECKING/ c\#undef ENABLE_VERY_SLOW_ERROR_CHECKING" src/gpu/mesh_constants_cuda.h

echo
echo "done"
echo

The configure step ended successfully, but the compilation fails with the following error:

/usr/bin/ld: cannot find -lhdf5hl_fortran
/usr/bin/ld: cannot find -lhdf5_hl
/usr/bin/ld: cannot find -lhdf5
/usr/bin/sha1sum: ./bin/xspecfem3D: No such file or directory
collect2: error: ld returned 1 exit status

Just to test with the system module, I loaded cray-hdf5 and redid the steps (configure and compile). The compile ended with the same error.

wjlei1990 commented 7 years ago

Hi Vivek,

Do you need to install the HDF5 library yourself (because of the MPI you are using on Titan)?

I am currently using the system modules (on Titan):

25) cray-netcdf-hdf5parallel/4.3.3.1
26) cray-hdf5-parallel/1.8.14

I think the module you loaded, cray-hdf5, may not work properly in parallel.
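
If you want to try the system parallel HDF5 first, something along these lines should work (module versions taken from my listing above; whether it links cleanly against your OpenMPI build is another question):

  module unload cray-hdf5
  module load cray-hdf5-parallel/1.8.14
  module load cray-netcdf-hdf5parallel/4.3.3.1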

vivek-bala commented 7 years ago

Hey Wenjie,

Yes, I tried with a custom hdf5-parallel installation (compiled against the OpenMPI). I set the following environment variables:

HDF5_INC=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/include
HDF5_LIB=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/lib
HDF5_DIR=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel/
HDF5_ROOT=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel

But it gave me the same error. I'll give it another go with setting LD_LIBRARY_PATH as well.
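
A sketch of what I plan to try (LDFLAGS added as well, since "cannot find -lhdf5" comes from the linker rather than the runtime loader; whether configure picks LDFLAGS up from the environment is an assumption):

  export HDF5_ROOT=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel
  export LD_LIBRARY_PATH=$HDF5_ROOT/lib:$LD_LIBRARY_PATH
  # point the linker at the custom HDF5 install as well
  export LDFLAGS="-L$HDF5_ROOT/lib $LDFLAGS"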

wjlei1990 commented 7 years ago

OK. I have not tried to compile and install HDF5 on Titan myself.

I think the error comes from some configuration issue. Did you add the Fortran flag when compiling HDF5? Have you tried writing a test program to check that your library works?
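
For reference, a minimal sketch of configuring HDF5 with both parallel and Fortran support against the OpenMPI wrappers (prefix and make parallelism are placeholders, not the exact commands used here):

  # from an extracted HDF5 source tree
  CC=mpicc FC=mpif90 ./configure \
      --prefix=/lustre/atlas/scratch/vivekb/bip149/hdf5-parallel \
      --enable-parallel --enable-fortran
  make -j 8
  make install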

Keep me posted...

vivek-bala commented 7 years ago

We have a working installation of SPECFEM that uses GPUs. It is built against the Titan MPI modules, not the RP OpenMPI. We can use RP in aprun mode, which does not have the restriction of using the RP OpenMPI.

Currently blocked by https://github.com/radical-cybertools/radical.pilot/issues/1365.

mturilli commented 7 years ago

Undergoing testing. Blocked by https://github.com/radical-cybertools/radical.pilot/issues/1441

vivek-bala commented 7 years ago

Hey @wjlei1990, with regard to the latest email exchanges: if you agree the output files are consistent with what you expected, please close this ticket.

mpbl commented 7 years ago

Forgot to close it before. It seems good.