radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Help compiling specfem against RP openmpi on Titan #16

Closed vivek-bala closed 7 years ago

vivek-bala commented 7 years ago

When trying to use specfem3D binary, I encounter the following error:

TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            1
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            2
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            0
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            3
 TRANSVERSE_ISOTROPY: T F
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            1
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            3
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            2
 TRANSVERSE_ISOTROPY: T F
 Error in compiled parameters, please recompile solver 14
 Error detected, aborting MPI... proc            0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode 30.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[titan-batch8:23028] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[titan-batch8:23028] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

This can be reproduced by running radical_pilot_cu_launch_script.sh in /lustre/atlas/scratch/vivekb/bip149/radical.pilot.sandbox/rp.session.titan-ext5.vivekb.017261.0014/pilot.0000/unit.000001. Please make sure you are on a compute node (you can use an interactive job to do so). I think I have set global permissions on all the files; let me know if you run into permission issues.

This is the sequence of commands I currently use to compile specfem on Titan:

module swap PrgEnv-pgi PrgEnv-gnu
module use --append /lustre/atlas/world-shared/csc230/openmpi/modules/
module load openmpi/2017_03_24_6da4dbb-unsorted
git clone --recursive --branch devel https://github.com/geodynamics/specfem3d_globe.git
./configure FC=mpif90 CC=mpicc MPIFC=mpif90

I also tried the following:

./configure FC=gfortran CC=gcc MPIFC=mpif90
make clean
make create_header_file
./bin/xcreate_header_file
make clean
make meshfem3D
make specfem3D
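
For reference, the make steps above can be chained so that a failing step stops the build instead of being silently followed by the next one. This is only a sketch: the function name rebuild_specfem is hypothetical, and it assumes ./configure has already been run and that the current directory is the top of the specfem3d_globe checkout. The order of the steps is copied from the list above, not independently verified.

```shell
#!/bin/sh
# Hypothetical wrapper around the make sequence listed above.
# Assumption: called from the top of the checkout, after ./configure.
rebuild_specfem() {
    # "&&" stops at the first failing step and returns its exit status.
    make clean &&
    make create_header_file &&
    ./bin/xcreate_header_file &&
    make clean &&
    make meshfem3D &&
    make specfem3D
}
```

Calling rebuild_specfem and checking its exit status (for example `rebuild_specfem || echo "build failed"`) makes it easier to tell a failed build from a successful one.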
vivek-bala commented 7 years ago

There is no error reported during the above compilation sequence.

mpbl commented 7 years ago

Hi, @wjlei1990 Can you post the latest configuration script you are using for titan in a gist and link it in this issue? Thanks

mpbl commented 7 years ago

@vivek-bala

There is no error reported during the above compilation sequence.

What do you mean? Is it working now?

What you are encountering is a result of specfem3d_globe having some parameters "hard-compiled". The idea is to "speed things up" by letting the compiler trim the execution path.

When you compile, it runs an executable called ./bin/xcreate_header_file that generates a file called OUTPUT_FILES/values_from_mesher.h containing, for instance, "logical, parameter :: TRANSVERSE_ISOTROPY_VAL = .false.". This file is then included in the list of source files required to compile the mesher and the solver.

TRANSVERSE_ISOTROPY_VAL is set up during the header creation and depends on the value of MODEL = [name your model] in DATA/Par_file.

If you change the model after compilation, the value read from the Par_file at run time may conflict with the value saved in the header at compile time.
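
One quick way to look for such a conflict is to print the MODEL line from the Par_file next to the value frozen into the generated header. The helper below is a sketch: the function name show_compiled_model is hypothetical, and it assumes the two file locations named above, relative to the checkout passed as its argument.

```shell
#!/bin/sh
# Hypothetical helper: show the model requested in DATA/Par_file alongside
# the TRANSVERSE_ISOTROPY_VAL line written to the header at compile time.
# Assumption: $1 is the top of a specfem3d_globe tree (defaults to ".").
show_compiled_model() {
    tree=${1:-.}
    # The MODEL line may have any amount of whitespace around "=".
    grep -E '^[[:space:]]*MODEL[[:space:]]*=' "$tree/DATA/Par_file"
    grep 'TRANSVERSE_ISOTROPY_VAL' "$tree/OUTPUT_FILES/values_from_mesher.h"
}
```

The helper only prints the two lines. Deciding whether the model and the flag agree still requires knowing whether that model is transversely isotropic.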

In short, the solution is: Although some parameters in the Par_file can be changed (e.g. RECORD_LENGTH_IN_MINUTES), avoid changing anything in the Par_file after the compilation.

If you need to download or copy specfem to several locations, make sure that DATA/Par_file and setup/constants.h(.in) are the same in each.
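
A minimal sketch of that check, assuming two checkouts whose top-level paths are passed as arguments (the function name same_build_inputs is hypothetical). It compares the files byte for byte with cmp, so it also flags changes to parameters that would have been safe to edit.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two specfem3d_globe trees have
# identical DATA/Par_file and setup/constants.h.
# Assumption: add setup/constants.h.in to the list if that is the file you edit.
same_build_inputs() {
    tree_a=$1
    tree_b=$2
    status=0
    for f in DATA/Par_file setup/constants.h; do
        # cmp -s is silent and returns non-zero if the files differ
        # or if either one is missing.
        if ! cmp -s "$tree_a/$f" "$tree_b/$f"; then
            echo "differs or missing: $f"
            status=1
        fi
    done
    return $status
}
```

For example, `same_build_inputs /path/to/build_tree /path/to/run_tree` (both paths are placeholders) returns 0 only when both files match.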

vivek-bala commented 7 years ago

Hey @mpbl, apologies for the delay.

What do you mean? Is it working now?

I meant that although there is no error during the compilation stage, there seems to be an error during execution which suggests it was not compiled correctly.

In short, the solution is: Although some parameters in the Par_file can be changed (e.g. RECORD_LENGTH_IN_MINUTES), avoid changing anything in the Par_file after the compilation.

I see what you mean. I didn't make any changes to the Par_file when I tried to execute.

Do you see any errors with the compilation process?

vivek-bala commented 7 years ago

Hey @mpbl , @wjlei1990 , Please share the compilation instructions for specfem with openmpi when you get a chance.

wjlei1990 commented 7 years ago

One moment, I will provide it tonight.

wjlei1990 commented 7 years ago

I have a permission issue with this directory:

[lei@rhea-login3g ~]$ ls -alh /lustre/atlas/scratch/vivekb/bip149/radical.pilot.sandbox/rp.session.titan-ext5.vivekb.017261.0014/pilot.0000/unit.000001
ls: cannot access /lustre/atlas/scratch/vivekb/bip149/radical.pilot.sandbox/rp.session.titan-ext5.vivekb.017261.0014/pilot.0000/unit.000001: Permission denied
vivek-bala commented 7 years ago

Can you try again please? I think it should be fixed now.

wjlei1990 commented 7 years ago

It seems I still don't have access...

Maybe you can put it on the proj-shared directory? I think I just need to see if there are some errors in your files...

vivek-bala commented 7 years ago

Oops. Ok, I put the folder at /lustre/atlas/world-shared/csc230/rp.session.titan-ext5.vivekb.017261.0014 since it is world shared. Hopefully that works. Permissions look ok.
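
When a deep path like the one above returns "Permission denied", the cause can be any parent directory that lacks the search (execute) bit for the other user, not only the final directory. The sketch below lists the permissions of a path and each of its parents; the function name show_path_perms is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "ls -ld" for a path and every parent directory
# above it, stopping before "/" (or "." for a relative path).
show_path_perms() {
    p=$1
    while [ -n "$p" ] && [ "$p" != "/" ] && [ "$p" != "." ]; do
        ls -ld "$p"
        p=$(dirname "$p")
    done
}
```

If one level turns out to be closed, `chmod -R a+rX <dir>` on a directory you own grants read access to everyone and adds the search bit on directories only.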

wjlei1990 commented 7 years ago

Thanks. I think now I have access to the directory.

I took a look at some sub-dirs, but most of them are empty. Could you point me to the path to your "Par_file"?

vivek-bala commented 7 years ago

Yes, since one of them failed, the entire process was shut down. I have also put the scripts at "/lustre/atlas/world-shared/csc230/fwd_sims/". You can find the Par_file at "/lustre/atlas/world-shared/csc230/fwd_sims/input_data/DATA".

wjlei1990 commented 7 years ago

I have prepared an example (compiling specfem3d_globe using only the CPU) at: /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe

Please look at the files configure.titan.sh and compile.titan.sh to see how I configure and compile the code. I have run both the mesher and the solver successfully on Titan using the CPU. However, this version is based on MPICH.


I have another example using our local cluster, which uses openmpi instead. It uses the configure command you listed above:

./configure FC=mpif90 CC=mpicc MPIFC=mpif90

It uses openmpi/gcc/1.8.8/64 and runs both the mesher and the solver successfully.

vivek-bala commented 7 years ago

Hi @wjlei1990 , I don't have access to all the files (e.g. DATA/Par_file, etc.). Could you give me permissions to all the files in /lustre/atlas/world-shared/geo111/wenjie/specfem3d_globe please?

vivek-bala commented 7 years ago
config.status: error: cannot find input file: `DATA/Par_file'
wjlei1990 commented 7 years ago

Hi Vivek, I changed the permission. Let me know if it works for you.

vivek-bala commented 7 years ago

Thanks everyone for the help.

We have a working example that uses RP to submit meshfem and specfem tasks that use CPUs on Titan. Instructions are available here if anyone wants to give it a try.

vivek-bala commented 7 years ago

This example uses the meshfem and specfem binaries built against the Titan MPI modules and not the RP openmpi. The plan is to use RP in the aprun mode as it suffices for the simulation stages to be run on Titan. This avoids the restriction of having to compile all tools against RP openmpi.

The plan is to use the same mode for the experiments.