qusers / Q6

Q6 Repository -- EVB, FEP and LIE simulator.

MPI crashes in Q6 (intel) on the Rackham cluster #2

Open · klaudia-dais opened this issue 6 years ago

klaudia-dais commented 6 years ago

I compiled Q on Rackham (intel), and the MPI build reports the error "unknown option -Nmpi". Even so, the build finishes. When I submit a job, it exits immediately with this kind of error:

[r101:2335] An error occurred in MPI_Allreduce
[r101:2335] reported by process [1808072705,0]
[r101:2335] on communicator MPI_COMM_WORLD
[r101:2335] MPI_ERR_OP: invalid reduce operation
[r101:2335] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r101:2335]     and potentially your MPI job)

I have tried different intel versions, with both intelmpi and openmpi. It crashes every time with a similar error. When I run the same job on a different cluster, or locally with Qdyn6, it works without problems.

Any idea how to solve it?

acmnpv commented 6 years ago

Thanks for moving this to the open issue tracker. Please attach the compilation workflow and a minimal run script (you can ask Miha how to reduce it to the basic parts without all the extra bits I added).

klaudia-dais commented 6 years ago

My procedure:

git clone https://www.github.com/qusers/Q6.git

cd Q6/src/

module purge

module load intel/17.4

module load openmpi/2.1.1

make all COMP=ifort

make mpi COMP=ifort
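
As a quick sanity check after the build (assuming the makefile places the binaries in Q6/bin, which the Qdyn6p path used later in this thread suggests), one can confirm the parallel binary was produced from inside Q6/src with:

ls ../bin/Qdyn6p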

Run script (just the important part):

#!/bin/bash -l

#SBATCH -J Node1_OH
#SBATCH -n 4
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
module load intel/18.0 intelmpi/18.0

mpirun -np 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log

acmnpv commented 6 years ago

Thanks, please also upload the input files needed for relax.inp

acmnpv commented 6 years ago

Sorry, but I meant that you should attach an archive with all the files (input, topology, and the FEP file if needed).

klaudia-dais commented 6 years ago

run.tar.gz

acmnpv commented 6 years ago

Perfect, thank you!

acmnpv commented 6 years ago

Please try to build Q6 with the modules intel/17.4 and intelmpi/17.4. I saw a large number of compile warnings with intel and openmpi, so they might not be compatible. When running your job, use srun -n $THISCORES inside your sbatch file.
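
A sketch of how that could look inside the existing sbatch script, assuming $THISCORES is simply the requested task count (Slurm exports this as SLURM_NTASKS; the variable name used in Paul's own scripts is not shown here):

module purge
module load intel/17.4 intelmpi/17.4

# Assumption: tie the core count to the #SBATCH -n directive via SLURM_NTASKS
THISCORES=$SLURM_NTASKS

# srun replaces the mpirun call from the original script
srun -n $THISCORES /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log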

esguerra commented 6 years ago

Hej Klaudia and Paul,

Also, you should comment out the -Nmpi flag in the makefile if compiling with intel only; that option doesn't exist for intelmpi.
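
If it is unclear where that flag sits, grep is the quickest way to locate it before commenting it out (the makefile variable that carries -Nmpi is not named in this thread):

grep -n -- '-Nmpi' makefile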

You can try compiling just with intel like so:

module load intelmpi/18.0
module load intel/18.0
make mpi COMP=ifort
make all COMP=ifort

Using the attached makefile.

Before using the makefile, make sure to do:

mv makefile.txt makefile

For some reason GitHub doesn't allow uploads of extensionless files, which is why I uploaded it with the .txt extension.

makefile.txt

The compilation takes a looooooooong time, no idea why. Any idea why the compilation is so very slow, Paul?

I will try to run your files on Rackham and see what's going on too.

Cheers,

M.

acmnpv commented 6 years ago

I ran a test already with two cores and srun, and it worked fine. The compilation is slow with intel because the function inliner seems to go insane somewhere, but the inlining is needed for performance. I need to upload a patch for the makefile; the flag doesn't do any harm, but the warning is confusing, I agree.

acmnpv commented 6 years ago

If there are no more issues now I would close this one again. Otherwise we could keep it open as a reminder that we need to fix the intel/openmpi combination.

Cheers

Paul

acmnpv commented 6 years ago

Also, Mauricio, can you make a quick pull request for the makefile (or push it yourself)? So we can at least get rid of the annoying warnings. :D

esguerra commented 6 years ago

Hej,

So, I am missing something about Rackham. Klaudia's example only works with srun. Any clue as to the reason, Paul?

If I don't use srun, nasty MPI messages appear, but with srun all seems fine and dandy.

[Screenshot attached: screen shot 2017-10-26 at 11 25 08 am]

esguerra commented 6 years ago

I can send a pull request with the makefile, but first accepting your

esguerra commented 6 years ago

Oops, I don't know how I managed to close this. I haven't been able to solve the MPI issues on Rackham with intel. Has Klaudia managed to solve them?

acmnpv commented 6 years ago

Fine with me. I did not have more time to look into this, but it reliably crashed in ddt during the MPI_Init part. No idea what the heck is going on; it might be an issue with the MPI setup on Rackham.

acmnpv commented 6 years ago

Mauricio, did you have some more luck in testing this?

esguerra commented 6 years ago

Hej, Good reminder. Last time I tried, using srun had done the trick, which is very odd, since sbatch and srun are both talking to Slurm in the same way, AFAIK. I will give it another try with the latest version and write back with what I see.

esguerra commented 6 years ago

For some reason, on the Rackham cluster they have aliased mpirun to echo this:

alias mpirun='echo Please use srun'
/usr/bin/echo

They say that this is needed when using intel-compiled programs, that is, switching from mpirun to srun.

So, compiling with:

module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort

Produces a binary which works when invoked with:

srun -n 8 Qdyn6p eq1.inp > eq1.log
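
Putting those pieces together, a sketch of a complete job script under these assumptions (the SBATCH header from Klaudia's earlier script, the intel/18.1 and intelmpi/18.1 modules from this comment, and the Qdyn6p binary path from the original run command):

#!/bin/bash -l
#SBATCH -J Node1_OH
#SBATCH -n 8
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
module load intel/18.1 intelmpi/18.1

# On Rackham, srun is the supported launcher; mpirun is aliased away (see above).
srun -n 8 /home/klaudia/Q/Q6/bin/Qdyn6p eq1.inp > eq1.log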

Paul, I guess you can close these if @klaudia-dais also sees her jobs running when Q is compiled and run in the suggested way.

acmnpv commented 6 years ago

I haven't heard anything back there, but I think we may want to keep it open until we add something about this to the README?

esguerra commented 6 years ago

Hej, Good idea. M.