klaudia-dais opened 6 years ago
Thanks for moving this to the open issue tracker. Please attach the workflow for compiling and the minimal run script (you can ask Miha how to reduce it to the basic parts w/o all the extra bits I added)
My procedure: git clone https://www.github.com/qusers/Q6.git
cd Q6/src/
module purge
module load intel/17.4
module load openmpi/2.1.1
make all COMP=ifort
make mpi COMP=ifort
Run script (just the important part):
module purge
module load intel/18.0 intelmpi/18.0
mpirun -np 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log
Thanks, please also upload the input files needed for relax.inp
Sorry, I meant that you should attach an archive with all the files (input, topology, and the FEP file if needed).
Perfect, thank you!
Please try to build Q6 with the modules intel/17.4 and intelmpi/17.4. I saw a large number of compile warnings with intel and openmpi, so they might not be compatible. When running your job, use srun -n $THISCORES inside your sbatch file.
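For reference, a minimal sbatch sketch of what that could look like (the task count and wall time here are placeholders, not from Klaudia's actual setup; the binary path and input file are the ones from her run script above):

```shell
#!/bin/bash
#SBATCH -n 8                 # total MPI ranks (placeholder)
#SBATCH -t 01:00:00          # wall time (placeholder)

module purge
module load intel/17.4 intelmpi/17.4

# Use srun instead of mpirun, as suggested above.
# SLURM_NTASKS is set by Slurm to the -n value requested above.
srun -n $SLURM_NTASKS /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log
```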
Hej Klaudia and Paul,
Also, you should comment out the -Nmpi flag in the makefile if compiling with intel only; that option doesn't exist for intelmpi.
You can try compiling just with intel like so:
module load intelmpi/18.0
module load intel/18.0
make mpi COMP=ifort
make all COMP=ifort
Using the attached makefile.
Before using the makefile, make sure to do:
mv makefile.txt makefile
For some reason GitHub doesn't allow uploads of extensionless files, which is why I uploaded it with the .txt extension.
The compilation takes a loooong time, no idea why. Any idea why the compilation is so slow, Paul?
I will try to run your files on Rackham and see what's going on too.
Cheers,
M.
I ran a test already with two cores and srun and it worked fine. The compilation is slow with intel because the function inliner seems to go insane somewhere, but the inlining is needed for performance. I need to upload a patch for the makefile; the flag doesn't do any harm, but the warning is confusing, I agree.
If there are no more issues now I would close this one again. Otherwise we could keep it open as a reminder that we need to fix the intel/openmpi combination.
Cheers
Paul
Also, Mauricio, can you make a quick pull request for the makefile (or push it yourself)? So we can at least get rid of the annoying warnings. :D
Hej,
So, I am missing something on Rackham. Klaudia's example only works with srun. Any clue as to the reason, Paul?
If I don't use srun nasty MPI messages appear, but with srun all seems fine and dandy.
I can send a pull request with the makefile, but first accepting your
Oops, I don't know how I managed to close this. I haven't been able to solve the MPI issues on Rackham with intel. Has Klaudia managed to solve them?
Fine with me. I did not have more time to look into this, but it reliably crashed ddt during the mpi_init part. No idea what the heck is going on; it might be an issue with the MPI set-up on Rackham.
Mauricio, did you have some more luck in testing this?
Hej, Good reminder. Last time I tried, using srun had done the trick, which is very odd, since sbatch and srun both talk to Slurm in the same way, AFAIK. I will give it another try with the latest version and write back what I see.
For some reason, on the Rackham cluster they have aliased mpirun to echo this:
alias mpirun='echo Please use srun'
/usr/bin/echo
They say this is needed when using Intel-compiled programs, that is, you should use srun instead of mpirun.
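The effect of that alias can be seen in a small bash sketch (this is just a local demo of the shadowing mechanism, nothing Rackham-specific; the binary name is the one from this thread):

```shell
#!/bin/bash
# Non-interactive bash does not expand aliases unless told to;
# login shells on the cluster do, which is why users hit the alias.
shopt -s expand_aliases

# The alias reported on Rackham:
alias mpirun='echo Please use srun'

# Any mpirun invocation now just prints the hint plus its arguments:
mpirun -np 4 ./Qdyn6p
```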
So, compiling with:
module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort
Produces a binary which works when invoked with:
srun -n 8 Qdyn6p eq1.inp > eq1.log
Paul, I guess you can close this if @klaudia-dais also sees her jobs running when Q is compiled and run in the suggested way.
I haven't heard anything back yet, but I think we may want to keep it open until we add something about this to the README?
Hej, Good idea. M.
I compiled Q on Rackham (intel), and with mpi it shows the error "unknown option -Nmpi". Even with that it finishes. But when I submit a job, it exits immediately with this kind of error:
[r101:2335] An error occurred in MPI_Allreduce
[r101:2335] reported by process [1808072705,0]
[r101:2335] on communicator MPI_COMM_WORLD
[r101:2335] MPI_ERR_OP: invalid reduce operation
[r101:2335] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r101:2335] and potentially your MPI job)
I tried different versions of intel, with both intelmpi and openmpi. Every time it crashes with a similar error. When I run the same job on a different cluster, or locally with the serial Qdyn6, it works without problems.
Any idea how to solve it?
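One way to narrow this down (my suggestion, not part of Q6) is a minimal MPI_Allreduce reproducer: build it with the exact same modules as Qdyn6p and launch it the same way. If this tiny program also dies with MPI_ERR_OP, the problem is in the MPI stack or launch setup (e.g. a binary compiled against one MPI but run under another), not in Q itself.

```c
/* Minimal MPI_Allreduce check.
 * Build (assuming the MPI compiler wrapper from the loaded module):
 *   mpicc -o allreduce_test allreduce_test.c
 * Run the same way as the failing job, e.g.:
 *   srun -n 4 ./allreduce_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Same collective and a built-in reduce op (MPI_SUM); if even this
     * triggers MPI_ERR_OP, the MPI installation/launch is at fault. */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce ok: sum of ranks = %d (expected %d)\n",
               sum, nprocs * (nprocs - 1) / 2);

    MPI_Finalize();
    return 0;
}
```

If the reproducer runs fine but Qdyn6p still fails, the mismatch is more likely between the modules used at compile time and at run time, which is worth double-checking in the sbatch file.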