underworldcode / UWGeodynamics

Underworld Geodynamics

Issue with mpi on MonARCH HPC Cluster #138

Closed HanyMKhalil closed 5 years ago

HanyMKhalil commented 5 years ago

Dear Romain, I built a simple two-layer model in UWGeodynamics: a viscoplastic layer on top of a viscous layer, with a seed initially placed in the viscous layer. It runs fine at low resolution, but when I increase the resolution the model gets stuck, either while initialising or at the very beginning of the run. It does not give any errors; it just keeps running until the time limit is reached. What number of CPU cores (ntasks) do you recommend for a given resolution, and how does the model decompose the domain? I use Singularity to run on MonARCH. This is a copy of my Slurm file:

#!/bin/bash
#SBATCH --job-name=ModelC
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --partition=short
#SBATCH --mem=72G
#SBATCH --time=20:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=Hany.Khalil@monash.edu
#SBATCH --error=%j.errors
#SBATCH --output=%j.output

module purge
Xvfb :0 -screen 0 1600x1200x16& export DISPLAY=:0
module load python
module load singularity
module load singularity/3.0.2

# run underworld in docker
singularity exec --cleanenv /usr/local/underworld/2.8.0b/uwsingularity.simg mpirun -np ${SLURM_CPUS_ON_NODE} python ModelC.py

jmansour commented 5 years ago

@HanyMKhalil what is your model resolution? I'd suggest for 2d simulations, you aim for around 128x128 elements per process, while for 3d simulations around 32x32x32 elements per process.

Perhaps post your ModelC.py file up here.
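
As a rough illustration of the rule of thumb above, the sketch below estimates a process count from an element resolution. The function name `suggested_processes` and the `TARGET_EDGE` table are hypothetical, chosen only for this example; they are not part of Underworld or UWGeodynamics, and the result is a starting point rather than a guarantee.

```python
import math

# Rule-of-thumb targets quoted above: about 128 elements per side per process
# in 2D and about 32 elements per side per process in 3D (assumed here).
TARGET_EDGE = {2: 128, 3: 32}

def suggested_processes(resolution):
    """Estimate an MPI process count for a tuple of element counts per axis."""
    dim = len(resolution)
    if dim not in TARGET_EDGE:
        raise ValueError("resolution must have 2 or 3 entries")
    total_elements = math.prod(resolution)
    per_process = TARGET_EDGE[dim] ** dim
    # Round up so the average load per process stays at or below the target.
    return max(1, math.ceil(total_elements / per_process))

print(suggested_processes((256, 256)))        # 2D grid
print(suggested_processes((128, 128, 128)))   # 3D grid
```

The estimate only balances element counts; memory per node and the solver settings may still call for fewer or more processes.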

HanyMKhalil commented 5 years ago

Dear John, it's a 3D model with dimensions (250 km, 250 km, 40 km), and I am aiming for 1.25 km resolution, so my element resolution is (200, 200, 32). That's why I run it on 24 cores (I thought it would speed up the process).

I tried different element resolutions, even low ones like 64, but I still get the same issue.

Here is my model; I added .txt at the end so that I could upload it here.

ModelC.py.txt
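
To make the numbers above concrete, here is a small sketch of the arithmetic: it derives the element counts from the domain size and target resolution, then the average number of elements each of the 24 tasks would hold. The helper names are hypothetical, and the calculation assumes an even split across processes, which the actual domain decomposition may not achieve.

```python
def element_counts(domain_km, resolution_km):
    """Number of elements along each axis for a uniform grid."""
    return tuple(round(length / resolution_km) for length in domain_km)

def elements_per_process(counts, n_processes):
    """Average elements per MPI process, assuming an even split."""
    total = 1
    for n in counts:
        total *= n
    return total / n_processes

counts = element_counts((250.0, 250.0, 40.0), 1.25)
print(counts)                                   # (200, 200, 32)
print(round(elements_per_process(counts, 24)))  # 53333
```

For comparison, 32x32x32 is 32,768 elements, so 24 tasks would each carry roughly 1.6 times the suggested load per process.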

HanyMKhalil commented 5 years ago

The problem is that it runs very slowly without giving any error messages, and at some point it hangs, usually at the model coordinates or model initialisation step. No outputs are produced, yet it keeps running until the time limit is reached.

PatriceFRey commented 5 years ago

@HanyMKhalil, try 2.5 km resolution. I am running models with dimensions 384x256x128 km. At a grid resolution of 2 km it is slow, but I manage to get 10 myr over 96 hours. At 1.6 km it is five times slower... Patrice

HanyMKhalil commented 5 years ago

Thanks Patrice, I will try that. It's a relief that someone is doing something similar to mine. Can I ask what configuration you use, e.g. how many cores and how much memory?

PatriceFRey commented 5 years ago

These UWGeodynamics (v2.7.7) models run on 128 CPUs with mem=700GB. You can check an example on Instagram (bghatlas).

HanyMKhalil commented 5 years ago

Thanks a lot. I certainly cannot get that number of cores on MonARCH, but I will lower my resolution to something very coarse to see if it works and then try to go up. I suspect I have a problem either in matching the number of cores to my grid resolution or in the way the model is submitted to MonARCH, i.e. something in parallel computing that I do not know about.

HanyMKhalil commented 5 years ago

Is there any chance you could run my model, even for 0.1 m.y., just to test whether the problem is in the model itself or in the submission to MonARCH? That would be very helpful, because I guess you are not using MonARCH.

PatriceFRey commented 5 years ago

Sure, I can test your model on Raijin. Send your input file to patrice.rey@sydney.edu.au.

HanyMKhalil commented 5 years ago

Dear Romain, I have attached the errors file and the output file from running the model on MonARCH. It looks like an issue with the way Python is compiled in the Docker image, with the associated outputs from the model. The model runs fine until this point, so it must be a UW problem, not a Docker problem.

4785285.errors.txt

4785285.output.txt

jmansour commented 5 years ago

So it ran to 100000 years. Was that the stop point? Kinda looks like it completed, but then failed to tear down cleanly, which is still not ideal, but not really a problem either.

HanyMKhalil commented 5 years ago

Dear John, yes, this is the end. However, at high resolution it keeps running forever without producing any output, and the error file is empty.

jmansour commented 5 years ago

Yep, but they're two separate issues.

rcarluccio commented 5 years ago

Hi Hany,

Have you tried running your HR job with mycode.py>test.log? That would generate a proper log file.


rbeucher commented 5 years ago

Hi @HanyMKhalil,

Sorry I was away for a while.

The error is related to python not stopping cleanly at the end of the model. This is a known issue. This should disappear in the next version of UW (2.9). It is annoying but should not affect your model results.

I am closing this.