underworldcode / UWGeodynamics

Underworld Geodynamics

Process stuck at "finished update of weights for swarm "PCZVT0Y7__swarm"". #234

Closed: Peigen-L closed this issue 3 years ago

Peigen-L commented 3 years ago

Hi, I am using an HPC system to run my model with UWGeodynamics v2.10.1. When I increased my model resolution from (128, 64, 64) to (296, 96, 96), the computation stopped at "finished update of weights for swarm "PCZVT0Y7__swarm"" and no HDF5 files were written to the target directory. Here is the log message:

I contacted the HPC support team and they think I should contact the developers for help.

Some rc settings I used for this model:

julesghub commented 3 years ago

The local range of elements, 74x48x48, is extremely high. I believe you're running out of RAM on a given CPU. Were any other log files produced by the HPC system, e.g. error logs?

I suggest adding many more CPUs; try a factor of 16 increase. For the (128,64,64) run, what was the local range? Try to produce a similar range for this run.

Peigen-L commented 3 years ago

For (128,64,64), the local range is 16x16x8. I wonder how to increase the resolution without changing the local range per rank? I have tried with 64 and 128 CPUs, but the issue remains the same. I didn't see the program running out of memory, but I saw that some of the MPI ranks were empty, which suggests Underworld2 couldn't split my job into small, even MPI tasks.

julesghub commented 3 years ago

I wonder how to increase the resolution without changing the local range per rank?

For 2D models, if you double the resolution you'll require 4x the CPU count. For 3D models, 2x the resolution means 8x the CPU count.
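As a rough illustration of that scaling (a sketch only; how the mesh is actually decomposed depends on Underworld/PETSc), you can estimate the CPU count needed to keep the number of elements per CPU roughly constant:

import math

def cpus_for_same_local_size(old_res, new_res, old_cpus):
    """Estimate the CPU count that keeps elements-per-CPU roughly constant."""
    old_elems = old_res[0] * old_res[1] * old_res[2]
    new_elems = new_res[0] * new_res[1] * new_res[2]
    return math.ceil(old_cpus * new_elems / old_elems)

# Doubling a 3D resolution multiplies the element count by 8,
# so the CPU count needs to grow by roughly 8x as well:
print(cpus_for_same_local_size((128, 64, 64), (256, 128, 128), 64))  # -> 512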

I didn't see the program running out of memory. I saw that some of the MPI ranks were empty, which suggests Underworld2 couldn't split my job into small, even MPI tasks.

Underworld couldn't decompose the job across every processor?! That sounds like a configuration/compilation issue, or perhaps a system runtime environment issue.

Which HPC machine are you using? Do you have some information on it? Can you send: 1) the config.cfg file located at underworld/libUnderworld/config.cfg, and 2) the execution script for the HPC machine (not the Underworld Python script)?

Peigen-L commented 3 years ago

Hi @julesghub. Before we check the system environment or how Underworld was configured/compiled, could you please give me an email address so I can share my code (.py and/or .ipynb files) with you, to see if you can reproduce what I get? I am concerned that it might be a coding mistake on my part.

I would also like to share the conversation log with our HPC support below; please have a look at what is going on when there are "too many" MPI tasks for a small model.

Reason found. Rather than being too big, which is what I worried about, your model is actually too small for (64 cores/node x 2 nodes).
I made a copy of your test to /p9/mcc_betmap/3d_subduction_modelling/Test/3D_LONGER_GAP_YF.
The job seems to be running well on one node. Job ID 43451577.
[yongjiaf@pud198 3D_LONGER_GAP_YF]$ pwd
/p9/mcc_betmap/3d_subduction_modelling/Test/3D_LONGER_GAP_YF
[yongjiaf@pud198 3D_LONGER_GAP_YF]$ diff 3D_MODEL.sh ../3D_LONGER_GAP/3D_MODEL.sh 
2c2
< #rj name="3D_model_longergap_yf" queue=betmap runtime=24 features=knl&roce nodes=1 taskspernode=64
---
> #rj name="3D_model_longergap" queue=betmap runtime=24 features=knl&roce nodes=2 taskspernode=64
11c11
< # cd 3D_LONGER_GAP_YF
---
> cd 3D_LONGER_GAP
@p.luo test done.
[yongjiaf@pud198 logs]$ pwd
/p9/mcc_betmap/3d_subduction_modelling/Test/3D_LONGER_GAP_YF/logs
[yongjiaf@pud198 logs]$ ls -ltr
total 64
-rw-rw---- 1 yongjiaf mcc_betmap  1617 May  7 17:08 3D_model_longergap_yf.o43451292
-rw-rw---- 1 yongjiaf mcc_betmap 15975 May  7 17:21 3D_model_longergap_yf.o43451577
[yongjiaf@pud198 logs]$ tail 3D_model_longergap_yf.o43451577
2021-05-07T17:20:40+0800:   done 67% (1366 cells)...
2021-05-07T17:20:41+0800:   done 100% (2048 cells)...
2021-05-07T17:20:41+0800: WeightsCalculator_CalculateAll(): finished update of weights for swarm "6OOSELLK__swarm"
2021-05-07T17:20:48+0800: In func WeightsCalculator_CalculateAll(): for swarm "6OOSELLK__swarm"
2021-05-07T17:20:49+0800:   done 33% (683 cells)...
2021-05-07T17:20:49+0800:   done 67% (1366 cells)...
2021-05-07T17:20:50+0800:   done 100% (2048 cells)...
2021-05-07T17:20:50+0800: WeightsCalculator_CalculateAll(): finished update of weights for swarm "6OOSELLK__swarm"
2021-05-07T17:21:23+0800: Step:     2 Model Time: 72944.9 year dt: 36167.0 year (2021-05-07 17:21:23)
2021-05-07T17:21:27+0800: == JOB END  STATUS=0 HOST=pnod2-20-45 DATE=Fri May  7 17:21:27 AWST 2021 RUNTIME=703 ==

Thank you so much for your help.

Kind regards

Peigen

julesghub commented 3 years ago

@Peigen-L - for sharing .ipynb or .py files we prefer to use GitHub repos rather than email. I have made a private repo for you to upload the files to at https://github.com/underworld-community/Peigen-HPC. Please upload the files there and I can take a look at them.

Which HPC are you running the code on?

Peigen-L commented 3 years ago

Thank you so much, @julesghub. I am using the HPC from DUG McCloud. I will upload the files to the repo later.

julesghub commented 3 years ago

Thanks for the file uploads. Two things I'll point out: 1) MUMPS with 3D models is generally a bad idea; it consumes a lot of memory and resources for jobs as large as yours. Please comment out this line in your models: Model.solver.set_inner_method("mumps"). Can you check whether the original "working" model at (128,64,64) runs reasonably without MUMPS?
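A minimal sketch of that change in the model script; "mg" is assumed here to be the multigrid inner-method name accepted by the solver (as in Underworld2), so check it against your installed version:

# Model.solver.set_inner_method("mumps")   # direct inner solve: very memory-hungry for large 3D meshes
Model.solver.set_inner_method("mg")         # multigrid inner solve (assumed option name), lighter on memory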

2) DUG McCloud: I have no experience using it. From the failed log file I see:

2021-05-20T09:50:31+0800: # SLURMD_NODENAME: pnod2-19-44
2021-05-20T09:50:31+0800: # SLURM_NTASKS: 64
2021-05-20T09:50:31+0800: # SLURM_NTASKS_PER_NODE: 64
2021-05-20T09:50:31+0800: # SLURM_JOB_NODELIST: pnod2-19-44

Can you increase the number of nodes listed to get access to more CPUs? The cloud docs/admin should be able to tell you how. Please share here if you find out, I'm curious :thinking: A possible test of multi-node usage would be to run the "working" model over 2+ nodes, i.e. more than 64 CPUs.

As previously mentioned, a reasonable CPU count for the "hiRes" model would be ~8*64 (512) CPUs, considering it is double the resolution of the working model. If that works you can likely reduce the CPU count to find more optimal compute conditions.
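For example, reusing the #rj directive format from the job script shown earlier (queue, runtime and features copied from that script; the job name is hypothetical and the node count is just the suggested 512-CPU target, not a tested configuration):

#rj name="3D_model_hires" queue=betmap runtime=24 features=knl&roce nodes=8 taskspernode=64

That requests 8 nodes x 64 tasks/node = 512 MPI tasks.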

Peigen-L commented 3 years ago

Hi @julesghub, thank you for your comments. I use MUMPS because I have always had the best luck with it. I will try other solver options with more CPUs today and report the results later.

Peigen-L commented 3 years ago
2021-05-21T18:51:52+0800: == JOB START  NAME=3D_GAP_UWGEO_no_gap_fixed_2 QUEUE=betmap HOST=pnod1-12-27 DATE=Fri May 21 18:51:52 AWST 2021 JOBID=43615359 ARRAY_JOBID= TASK= DEPENDENCIES= ==
2021-05-21T18:51:52+0800: # SLURMD_NODENAME: pnod1-12-27
2021-05-21T18:51:52+0800: # SLURM_NTASKS: 256
2021-05-21T18:51:52+0800: # SLURM_NTASKS_PER_NODE: 64
2021-05-21T18:51:52+0800: # SLURM_JOB_NODELIST: pnod1-12-27,pnod1-13-32,pnod2-18-4,pnod2-19-37
2021-05-21T18:52:05+0800: [NbConvertApp] Converting notebook 3D_GAP_UWGEO_no_gap_fixed.ipynb to python
2021-05-21T18:52:08+0800: [NbConvertApp] Writing 20762 bytes to 3D_GAP_UWGEO_no_gap_fixed.py
2021-05-21T18:53:24+0800: loaded rc file /p9/mcc_betmap/sw/intel-python/lib/python3.7/site-packages/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
2021-05-21T18:53:24+0800:   Global element size: 256x128x128
2021-05-21T18:53:24+0800:   Local offset of rank 0: 0x0x0
2021-05-21T18:53:24+0800:   Local range of rank 0: 32x32x16
2021-05-21T18:53:38+0800: In func WeightsCalculator_CalculateAll(): for swarm "XTCTLA4G__swarm"
2021-05-21T18:53:45+0800:   done 33% (5462 cells)...
2021-05-21T18:53:51+0800:   done 67% (10923 cells)...
2021-05-21T18:53:57+0800:   done 100% (16384 cells)...
2021-05-21T18:53:57+0800: WeightsCalculator_CalculateAll(): finished update of weights for swarm "XTCTLA4G__swarm"
2021-05-21T20:23:25+0800: Running with UWGeodynamics version 2.10.2
2021-05-21T20:23:25+0800: Options:  -Q22_pc_type gkgdiag -force_correction True -ksp_type bsscr -pc_type none -ksp_k2_type NULL -rescale_equations False -remove_constant_pressure_null_space False -change_backsolve False -change_A11rhspresolve False -restore_K False -A11_ksp_type fgmres -A11_ksp_rtol 1e-4 -scr_ksp_type fgmres -scr_ksp_rtol 1e-3

Hi, @julesghub

After a two-hour wait with 256 CPUs, the update of weights for the swarm finished and the run continued. I used the default solver, "fgmres". These are the solver options I used:

GEO.rcParams["initial.nonlinear.tolerance"] = 1e-2
GEO.rcParams['initial.nonlinear.max.iterations'] = 50
GEO.rcParams["nonlinear.tolerance"] = 1e-2
GEO.rcParams['nonlinear.max.iterations'] = 50
GEO.rcParams["popcontrol.particles.per.cell.3D"] = 40
GEO.rcParams["swarm.particles.per.cell.3D"] = 40
Model.solver.options.A11.ksp_rtol = "1e-4" # inner rtol
Model.solver.options.scr.ksp_rtol = "1e-3" # outer rtol
Model.solver.set_penalty(1e2)
Peigen-L commented 3 years ago

I think this issue has been solved, thank you so much.

julesghub commented 3 years ago
2021-05-21T18:51:52+0800: # SLURM_NTASKS: 256
2021-05-21T18:51:52+0800: # SLURM_NTASKS_PER_NODE: 64
2021-05-21T18:51:52+0800: # SLURM_JOB_NODELIST: pnod1-12-27,pnod1-13-32,pnod2-18-4,pnod2-19-37

That looks more promising. Great work!

A two-hour wait for the swarm weighting doesn't seem good. Consider pushing to a higher CPU count to minimise the per-CPU work for the swarm weighting.

The solver options look good. Ideally the solve should be the longest part of a model run, not the swarm weighting, particle advection, or writing to disk (checkpointing).

Peigen-L commented 3 years ago

Thank you @julesghub. I have also run a test of the situation "when there are 'too many' MPI tasks for a small model". It is a little strange that Underworld just hangs there and does nothing.

There were only 32 cells in the job JobID=43611844, but there were (4 nodes x 64 tasks/node) = 256 MPI tasks for it.
The job should not have been configured like that. Please check what happened.
Underworld should have given a warning and quit with an error, rather than hanging there.
Peigen-L commented 3 years ago

I also want to improve HPC efficiency by "telling" the compute nodes what to do, so the MPI tasks are "perfectly parallel": every task kept busy, with no wasted resources. How can I achieve that "perfectly parallel" state?

julesghub commented 3 years ago

"Perfectly parallel" is difficult because it all depends on the model configuration and hardware setup.

When using fgmres your model is using multigrid (MG). The MG strategy works best with element counts that can be halved repeatedly in each direction, as it iterates by halving the mesh along x, y and z. In general, the more halvings available, the better MG performs.
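As a rough way to see why this matters (an illustration only, not Underworld's actual level-selection logic), you can count how many times each direction's element count can be halved before hitting an odd number:

def mg_halvings(n):
    """Count how many times n can be halved before an odd number is reached."""
    levels = 0
    while n > 1 and n % 2 == 0:
        n //= 2
        levels += 1
    return levels

for n in (96, 128, 256):
    print(f"{n} elements -> {mg_halvings(n)} halvings")
# prints: 96 -> 5 halvings, 128 -> 7 halvings, 256 -> 8 halvings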

The best thing to do is a "parallel test". 1) Take the model you want to use and run it for only xxx timesteps, assuming those xxx steps capture the characteristic complexity of the model evolution; let's say xxx = 6 for now. 2) Using the exact same model (on the same machine, with the same version of the code), try running it on 2x and 4x the CPUs. Does the runtime improve over the xxx steps? 3) Use the above information to further seek the optimum CPU count for the model configuration.
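One way to compare the runs from step 2, a small sketch with hypothetical wall-clock times (replace the numbers with your measured values):

# Wall-clock seconds for the same xxx=6 steps at each CPU count (hypothetical values).
timings = {64: 5400.0, 128: 2900.0, 256: 1700.0}

base_cpus = min(timings)
base_time = timings[base_cpus]
for cpus in sorted(timings):
    speedup = base_time / timings[cpus]
    efficiency = speedup / (cpus / base_cpus)
    print(f"{cpus:4d} CPUs: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")

Parallel efficiency dropping well below 100% suggests you are approaching over-decomposition for that model size.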

Be aware that changes to the model, e.g. different resolution, different solver options, or more writing to disk, could significantly affect the model performance. When making a change like this, note it down.

Be aware of over-decomposition situations, when a model is run on more CPUs than it needs: communication overhead between CPUs can dominate the runtime.

Peigen-L commented 3 years ago

Thank you @julesghub. I doubled the number of CPUs again and it works fine.