underworldcode / underworld2

underworld2: A parallel, particle-in-cell, finite element code for Geodynamics.
http://www.underworldcode.org/

UW2.10.1b is ~60% slower than UW2.7.1b for the same Python script #527

Closed tingyang2004 closed 3 years ago

tingyang2004 commented 3 years ago

Hi all,

I compared the model results and CPU time for UW2.10.1b and UW2.7.1b with a simple visco-plastic slab subduction model. The model has a resolution of 400x120 elements and runs on 10 cores, saving data every 20 steps (two snapshots of the viscosity field, at step 0 and step 200, are shown below). Although the two UW versions gave extremely close results after running for 200 steps, UW2.10.1b seems about 60% slower than UW2.7.1b (see the CPU times below). Do you know what may have caused this difference in computational efficiency? Please let me know if more information is needed. Thanks a lot.

Best regards, Ting

[screenshots: viscosity field at step 0 and step 200]

[screenshot: CPU time comparison for UW2.7.1b and UW2.10.1b]
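For anyone reproducing this comparison, per-step wall-clock time can be collected inside the script itself rather than read off the total at the end. A minimal sketch in plain Python; the `step` callable is a placeholder for the Underworld solve/advect work, not part of the original script:

```python
import time

def run_with_timings(step, n_steps):
    """Run `step(i)` n_steps times and return per-step wall times.

    `step` is a placeholder; in a real Underworld script it would
    wrap solver.solve(...) and the advection calls.
    """
    timings = []
    for i in range(n_steps):
        t0 = time.perf_counter()
        step(i)
        timings.append(time.perf_counter() - t0)
    return timings

# Example with a dummy workload standing in for the solve.
timings = run_with_timings(lambda i: sum(range(1000)), 5)
print(f"total: {sum(timings):.6f}s over {len(timings)} steps")
```

Logging per-step times this way makes it easy to see whether a slowdown is concentrated in particular steps or spread evenly across the run.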

julesghub commented 3 years ago

Hi Ting, do you have the config.cfg files available for each build? If they achieve similar results, I'm guessing it's an installation inefficiency. For version 2.10 the config.cfg is located at /underworld/libUnderworld/.

Another idea is that the solver tolerances may be different between the versions (@jmansour, did we tweak that at some stage?). Do you have the output logs available for each run?

tingyang2004 commented 3 years ago

Thanks, Julian. The comparison was done on my own desktop using the Docker images:

docker run -v $PWD:/home/jovyan/ --rm underworldcode/underworld2:2.10.1b mpirun -np 10 python 4ASlabSubduction.py

vs

docker run -v $PWD:/workspace/ --rm underworldcode/underworld2:2.7.1b mpirun -np 10 python 4SlabSubduction.py

Here is the Stokes solver setup: [screenshot: solver configuration]

The log files seem to suggest that the SCR RHS setup time in UW2.10.1b is longer than in UW2.7.1b. [screenshot: solver log excerpts]

tingyang2004 commented 3 years ago

Here is the setup for the nonlinear part:

solver.solve(nonLinearIterate=True, nonLinearMaxIterations=200, nonLinearTolerance=0.003)

tingyang2004 commented 3 years ago

Is there any progress on this solver efficiency issue?

jmansour commented 3 years ago

Hi Ting.

I'm unable to reproduce this. Running the standard slab subduction model against UW 2.7 & 2.10, the timings are relatively close for me.

Can you post UW 2.7 & 2.10 compatible versions of your script?

julesghub commented 3 years ago

[screenshot: serial timing comparison]

I'm able to reproduce this in serial, though I'm unclear what the cause could be. I'm thinking it's the number of integration points (Gauss points) used.

julesghub commented 3 years ago

This was using the 06_SlabSubduction.ipynb in 2.10.1b and 2.7.1b

jmansour commented 3 years ago

Which model are you running @julesghub? Note that in 2.10, the slab subduction model defaults to mumps in serial, while in 2.7 it'll use lu.

julesghub commented 3 years ago

Yeah, I have seen that, but I'm not sure it's significant here; trying to test now. I'm running the model mentioned above.

tingyang2004 commented 3 years ago

Thanks, both,

Below are my python scripts for 2.10.1b (4SlabSubduction.py.txt) and 2.7.1b (4ASlabSubduction.py.txt). 4ASlabSubduction.py.txt 4SlabSubduction.py.txt

I observed this issue both on my own desktop (docker) and on my uni's hpc.

julesghub commented 3 years ago

Thanks for the models, Ting. I'm still investigating what's going on with the example 06_SlabSubduction.ipynb model. It shows the same behaviour in 2.10 & 2.7 when the model's inner solve method is set to mumps (like your models), i.e. 2.7 is quicker!

lmoresi commented 3 years ago

Interesting - the RHS setup time presumably includes building the SCR preconditioner, which seems to have suddenly become more expensive but not more effective (the iteration counts have not changed at all). That could be something to do with Gauss points vs particles, as the default preconditioner is built by finding the average viscosity in an element.

julesghub commented 3 years ago

Using mumps on 06_SlabSubduction, 2.10 on the right, 2.7 on the left. The pressure solve time is the difference. [screenshot: side-by-side solver timings]

Interestingly, when using lu for the inner solver (rather than mumps), the timings are very similar.

jmansour commented 3 years ago

Is it possibly a difference in mumps versions? Although I'll note that @tingyang2004 observed this issue on an HPC system too, where I'd assume the same version of mumps was used in both the UW2.7 and UW2.10 tests.

julesghub commented 3 years ago

Potentially. I'm not sure how to check which version of mumps PETSc pulls down.

@tingyang2004 did you use the dockers on HPC, or compiled code? If compiled code, do the two versions use consistent PETSc/mumps versions?

tingyang2004 commented 3 years ago

I did not check the versions of PETSc I used on HPC, but I assume they are different given the year-and-a-half interval between the installations. What is a convenient way to check the PETSc version?
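One option that doesn't require rerunning the build: PETSc records its version as preprocessor macros in include/petscversion.h, so the triple can be read straight out of that header. A sketch of the parsing; the sample text below mimics the header format, and the actual path under a given PETSc install may differ:

```python
import re

def parse_petsc_version(header_text):
    """Extract (major, minor, subminor) from petscversion.h contents."""
    version = {}
    for part in ("MAJOR", "MINOR", "SUBMINOR"):
        m = re.search(rf"#define\s+PETSC_VERSION_{part}\s+(\d+)", header_text)
        version[part] = int(m.group(1))
    return (version["MAJOR"], version["MINOR"], version["SUBMINOR"])

# Sample mimicking the header; in practice read the text from
# $PETSC_DIR/include/petscversion.h
sample = """
#define PETSC_VERSION_MAJOR      3
#define PETSC_VERSION_MINOR      12
#define PETSC_VERSION_SUBMINOR   4
"""
print(parse_petsc_version(sample))  # (3, 12, 4)
```

If petsc4py is available in the environment, `petsc4py.PETSc.Sys.getVersion()` reports the same triple at runtime.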

tingyang2004 commented 3 years ago

After checking the PETSc make.log, the versions are 3.10.5 (uw2.7.1b) and 3.12.4 (uw2.10.1b), respectively.

tingyang2004 commented 3 years ago

It looks to me like the newer version of PETSc (3.12.4) has slowed down the Stokes solver: using different versions of uw with the same version of PETSc gives similar solver times. So, is there a convenient way to make uw use the older version of PETSc in Docker?

tingyang2004 commented 3 years ago

Since PETSc 3.12.4 is newer, I would assume that with deliberate tuning it should be faster than, or at least comparable in speed to, the older versions (e.g., 3.10.5 here)?

jmansour commented 3 years ago

I ran some tests using PETSc 3.10.5 against both UW 2.7 & 2.10. While the results were identical for lu, for mumps there were definite differences, with the older UW sometimes faster and sometimes the newer. It's somewhat strange, but it does appear to be due to a change in how we use PETSc.

Unfortunately, I don't think we can spend more time on this, as it's somewhat of a niche issue and very difficult to debug. So if the performance hit is too much, I'd suggest sticking with the older Underworld, or perhaps investigating superludist. This relatively recent publication suggests it does better than mumps in their testing configuration:

https://cug.org/proceedings/cug2016_proceedings/includes/files/pap121s2-file1.pdf

tingyang2004 commented 3 years ago

Strange. How much is the mumps time difference between UW 2.7 and 2.10? I will stick with PETSc 3.10.5 on HPC and UW 2.7 in Docker for the moment, then.

jmansour commented 3 years ago

It wasn't usually dramatic: around 20%, give or take, from memory.

I'd suggest you at least try superludist. It should be installed in your Docker image, and possibly on your HPC too, depending on how PETSc was configured. You'd simply invoke solver.set_inner_method("superludist").
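Since the inner method is just a string passed to set_inner_method, a typo only surfaces deep in the solver setup, so a small guard can help when scripting comparisons across methods. A sketch; the method names are only the ones discussed in this thread plus the "mg" default (not an exhaustive list), and `solver` stands in for an Underworld Stokes solver object:

```python
# Inner solve methods mentioned in this thread, plus the "mg" default.
# Illustrative, not exhaustive; check your Underworld version's docs
# for the full set of accepted names.
KNOWN_INNER_METHODS = {"mg", "lu", "mumps", "superludist"}

def set_inner_method_checked(solver, method):
    """Validate `method` before handing it to the Underworld solver."""
    if method not in KNOWN_INNER_METHODS:
        raise ValueError(
            f"unknown inner method {method!r}; "
            f"expected one of {sorted(KNOWN_INNER_METHODS)}")
    solver.set_inner_method(method)
```

This keeps a batch of benchmark runs from silently misconfiguring one of the cases.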

tingyang2004 commented 3 years ago

Thanks, John. I will check if superludist is faster in the next few weeks.

jmansour commented 3 years ago

Traditionally people have generally had better luck with mumps, but superlu_dist seems to be under active development, so it's definitely worth a try.

Let us know what you find.

tingyang2004 commented 3 years ago

Definitely, will report back when it's done.

tingyang2004 commented 3 years ago

I did not check the details, but changing mumps to superludist directly in 2.7.1 shows little influence on the solve speed. However, changing mumps to superludist in 2.10.1 slows the solve significantly (by around 50 times). So uw2.7.1 with mumps looks like the best choice at present.

The tests were done in my Docker setup.

julesghub commented 3 years ago

Closing this ticket. @tingyang2004, thanks for raising this issue. We are planning to implement performance metrics because of this kind of issue. Cheers!