libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

Multithreading with MPI #2575

Closed manavbhatia closed 4 years ago

manavbhatia commented 4 years ago

I am curious about the libMesh team's approach to multi-threading when PETSc is also being used. I see that Threads have been used at various locations in the code. However, PETSc does not seem to encourage the use of threads.

So, if there is a multicore HPC system with, say, 16 cores per node, and a job uses 4 such nodes for an MPI communicator of size 64, would multithreading automatically be turned off there?

On the other hand, what would be a good use case of multithreading on such machines to get maximum efficiency from both libMesh and PETSc?

jwpeterson commented 4 years ago

We do use some threaded element loops. One idea would be to search for Threads:: in the code to get some idea of how/where threads are used. The main things we use are parallel_for, parallel_reduce, and spin_mutex for acquiring locks. You can see some of the precautions we have to take when working with PETSc objects in petsc_vector.C and petsc_matrix.C. Basically you have to ensure only one thread reads/writes to a PETSc data structure at a time.

manavbhatia commented 4 years ago

Thanks. I am still unclear about how well PETSc can benefit from a multi-threaded run. Do you know if its solvers can do both MPI and multi-threading at the same time?

For instance, in my example above, if I have 4 nodes with 16 cores each, would there be 1 MPI rank on each node with 16 threads each? If so, do you know if PETSc solvers will switch to a multithreaded run on each node?

If not, and if there are a total of 64 MPI processes (16 on each node), then I don't see how running multiple threads in libMesh would be helpful, since they will be competing for the same resources on a node. On the same note, I am not clear how many threads libMesh will launch for such a run on each node. Maybe I am missing some connecting thought here?

jwpeterson commented 4 years ago

Thanks. I am still unclear about how well PETSc can benefit from a multi-threaded run. Do you know if its solvers can do both MPI and multi-threading at the same time?

Pretty sure they don't in general, but I seem to recall that Hypre BoomerAMG might use OpenMP? Either way, these questions would be better asked on their mailing lists.

I am not clear how many threads libMesh will launch for such a run on each node.

We use the --n-threads command line option to control this. Additional threads are only active in threaded regions of the code, i.e. the parallel_for regions I mentioned earlier. They won't have any effect on a "typical" libMesh assembly loop, for example, since those are generally not threaded.

Given 4 nodes with 16 cores each, you should experiment to find out what combination of MPI+threads gives you the best results. In general I think you wouldn't want to "oversubscribe" the number of available cores, so you could try 64x1, 32x2, 16x4, etc. where the numbers are "number of MPI processes" and "number of threads", respectively. If your application spends much time in threaded regions you should see some benefit to adding threads, but if it doesn't (i.e. if most time is already spent in PETSc) then obviously it won't help much to add threads.
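A sketch of what those experiments might look like on the command line (the mpiexec launcher and the my_app binary name are placeholders; adjust to your scheduler and build). The loop only emits combinations that fit within the 64 available cores, i.e. it avoids oversubscription:

```shell
# Hypothetical launch sketch for a 4-node x 16-core (64-core) allocation.
CORES=64
valid=""
for combo in "64 1" "32 2" "16 4"; do
  set -- $combo
  ranks=$1
  threads=$2
  # Skip any combination that would oversubscribe the available cores.
  if [ $((ranks * threads)) -le "$CORES" ]; then
    cmd="mpiexec -n $ranks ./my_app --n-threads=$threads"
    echo "$cmd"
    valid="$valid$cmd;"
  fi
done
```

In practice you would time each configuration on a representative problem and keep whichever ranks-times-threads split wins for your workload.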

roystgnr commented 4 years ago

Given 4 nodes with 16 cores each, you should experiment to find out what combination of MPI+threads gives you the best results.

IIRC, from a strict CPU time perspective, "64 ranks with 1 thread each" is what I would expect to win here. In the past we've always been limited by PETSc (some of their third-party preconditioners were multithreaded, but if you depended more on the Krylov loop you wouldn't get good threading scalability), and that's caused a bit of a chicken-and-egg problem in libMesh itself: not enough of us use threads, so not enough effort goes into optimizing the threaded code paths, so threaded scalability isn't as good as MPI scalability, so we have little incentive to start using threads more, goto 10. It's been years since I looked at the issue, though, so it's possible, for example, that PETSc is doing great with threading now and we need to put in some work to catch up.

On the other hand, memory use can be a great reason to use threads. Even if you're on a DistributedMesh you can save a little memory by using fewer ranks (and hence fewer layers of ghost cells), and if your code only supports ReplicatedMesh then a 4x16 run will require far less RAM than a 64x1 run.
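To make the ReplicatedMesh arithmetic concrete, here is a rough sketch (the 2 GB mesh size is hypothetical; the 4-node layout comes from the example above). With a replicated mesh every MPI rank holds a full copy, so per-node mesh memory scales with ranks per node, not with threads:

```python
# Rough per-node memory for the mesh alone under ReplicatedMesh, where
# every MPI rank stores a full mesh copy. Mesh size is hypothetical.
def mesh_gb_per_node(ranks, nodes=4, mesh_gb=2.0):
    """GB of replicated mesh storage on each node."""
    ranks_per_node = ranks // nodes
    return ranks_per_node * mesh_gb

for ranks, threads in [(64, 1), (32, 2), (16, 4), (4, 16)]:
    print(f"{ranks} ranks x {threads} threads -> "
          f"{mesh_gb_per_node(ranks):.0f} GB of mesh per node")
```

So under these assumptions a 64x1 run stores 16 mesh copies per node (32 GB) while a 4x16 run stores just one (2 GB), which is the "far less RAM" trade-off described above.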

dknez commented 4 years ago

I think PETSc's answer on this issue is here: https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html

"The core PETSc team has come to the consensus that pure MPI using neighborhood collectives and the judicious using of MPI shared memory (for data structures that you may not wish to have duplicated on each MPI process due to memory constraints) will provide the best performance for HPC simulation needs on current generation systems, next generation systems and exascale systems. It is also a much simpler programming model then MPI + threads (leading to simpler code)."

roystgnr commented 4 years ago

I think, sadly, this is resolved about as well as the debate here can be.