roystgnr opened this issue 6 years ago
Oh, almost forgot one!
Actually, @fdkong has been looking at this very issue recently and we DO think it's a bigger issue than you might expect. It really starts to matter when you are running a problem with several DOFs per node (e.g. phase_field in 3D, so like ~30), combined with the suboptimal partitions that we get from Metis, and wham! Our scalability goes in the toilet when we get up into the ~1000 processor range at ~20K DOFs per proc. A few dozen extra elements on a partition, with all of those extra nodes assigned to the same processor, leads to load imbalance in the neighborhood of 10%. I believe Fande has some hard numbers to back this up.
We are applying for an LDRD to take a look at new partitioners and also the possibility of tuning the assignment of nodes to procs.
Actually no, it's way worse; I just got done talking to Fande. On the test case he ran, the load imbalance between rank 0 and the smallest rank was a ratio of 1.9!
And that's with several hundred elements per proc? Ok, I'll try and put together a better node assignment option ASAP, because that ought to be easy, but it sounds like fixing partitioning is a much more important problem, even though that ought to be hard.
It would be great if you could do a smarter node assignment, @roystgnr. The imbalance is worse than we would expect from the naive node assignment, and it is the scaling bottleneck even at a few hundred processor cores.
Putting together a better node assignment strategy is actually pretty tough, but we'd definitely be interested in hearing your ideas. Fande chatted with Barry Smith about this issue to see what other PETSc users do. He mentioned "random" is actually not terrible, which I thought was interesting. I'm sure we can gain some insight by digging into the literature as well. Fande, can you post your numbers here from what you've found?
```
Parallelism:
  Num Processors: 128
  Num Threads: 1
Mesh:
  Parallel Type: distributed
  Mesh Dimension: 3
  Spatial Dimension: 3
  Nodes:
    Total: 2464461
    Local: 22167
    Local Min: 16751
    Local Max: 22167
    Node Ratio: 1.32332
  Elems:
    Total: 2400000
    Local: 18751
    Local Min: 18700
    Local Max: 18800
    Element Ratio: 1.00535
  Num Subdomains: 1
  Num Partitions: 128
  Partitioner: parmetis
Nonlinear System:
  Num DOFs: 22180149
  Num Local DOFs: 199503
  Local DOFs Min: 150759
  Local DOFs Max: 199503
  DOF Ratio: 1.32332
  Variables: { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" }
  Finite Element Types: "LAGRANGE"
  Approximation Orders: "FIRST"
Auxiliary System:
  Num DOFs: 12064461
  Num Local DOFs: 97171
  Local DOFs Min: 91551
  Local DOFs Max: 97171
  DOF Ratio: 1.06139
  Variables: "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" }
  Finite Element Types: "LAGRANGE" "MONOMIAL"
  Approximation Orders: "FIRST" "CONSTANT"
```
```
Parallelism:
  Num Processors: 256
  Num Threads: 1
Mesh:
  Parallel Type: distributed
  Mesh Dimension: 3
  Spatial Dimension: 3
  Nodes:
    Total: 2464461
    Local: 11486
    Local Min: 7737
    Local Max: 11486
    Node Ratio: 1.48455
  Elems:
    Total: 2400000
    Local: 9504
    Local Min: 8840
    Local Max: 9841
    Element Ratio: 1.11324
  Num Subdomains: 1
  Num Partitions: 256
  Partitioner: parmetis
Nonlinear System:
  Num DOFs: 22180149
  Num Local DOFs: 103374
  Local DOFs Min: 69633
  Local DOFs Max: 103374
  DOF Ratio: 1.48455
  Variables: { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" }
  Finite Element Types: "LAGRANGE"
  Approximation Orders: "FIRST"
Auxiliary System:
  Num DOFs: 12064461
  Num Local DOFs: 49502
  Local DOFs Min: 43915
  Local DOFs Max: 49686
  DOF Ratio: 1.13141
  Variables: "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" }
  Finite Element Types: "LAGRANGE" "MONOMIAL"
  Approximation Orders: "FIRST" "CONSTANT"
```
```
Parallelism:
  Num Processors: 512
  Num Threads: 1
Mesh:
  Parallel Type: distributed
  Mesh Dimension: 3
  Spatial Dimension: 3
  Nodes:
    Total: 2464461
    Local: 5982
    Local Min: 3337
    Local Max: 5982
    Node Ratio: 1.79263
  Elems:
    Total: 2400000
    Local: 4895
    Local Min: 3983
    Local Max: 4952
    Element Ratio: 1.24328
  Num Subdomains: 1
  Num Partitions: 512
  Partitioner: parmetis
Nonlinear System:
  Num DOFs: 22180149
  Num Local DOFs: 53838
  Local DOFs Min: 30033
  Local DOFs Max: 53838
  DOF Ratio: 1.79263
  Variables: { "gr0" "gr1" "gr2" "gr3" "gr4" "gr5" "gr6" "gr7" "gr8" }
  Finite Element Types: "LAGRANGE"
  Approximation Orders: "FIRST"
Auxiliary System:
  Num DOFs: 12064461
  Num Local DOFs: 25562
  Local DOFs Min: 20105
  Local DOFs Max: 25610
  DOF Ratio: 1.27381
  Variables: "bnds" { "unique_grains" "var_indices" "ghost_regions" "halos" }
  Finite Element Types: "LAGRANGE" "MONOMIAL"
  Approximation Orders: "FIRST" "CONSTANT"
```
I could not find the data for 1024 cores, but the node ratio is already 1.79 when using 512 processor cores. This indicates that we should care about the node assignment.
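For concreteness, the "random assignment is actually not terrible" idea mentioned above could look something like the minimal standalone sketch below (plain C++ with illustrative names, not libMesh's actual Partitioner API): every node touched by elements from several processors is handed to a pseudo-randomly chosen one of them, with the random choice seeded by the node id so every rank makes the same decision without extra communication.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <set>

// candidate_procs[node_id] = processor ids owning an element that touches that node.
// (Hypothetical input structure; a real implementation would build this from the mesh.)
std::map<std::size_t, int>
random_node_assignment(const std::map<std::size_t, std::set<int>> & candidate_procs)
{
  std::map<std::size_t, int> node_owner;

  for (const auto & entry : candidate_procs)
    {
      const std::size_t node_id = entry.first;
      const std::set<int> & procs = entry.second;

      // Seed with the node id so every rank that touches this node
      // independently computes the same owner, no communication needed.
      std::mt19937 gen(static_cast<std::mt19937::result_type>(node_id));
      std::uniform_int_distribution<std::size_t> pick(0, procs.size() - 1);

      // Pick uniformly among the processors whose elements touch this node,
      // instead of always handing the node to the lowest-ranked one.
      auto it = procs.begin();
      std::advance(it, pick(gen));
      node_owner[node_id] = *it;
    }

  return node_owner;
}
```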
Are any of our Civet servers set up to do MPI-1 compatibility testing? I'm in a "get rid of legacy support for stuff that was superseded long long ago" mood, but at https://computing.llnl.gov/tutorials/mpi/#LLNL I notice that there are still National Labs supercomputers where an MPI-1 stack is the default!
I can't imagine anyone is still actually using an MPI-1 only cluster...
I would like to see some timing showing that using MPI_PROBE to resize and receive is slower. It is the canonical way to do it. Not only that, but for asynchronous receives there is almost no other option (you wouldn't want to asynchronously receive two messages just to get one).
Anyone seen a good test out there that shows this?
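For reference, the probe-then-receive pattern under discussion looks roughly like the sketch below. It's a minimal, self-contained example with illustrative names (a `std::vector<double>` payload on `MPI_COMM_WORLD`), not libMesh's actual Parallel wrapper code.

```cpp
#include <mpi.h>
#include <vector>

// Probe for an incoming message, query its size, size the buffer to fit,
// then receive into it.
std::vector<double> probe_and_receive(int source, int tag)
{
  MPI_Status status;
  // Block until a matching message is available, without receiving it yet.
  MPI_Probe(source, tag, MPI_COMM_WORLD, &status);

  // Ask how many doubles the pending message contains.
  int count = 0;
  MPI_Get_count(&status, MPI_DOUBLE, &count);

  // Allocate exactly enough space, then do the actual receive.
  std::vector<double> buffer(count);
  MPI_Recv(buffer.data(), count, MPI_DOUBLE, status.MPI_SOURCE,
           status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  return buffer;
}
```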
> I can't imagine anyone is still actually using an MPI-1 only cluster...
I suppose even those LLNL systems still have MPI-2 available with non-default modules.
Would anyone else object to me just dropping MPI-1 support entirely? I could swear we had users of it at one point, but I can no longer recall how - mpich2 and openmpi were at least MPI-2 from the first release, and the mpich-1.2.7 install I built to test #1674 doesn't even let me build a modern PETSc, which screams at configure time when it can't find an mpiCC or mpicxx command.
No problems from me.
Derek
I'm guessing no problems from anyone else either.
It turns out that I broke MPI-1 support in 640da95, via a cut-and-paste error, and I didn't catch it because I accidentally tested against an MPI-2 library when I thought I was testing against MPI-1, and in the year and a half since, nobody else has run into the compile-time bug.
I'll rejigger the MPI-2 feature addition in #1674 and use that PR to get rid of MPI-1 support too.
MPI_Waitany() wants an array of MPI_Request objects, and Request is too heavyweight a shim to make that directly possible, so there's a copy step involved. And doing waitany() in a loop requires getting rid of completed requests from previous loop iterations, which isn't well-suited to our typical vector of requests.
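To make the bookkeeping issue concrete, here's a minimal sketch (plain MPI, illustrative names, not our Request shim) of draining a raw MPI_Request array with MPI_Waitany(): completed entries are set to MPI_REQUEST_NULL by MPI itself, which is the "getting rid of completed requests" step that doesn't map cleanly onto our wrapped vector.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Wait for outstanding requests one at a time, handling each as it finishes.
void drain_requests(std::vector<MPI_Request> & requests)
{
  for (std::size_t n_left = requests.size(); n_left != 0; --n_left)
    {
      int index = MPI_UNDEFINED;
      MPI_Status status;

      // Returns as soon as any one outstanding request completes; the
      // completed slot is replaced with MPI_REQUEST_NULL and skipped on
      // subsequent calls.
      MPI_Waitany(static_cast<int>(requests.size()), requests.data(),
                  &index, &status);

      if (index == MPI_UNDEFINED)
        break; // every entry was already MPI_REQUEST_NULL

      // Process the message that just arrived here, e.g. by inspecting
      // status.MPI_SOURCE and status.MPI_TAG, before waiting again.
    }
}
```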
A few MPI-performance-optimization factoids I've run into over the last week:

- @bboutkov reports that, although #1603 is also a huge win on his cluster, #1600 has exactly the opposite effect there to what it has on the PECOS and INL machines: a slight slowdown rather than a slight speedup! There's also a lot of variance in timings from run to run.
- https://www.clustermonkey.net/MPI/mpi-the-top-ten-mistakes-to-avoid-part-1.html lists "post an MPI_PROBE and then use the size that is returned to allocate a buffer of the correct size and then MPI_RECV into it" as a mistake, saying it often forces allocation of an otherwise unnecessary temporary buffer for the entire message, and recommends our original naive "send a length message first" idiom instead (see the sketch after this list)!
- https://www.clustermonkey.net/MPI/mpi-the-top-ten-mistakes-to-avoid-part-2.html lists our use of MPI_ANY_SOURCE as another common mistake, claiming it can cause thread contention and unnecessary system calls that could be avoided by using MPI_Waitany() instead.
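For comparison with the probe-based receive sketched earlier, here is a minimal sketch (plain MPI calls, illustrative names) of that "send a length message first" idiom: the receiver learns the payload size from a small fixed-size message, so no probe is needed.

```cpp
#include <mpi.h>
#include <vector>

// Sender side: ship the length first, then the payload.
void send_with_length(const std::vector<double> & data, int dest, int tag)
{
  unsigned long len = data.size();
  MPI_Send(&len, 1, MPI_UNSIGNED_LONG, dest, tag, MPI_COMM_WORLD);
  MPI_Send(data.data(), static_cast<int>(len), MPI_DOUBLE, dest, tag,
           MPI_COMM_WORLD);
}

// Receiver side: read the length, size the buffer, then receive the payload.
std::vector<double> receive_with_length(int source, int tag)
{
  unsigned long len = 0;
  MPI_Recv(&len, 1, MPI_UNSIGNED_LONG, source, tag, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);

  std::vector<double> data(len);
  MPI_Recv(data.data(), static_cast<int>(len), MPI_DOUBLE, source, tag,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  return data;
}
```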
I'm not sure any of this means we should be changing our code to accommodate, but I am now sure we need some more systematic way of comparing performance. I'm currently leaning toward using MOOSE, reworking something like test/tests/functions/image_function/threshold_adapt_parallel.i to do a bunch of DistributedMesh adaptivity into some initial condition function, and calling that our "performance standard", but I'm open to other suggestions.